[PATCH 1/3] scsi: introduce scsi_host_busy()

2018-04-19 Thread Ming Lei
This patch introduces the SCSI mid-layer API scsi_host_busy() so that
drivers can read the host-wide counter scsi_host->host_busy.

Cc: Omar Sandoval
Cc: "Martin K. Petersen"
Cc: James Bottomley
Cc: Christoph Hellwig
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Mike Snitzer 
Cc: Hannes Reinecke 
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hosts.c     | 10 ++++++++++
 include/scsi/scsi_host.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 7649d63a1b8d..69beb30205f1 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -565,6 +565,16 @@ struct Scsi_Host *scsi_host_get(struct Scsi_Host *shost)
 EXPORT_SYMBOL(scsi_host_get);
 
 /**
+ * scsi_host_busy - Return the host busy counter
+ * @shost: Pointer to Scsi_Host whose busy counter is returned.
+ **/
+int scsi_host_busy(struct Scsi_Host *shost)
+{
+   return atomic_read(&shost->host_busy);
+}
+EXPORT_SYMBOL(scsi_host_busy);
+
+/**
  * scsi_host_put - dec a Scsi_Host ref count
  * @shost: Pointer to Scsi_Host to dec.
  **/
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 12f454cb6f61..44ab89268f30 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -758,6 +758,7 @@ extern void scsi_scan_host(struct Scsi_Host *);
 extern void scsi_rescan_device(struct device *);
 extern void scsi_remove_host(struct Scsi_Host *);
 extern struct Scsi_Host *scsi_host_get(struct Scsi_Host *);
+extern int scsi_host_busy(struct Scsi_Host *shost);
 extern void scsi_host_put(struct Scsi_Host *t);
 extern struct Scsi_Host *scsi_host_lookup(unsigned short);
 extern const char *scsi_host_state_name(enum scsi_host_state);
-- 
2.9.5



[PATCH 2/3] scsi: read host_busy via scsi_host_busy()

2018-04-19 Thread Ming Lei
No functional change, just replace the direct read of scsi_host->host_busy
with scsi_host_busy().

Cc: Omar Sandoval
Cc: "Martin K. Petersen"
Cc: James Bottomley
Cc: Christoph Hellwig
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Mike Snitzer 
Cc: Hannes Reinecke 
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/advansys.c   | 8 ++++----
 drivers/scsi/libsas/sas_scsi_host.c   | 4 ++--
 drivers/scsi/megaraid/megaraid_sas_base.c | 2 +-
 drivers/scsi/mpt3sas/mpt3sas_base.c   | 4 ++--
 drivers/scsi/qlogicpti.c  | 2 +-
 drivers/scsi/scsi.c   | 2 +-
 drivers/scsi/scsi_error.c | 6 +++---
 drivers/scsi/scsi_sysfs.c | 2 +-
 8 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/scsi/advansys.c b/drivers/scsi/advansys.c
index 24e57e770432..38ee024cbfc7 100644
--- a/drivers/scsi/advansys.c
+++ b/drivers/scsi/advansys.c
@@ -2416,8 +2416,8 @@ static void asc_prt_scsi_host(struct Scsi_Host *s)
struct asc_board *boardp = shost_priv(s);
 
printk("Scsi_Host at addr 0x%p, device %s\n", s, dev_name(boardp->dev));
-   printk(" host_busy %u, host_no %d,\n",
-  atomic_read(&s->host_busy), s->host_no);
+   printk(" host_busy %d, host_no %d,\n",
+  scsi_host_busy(s), s->host_no);
 
printk(" base 0x%lx, io_port 0x%lx, irq %d,\n",
   (ulong)s->base, (ulong)s->io_port, boardp->irq);
@@ -3182,8 +3182,8 @@ static void asc_prt_driver_conf(struct seq_file *m, struct Scsi_Host *shost)
shost->host_no);
 
seq_printf(m,
-  " host_busy %u, max_id %u, max_lun %llu, max_channel %u\n",
-  atomic_read(&shost->host_busy), shost->max_id,
+  " host_busy %d, max_id %u, max_lun %llu, max_channel %u\n",
+  scsi_host_busy(shost), shost->max_id,
   shost->max_lun, shost->max_channel);
 
seq_printf(m,
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index ceab5e5c41c2..33229348dcb6 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -759,7 +759,7 @@ void sas_scsi_recover_host(struct Scsi_Host *shost)
spin_unlock_irq(shost->host_lock);
 
SAS_DPRINTK("Enter %s busy: %d failed: %d\n",
-   __func__, atomic_read(&shost->host_busy), shost->host_failed);
+   __func__, scsi_host_busy(shost), shost->host_failed);
/*
 * Deal with commands that still have SAS tasks (i.e. they didn't
 * complete via the normal sas_task completion mechanism),
@@ -801,7 +801,7 @@ void sas_scsi_recover_host(struct Scsi_Host *shost)
goto retry;
 
SAS_DPRINTK("--- Exit %s: busy: %d failed: %d tries: %d\n",
-   __func__, atomic_read(&shost->host_busy),
+   __func__, scsi_host_busy(shost),
shost->host_failed, tries);
 }
 
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index b89c6e6c0589..008a3bdfa948 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -2822,7 +2822,7 @@ static int megasas_reset_bus_host(struct scsi_cmnd *scmd)
"SCSI command pointer: (%p)\t SCSI host state: %d\t"
" SCSI host busy: %d\t FW outstanding: %d\n",
scmd, scmd->device->host->shost_state,
-   atomic_read((atomic_t *)&scmd->device->host->host_busy),
+   scsi_host_busy(scmd->device->host),
atomic_read(&instance->fw_outstanding));
 
/*
diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 61f93a134956..b7325355df8e 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -3260,7 +3260,7 @@ _base_recovery_check(struct MPT3SAS_ADAPTER *ioc)
 * See _wait_for_commands_to_complete() call with regards to this code.
 */
if (ioc->shost_recovery && ioc->pending_io_count) {
-   ioc->pending_io_count = atomic_read(&ioc->shost->host_busy);
+   ioc->pending_io_count = scsi_host_busy(ioc->shost);
if (ioc->pending_io_count == 0)
wake_up(&ioc->reset_wq);
}
@@ -6658,7 +6658,7 @@ mpt3sas_wait_for_commands_to_complete(struct MPT3SAS_ADAPTER *ioc)
return;
 
/* pending command count */
-   ioc->pending_io_count = atomic_read(&ioc->shost->host_busy);
+   ioc->pending_io_count = scsi_host_busy(ioc->shost);
 
if (!ioc->pending_io_count)
return;
diff --git a/drivers/scsi/qlogicpti.c b/drivers/scsi/qlogicpti.c
index cec9a14982e6..ca685d65 100644
--- a/drivers/scsi/qlogicpti.c
+++ b/drivers/scsi/qlogicpti.c
@@ -959,7 +959,7 @@ static inline void update_can_queue(struct Scsi_Host *host, u_i

[PATCH 0/3] scsi: scsi-mq: don't hold host_busy in IO path

2018-04-19 Thread Ming Lei
Hi,

This patchset removes the expensive atomic operation on the host-wide
counter .host_busy for scsi-mq, and it is observed that IOPS increases by
about 15% with this change in an IO test over scsi_debug.


Ming Lei (3):
  scsi: introduce scsi_host_busy()
  scsi: read host_busy via scsi_host_busy()
  scsi: avoid to hold host-wide counter of host_busy for scsi_mq

 drivers/scsi/advansys.c   |  8 
 drivers/scsi/hosts.c  | 32 +++
 drivers/scsi/libsas/sas_scsi_host.c   |  4 ++--
 drivers/scsi/megaraid/megaraid_sas_base.c |  2 +-
 drivers/scsi/mpt3sas/mpt3sas_base.c   |  4 ++--
 drivers/scsi/qlogicpti.c  |  2 +-
 drivers/scsi/scsi.c   |  2 +-
 drivers/scsi/scsi_error.c |  6 +++---
 drivers/scsi/scsi_lib.c   | 23 --
 drivers/scsi/scsi_sysfs.c |  2 +-
 include/scsi/scsi_host.h  |  1 +
 11 files changed, 65 insertions(+), 21 deletions(-)


Cc: Omar Sandoval
Cc: "Martin K. Petersen"
Cc: James Bottomley
Cc: Christoph Hellwig
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Mike Snitzer 
Cc: Hannes Reinecke 
Cc: Laurence Oberman 

-- 
2.9.5



Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-19 Thread jianchao.wang
Hi Bart

On 04/20/2018 12:43 AM, Bart Van Assche wrote:
> Use the deadline instead of the request generation to detect whether
>   or not a request timer fired after reinitialization of a request

Using the deadline to detect this may not be suitable.

Let's think of the following scenario.

T1/T2: times in nanoseconds
J1/J2: times in jiffies

rq is started at T1, J1
blk_mq_timeout_work
  -> blk_mq_check_expired
    -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)

                              rq is completed and freed
                              rq is allocated and started again at T2
                              rq->__deadline = J2 + MQ_RQ_IN_FLIGHT
  -> synchronize srcu/rcu
  -> blk_mq_terminate_expired
    -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)

1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
   If J2 - J1 < 4 jiffies, we will get the same deadline value.

2. Even if we do some bit shifting when saving and reading the deadline,
   if T2 - T1 < one jiffy, we still get the same deadline.

Thanks
Jianchao


Re: [PATCH] MAINTAINERS: Remove me as maintainer of bcache

2018-04-19 Thread Hannes Reinecke
On 04/19/2018 09:17 PM, Jens Axboe wrote:
> On 4/19/18 1:11 PM, Michael Lyle wrote:
>> On 04/19/2018 11:10 AM, Jens Axboe wrote:
>>> On 4/19/18 11:59 AM, Michael Lyle wrote:
 Too much to do with other projects.  I've enjoyed working with everyone
 here, and hope to occasionally contribute on bcache.
>>
>> Jens---
>>
>>> It's been a pleasure having someone maintain it upstream, so this is
>>> pretty sad.
>>
>> I've really enjoyed working with everyone here.  The queue keeps
>> stacking up and I feel pretty bad when people need to wait a couple
>> weeks for reviews and assistance.
> 
> I don't blame you, and I think it is better to step down if the time
> just isn't there.
> 
>>> Do you have anyone in mind for taking over the upstream part? The
>>> community seems fairly active, and there are regular contributors.
>>>
>>> Coly and Tang Junhui come to mind.
>>
>> I think Coly would be a great maintainer.
> 
> I concur. Coly?
> 
Indeed, that would be perfect.
But I don't want to put any pressure on you, Coly :-)

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re:question about request merge

2018-04-19 Thread Zhengyuan Liu
Hi, Shaohua

I found it indeed doesn't do a front merge when two threads flush their
plug lists concurrently. To reproduce this, I prepared two IO threads,
named a0.io and a1.io.
Thread a1.io uses libaio to write 5 requests:
sectors: 16 + 8, 40 + 8, 64 + 8, 88 + 8, 112 + 8
Thread a0.io uses libaio to write another 5 requests:
sectors: 8 + 8, 32 + 8, 56 + 8, 80 + 8, 104 + 8

To meet the condition that thread a1.io flushes all its requests to the
queue before thread a0.io does, I delay thread a1.io from running the
queue after it has added all its plugged requests, as in the patch below:

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3324,6 +3324,21 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
/*
 * This drops the queue lock
 */
+
+   if (strcmp(current->comm, "a1.io") == 0)
+   {
+   spin_unlock_irq(q->queue_lock); 
+   mdelay(2000);
+   spin_lock_irq(q->queue_lock);   
+   }
 
Then I ran "./a1.io & sleep 1 && ./a0.io" to dispatch the IO and used
blktrace to trace the processing; the result is shown below:

  8,16   1  174   176.733880160  1217  Q  WS 16 + 8 [a1.io]
  8,16   1  175   176.733885300  1217  G  WS 16 + 8 [a1.io]
  8,16   1  177   176.733890820  1217  Q  WS 40 + 8 [a1.io]
  8,16   1  178   176.733892460  1217  G  WS 40 + 8 [a1.io]
  8,16   1  179   176.733896380  1217  Q  WS 64 + 8 [a1.io]
  8,16   1  180   176.733897900  1217  G  WS 64 + 8 [a1.io]
  8,16   1  181   176.733901260  1217  Q  WS 88 + 8 [a1.io]
  8,16   1  182   176.733902760  1217  G  WS 88 + 8 [a1.io]
  8,16   1  183   176.733906200  1217  Q  WS 112 + 8 [a1.io]
  8,16   1  184   176.733907640  1217  G  WS 112 + 8 [a1.io]
  8,16   1  185   176.733909560  1217  I  WS 16 + 8 [a1.io]
  8,16   1  186   176.733925480  1217  I  WS 40 + 8 [a1.io]
  8,16   1  187   176.733936560  1217  I  WS 64 + 8 [a1.io]
  8,16   1  188   176.733945920  1217  I  WS 88 + 8 [a1.io]
  8,16   1  189   176.733953880  1217  I  WS 112 + 8 [a1.io]
  8,16   2  103   176.734317080  1218  Q  WS 8 + 8 [a0.io]
  8,16   2  104   176.734320600  1218  G  WS 8 + 8 [a0.io]
  8,16   2  106   176.734325520  1218  Q  WS 32 + 8 [a0.io]
  8,16   2  107   176.734327140  1218  G  WS 32 + 8 [a0.io]
  8,16   2  108   176.734330620  1218  Q  WS 56 + 8 [a0.io]
  8,16   2  109   176.734332140  1218  G  WS 56 + 8 [a0.io]
  8,16   2  110   176.734335380  1218  Q  WS 80 + 8 [a0.io]
  8,16   2  111   176.734336760  1218  G  WS 80 + 8 [a0.io]
  8,16   2  112   176.734340300  1218  Q  WS 104 + 8 [a0.io]
  8,16   2  113   176.734341900  1218  G  WS 104 + 8 [a0.io]
  8,16   2  114   176.734343520  1218  I  WS 8 + 8 [a0.io]
  8,16   2  115   176.734357280  1218  I  WS 32 + 8 [a0.io]
  8,16   2  116   176.734366560  1218  I  WS 56 + 8 [a0.io]
  8,16   2  117   176.734374800  1218  I  WS 80 + 8 [a0.io]
  8,16   2  118   176.734382460  1218  I  WS 104 + 8 [a0.io]
  8,16   2  120   176.738012960  1218  D  WS 112 + 8 [a0.io]
  8,16   0  151   176.740923200 3  C  WS 112 + 8 [0]
  8,16   2  121   176.741804960  1218  D  WS 88 + 8 [a0.io]
  8,16   2  122   176.741826060  1218  R  WS 88 + 8 [0]
  8,16   2  123   176.741827740  1218  I  WS 88 + 8 [a0.io]
  8,16   0  152   176.745259080 3  D  WS 88 + 8 [ksoftirqd/0]
  8,16   0  153   176.747824380 3  D  WS 64 + 8 [ksoftirqd/0]
  8,16   0  154   176.748921220 3  C  WS 88 + 8 [0]
  8,16   0  155   176.751472640 3  D  WS 40 + 8 [ksoftirqd/0]
  8,16   0  156   176.753172500 3  C  WS 64 + 8 [0]
  8,16   0  157   176.755720740 3  D  WS 16 + 8 [ksoftirqd/0]
  8,16   0  158   176.757920780 3  C  WS 40 + 8 [0]
  8,16   0  159   176.760471320 3  D  WS 104 + 8 [ksoftirqd/0]
  8,16   0  160   176.763172180 3  C  WS 16 + 8 [0]
  8,16   0  161   176.765717560 3  D  WS 80 + 8 [ksoftirqd/0]
  8,16   0  162   176.766795400 3  C  WS 104 + 8 [0]
  8,16   0  163   176.769341020 3  D  WS 56 + 8 [ksoftirqd/0]
  8,16   2  124   176.770926380    18  C  WS 80 + 8 [0]
  8,16   0  164   176.774335640   924  C  WS 56 + 8 [0]
  8,16   2  125   176.774642140    18  D  WS 32 + 8 [ksoftirqd/2]
  8,16   2  126   176.774651300    18  R  WS 32 + 8 [0]
  8,16   2  127   176.774652900    18  I  WS 32 + 8 [ksoftirqd/2]
  8,16   0  165   176.778546360   924  D  WS 32 + 8 [sshd]
  8,16   0  166   176.782498120   924  D  WS 8 + 8 [sshd]
  8,16   0  167   176.785298980 3  C  WS 32 + 8 [0]
  8,16   0  168   176.788671240 3  C  WS 8 + 8 [0]

We can see fr

Re: [PATCH] MAINTAINERS: Remove me as maintainer of bcache

2018-04-19 Thread Jens Axboe
On 4/19/18 7:25 PM, Coly Li wrote:
> On 2018/4/20 3:17 AM, Jens Axboe wrote:
>> On 4/19/18 1:11 PM, Michael Lyle wrote:
>>> On 04/19/2018 11:10 AM, Jens Axboe wrote:
 On 4/19/18 11:59 AM, Michael Lyle wrote:
> Too much to do with other projects.  I've enjoyed working with everyone
> here, and hope to occasionally contribute on bcache.
>>>
>>> Jens---
>>>
 It's been a pleasure having someone maintain it upstream, so this is
 pretty sad.
>>>
>>> I've really enjoyed working with everyone here.  The queue keeps
>>> stacking up and I feel pretty bad when people need to wait a couple
>>> weeks for reviews and assistance.
>>
>> I don't blame you, and I think it is better to step down if the time
>> just isn't there.
>>
 Do you have anyone in mind for taking over the upstream part? The
 community seems fairly active, and there are regular contributors.

 Coly and Tang Junhui come to mind.
>>>
>>> I think Coly would be a great maintainer.
>>
>> I concur. Coly?
> 
> Hi Michael and Jens,
> 
> I am glad to take the bcache maintenance. My employer is supportive for
> my bcache activities, and I am able to be stable and persistent on
> bcache project.
> 
> Thanks for your nomination.

Great news, I'll mark you as such.

-- 
Jens Axboe



Re: Change device block count from userspace?

2018-04-19 Thread Martin K. Petersen

Manuel,

> It's not a kernel fault, the drive is broken.

Interesting, OK. Just wanted to make sure we didn't have a regression. I
tweaked the capacity reading code recently.

> Still, is there a userspace option (sysfs attribute perhaps) to limit
> disk capacity to a certain block count?

I'm afraid not.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] MAINTAINERS: Remove me as maintainer of bcache

2018-04-19 Thread Coly Li
On 2018/4/20 3:17 AM, Jens Axboe wrote:
> On 4/19/18 1:11 PM, Michael Lyle wrote:
>> On 04/19/2018 11:10 AM, Jens Axboe wrote:
>>> On 4/19/18 11:59 AM, Michael Lyle wrote:
 Too much to do with other projects.  I've enjoyed working with everyone
 here, and hope to occasionally contribute on bcache.
>>
>> Jens---
>>
>>> It's been a pleasure having someone maintain it upstream, so this is
>>> pretty sad.
>>
>> I've really enjoyed working with everyone here.  The queue keeps
>> stacking up and I feel pretty bad when people need to wait a couple
>> weeks for reviews and assistance.
> 
> I don't blame you, and I think it is better to step down if the time
> just isn't there.
> 
>>> Do you have anyone in mind for taking over the upstream part? The
>>> community seems fairly active, and there are regular contributors.
>>>
>>> Coly and Tang Junhui come to mind.
>>
>> I think Coly would be a great maintainer.
> 
> I concur. Coly?

Hi Michael and Jens,

I am glad to take the bcache maintenance. My employer is supportive for
my bcache activities, and I am able to be stable and persistent on
bcache project.

Thanks for your nomination.

Coly Li


Re: [LSF/MM] Ride sharing

2018-04-19 Thread Dave Chinner
On Thu, Apr 19, 2018 at 01:34:32PM -0700, Matthew Wilcox wrote:
> I hate renting unnecessary cars, and the various transportation companies
> offer a better deal if multiple people book at once.
> 
> I'm scheduled to arrive on Sunday at 3:18pm local time if anyone wants to
> share transport. Does anyone have a wiki we can use to coordinate this?

Arriving 4.15pm sunday, so if you want to wait around for a bit I'm
happy to share...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Al Viro
On Thu, Apr 19, 2018 at 05:47:51PM -0400, Jerome Glisse wrote:
> On Thu, Apr 19, 2018 at 10:21:37PM +0100, Al Viro wrote:
> > On Thu, Apr 19, 2018 at 04:58:20PM -0400, Jerome Glisse wrote:
> > 
> > > I need a struct to link part of device context with mm struct for a
> > > process. Most of device context is link to the struct file of the
> > > device file (ie process open has a file descriptor for the device
> > > file).
> > 
> > Er...  You do realize that
> > fd = open(...)
> > mmap(fd, ...)
> > close(fd)
> > is absolutely legitimate, right?  IOW, existence/stability/etc. of
> > a file descriptor is absolutely *not* guaranteed - in fact, there might
> > be not a single file descriptor referring to a given openen and mmaped
> > file.
> 
> Yes and that's fine, on close(fd) the device driver is tear down

No, it is not.  _NOTHING_ is done on that close(fd), other than removing
a reference from descriptor table.  In this case struct file is still
open and remains such until munmap().

That's why descriptor table is a very bad place for sticking that kind of
information.  Besides, as soon as your syscall (ioctl, write, whatever)
has looked struct file up, the mapping from descriptors to struct file
can change.  Literally before fdget() has returned to caller.  Another
thread could do dup() and close() of the original descriptor.  Or
just plain close(), for that matter - struct file will remain open until
fdput().

> and
> struct i want to store is tear down too and free.

So do a hash table indexed by pair (void *, struct mm_struct *) and
do lookups there...  And use radeon_device as the first half of the
key.  Or struct file *, or pointer to whatever private data you maintain
for an opened file...


Re: [PATCH] bsg referencing bus driver module

2018-04-19 Thread Anatoliy Glagolev
Updated: rebased on recent Linux, cc-ed maintainers per instructions
in MAINTAINERS file

From df939b80d02bf37b21efaaef8ede86cfd39b0cb8 Mon Sep 17 00:00:00 2001
From: Anatoliy Glagolev 
Date: Thu, 19 Apr 2018 15:06:06 -0600
Subject: [PATCH] bsg referencing parent module

Fixing a bug when bsg layer holds the last reference to device
when the device's module has been unloaded. Upon dropping the
reference the device's release function may touch memory of
the unloaded module.
---
 block/bsg-lib.c  | 24 ++--
 block/bsg.c  | 29 ++---
 drivers/scsi/scsi_transport_fc.c |  8 ++--
 include/linux/bsg-lib.h  |  4 
 include/linux/bsg.h  |  5 +
 5 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index fc2e5ff..90c28fd 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -309,6 +309,25 @@ struct request_queue *bsg_setup_queue(struct
device *dev, const char *name,
  bsg_job_fn *job_fn, int dd_job_size,
  void (*release)(struct device *))
 {
+ return bsg_setup_queue_ex(dev, name, job_fn, dd_job_size, release,
+ NULL);
+}
+EXPORT_SYMBOL_GPL(bsg_setup_queue);
+
+/**
+ * bsg_setup_queue_ex - Create and add the bsg hooks so we can receive requests
+ * @dev: device to attach bsg device to
+ * @name: device to give bsg device
+ * @job_fn: bsg job handler
+ * @dd_job_size: size of LLD data needed for each job
+ * @release: @dev release function
+ * @dev_module: @dev's module
+ */
+struct request_queue *bsg_setup_queue_ex(struct device *dev, const char *name,
+ bsg_job_fn *job_fn, int dd_job_size,
+ void (*release)(struct device *),
+ struct module *dev_module)
+{
  struct request_queue *q;
  int ret;

@@ -331,7 +350,8 @@ struct request_queue *bsg_setup_queue(struct
device *dev, const char *name,
  blk_queue_softirq_done(q, bsg_softirq_done);
  blk_queue_rq_timeout(q, BLK_DEFAULT_SG_TIMEOUT);

- ret = bsg_register_queue(q, dev, name, &bsg_transport_ops, release);
+ ret = bsg_register_queue_ex(q, dev, name, &bsg_transport_ops, release,
+ dev_module);
  if (ret) {
  printk(KERN_ERR "%s: bsg interface failed to "
"initialize - register queue\n", dev->kobj.name);
@@ -343,4 +363,4 @@ struct request_queue *bsg_setup_queue(struct
device *dev, const char *name,
  blk_cleanup_queue(q);
  return ERR_PTR(ret);
 }
-EXPORT_SYMBOL_GPL(bsg_setup_queue);
+EXPORT_SYMBOL_GPL(bsg_setup_queue_ex);
diff --git a/block/bsg.c b/block/bsg.c
index defa06c..6920c5b 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -750,7 +750,8 @@ static struct bsg_device *__bsg_get_device(int
minor, struct request_queue *q)
  return bd;
 }

-static struct bsg_device *bsg_get_device(struct inode *inode, struct
file *file)
+static struct bsg_device *bsg_get_device(struct inode *inode, struct
file *file,
+ struct bsg_class_device **pbcd)
 {
  struct bsg_device *bd;
  struct bsg_class_device *bcd;
@@ -766,6 +767,7 @@ static struct bsg_device *bsg_get_device(struct
inode *inode, struct file *file)

  if (!bcd)
  return ERR_PTR(-ENODEV);
+ *pbcd = bcd;

  bd = __bsg_get_device(iminor(inode), bcd->queue);
  if (bd)
@@ -781,22 +783,34 @@ static struct bsg_device *bsg_get_device(struct
inode *inode, struct file *file)
 static int bsg_open(struct inode *inode, struct file *file)
 {
  struct bsg_device *bd;
+ struct bsg_class_device *bcd;

- bd = bsg_get_device(inode, file);
+ bd = bsg_get_device(inode, file, &bcd);

  if (IS_ERR(bd))
  return PTR_ERR(bd);

  file->private_data = bd;
+ if (bcd->parent_module) {
+ if (!try_module_get(bcd->parent_module)) {
+ bsg_put_device(bd);
+ return -ENODEV;
+ }
+ }
  return 0;
 }

 static int bsg_release(struct inode *inode, struct file *file)
 {
+ int ret;
  struct bsg_device *bd = file->private_data;
+ struct module *parent_module = bd->queue->bsg_dev.parent_module;

  file->private_data = NULL;
- return bsg_put_device(bd);
+ ret = bsg_put_device(bd);
+ if (parent_module)
+ module_put(parent_module);
+ return ret;
 }

 static __poll_t bsg_poll(struct file *file, poll_table *wait)
@@ -922,6 +936,14 @@ int bsg_register_queue(struct request_queue *q,
struct device *parent,
  const char *name, const struct bsg_ops *ops,
  void (*release)(struct device *))
 {
+ return bsg_register_queue_ex(q, parent, name, ops, release, NULL);
+}
+
+int bsg_register_queue_ex(struct request_queue *q, struct device *parent,
+ const char *name, const struct bsg_ops *ops,
+ void (*release)(struct device *),
+ struct module *parent_module)
+{
  struct bsg_class_device *bcd;
  dev_t dev;
  int ret;
@@ -958,6 +980,7 @@ int bsg_register_queue(struct request_queue *q,
struct device *parent,
  bcd->parent = get_device(parent);
  bcd->release = release;
  bcd->ops = ops;
+ bcd->parent_module = parent_module;
  kref_init(&bcd->ref);
  dev = MKDEV(bsg_major, bcd->minor);
  class_dev = device_create(bsg_class, parent, dev, NULL, "%s", devname);
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_

Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 10:21:37PM +0100, Al Viro wrote:
> On Thu, Apr 19, 2018 at 04:58:20PM -0400, Jerome Glisse wrote:
> 
> > I need a struct to link part of device context with mm struct for a
> > process. Most of device context is link to the struct file of the
> > device file (ie process open has a file descriptor for the device
> > file).
> 
> Er...  You do realize that
>   fd = open(...)
>   mmap(fd, ...)
>   close(fd)
> is absolutely legitimate, right?  IOW, existence/stability/etc. of
> a file descriptor is absolutely *not* guaranteed - in fact, there might
> be not a single file descriptor referring to a given openen and mmaped
> file.

Yes, and that's fine: on close(fd) the device driver context is torn down,
and the struct I want to store is torn down and freed too.

> 
> > Device driver for GPU have some part of their process context tied to
> > the process mm (accessing process address space directly from the GPU).
> > However we can not store this context information in the struct file
> > private data because of clone (same struct file accross different mm).
> > 
> > So today driver have an hashtable in their global device structure to
> > lookup context information for a given mm. This is sub-optimal and
> > duplicate a lot of code among different drivers.
> 
> Umm...  Examples?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_mn.c
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/i915/i915_gem_userptr.c
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c

RDMA folks too have similar construct.

> 
> > Hence why i want something generic that allow a device driver to store
> > context structure that is specific to a mm. I thought that adding a
> > new array on the side of struct file array would be a good idea but it
> > has too many kludges.
> > 
> > So i will do something inside mmu_notifier and there will be no tie to
> > any fs aspect. I expect only a handful of driver to care about this and
> > for a given platform you won't see that many devices hence you won't
> > have that many pointer to deal with.
> 
> Let's step back for a second - lookups by _what_?  If you are associating
> somethin with a mapping, vm_area_struct would be a natural candidate for
> storing such data, wouldn't it?
> 
> What do you have and what do you want to find?

So you are in an ioctl against the device file: you have the struct file,
and the driver stores a pointer to some file context info in the struct
file private data, which itself has a pointer to some global device driver
structure, which itself has a pointer to the struct device.

Hence I have the struct mm (from current->mm) and the dev_t easily available.

The context information is tied to the mm for the device and can only be
used against said mm. Even if the struct file of the device outlives the
original process, no one can use that struct with a process that does not
have the same mm. Moreover, that struct is freed if the mm is destroyed.

If a child shares the struct file but has a different mm and wants to use
the same feature, then a new structure is created with the same property,
i.e. it can only be used against this new mm.

The link with the struct file is not explicit, but you can only use objects
tied to that struct through ioctls against the struct file.

Hope this clarifies the use case.

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Al Viro
On Thu, Apr 19, 2018 at 04:58:20PM -0400, Jerome Glisse wrote:

> I need a struct to link part of device context with mm struct for a
> process. Most of device context is link to the struct file of the
> device file (ie process open has a file descriptor for the device
> file).

Er...  You do realize that
fd = open(...)
mmap(fd, ...)
close(fd)
is absolutely legitimate, right?  IOW, existence/stability/etc. of
a file descriptor is absolutely *not* guaranteed - in fact, there might
be not a single file descriptor referring to a given opened and mmapped
file.

> Device driver for GPU have some part of their process context tied to
> the process mm (accessing process address space directly from the GPU).
> However we can not store this context information in the struct file
> private data because of clone (same struct file accross different mm).
> 
> So today driver have an hashtable in their global device structure to
> lookup context information for a given mm. This is sub-optimal and
> duplicate a lot of code among different drivers.

Umm...  Examples?

> Hence why i want something generic that allow a device driver to store
> context structure that is specific to a mm. I thought that adding a
> new array on the side of struct file array would be a good idea but it
> has too many kludges.
> 
> So i will do something inside mmu_notifier and there will be no tie to
> any fs aspect. I expect only a handful of driver to care about this and
> for a given platform you won't see that many devices hence you won't
> have that many pointer to deal with.

Let's step back for a second - lookups by _what_?  If you are associating
somethin with a mapping, vm_area_struct would be a natural candidate for
storing such data, wouldn't it?

What do you have and what do you want to find?


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 09:39:53PM +0100, Al Viro wrote:
> On Thu, Apr 19, 2018 at 01:25:13PM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 19, 2018 at 04:15:02PM -0400, Jerome Glisse wrote:
> > > On Thu, Apr 19, 2018 at 12:56:37PM -0700, Matthew Wilcox wrote:
> > > > > Well scratch that whole idea, i would need to add a new array to task
> > > > > struct which make it a lot less appealing. Hence a better solution is
> > > > > to instead have this as part of mm (well indirectly).
> > > > 
> > > > It shouldn't be too bad to add a struct radix_tree to the fdtable.
> > > > 
> > > > I'm sure we could just not support weird cases like sharing the fdtable
> > > > without sharing the mm.  Does anyone actually do that?
> > > 
> > > Well like you pointed out what i really want is a 1:1 structure linking
> > > a device struct an a mm_struct. Given that this need to be cleanup when
> > > mm goes away hence tying this to mmu_notifier sounds like a better idea.
> > > 
> > > I am thinking of adding a hashtable to mmu_notifier_mm using file id for
> > > hash as this should be a good hash value for common cases. I only expect
> > > few drivers to need that (GPU drivers, RDMA). Today GPU drivers do have
> > > a hashtable inside their driver and they has on the mm struct pointer,
> > > i believe hash mmu_notifier_mm using file id will be better.
> > 
> > file descriptors are small positive integers ...
> 
> ... except when there's a lot of them.  Or when something uses dup2() in
> interesting ways, but hey - we could "just not support" that, right?
> 
> > ideal for the radix tree.
> > If you need to find your data based on the struct file address, then by
> > all means a hashtable is the better data structure.
> 
> Perhaps it would be a good idea to describe whatever is being attempted?
> 
> FWIW, passing around descriptors is almost always a bloody bad idea.  There
> are very few things really associated with those and just about every time
> I'd seen internal APIs that work in terms of those "small positive numbers"
> they had been badly racy and required massive redesign to get something even
> remotely sane.

OK, I will use the struct device pointer as the index, or something else
(I would like to use the PCI domain:bus:slot, but I don't want this to be
PCIe-only), maybe dev_t ...

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 09:33:07PM +0100, Al Viro wrote:
> On Thu, Apr 19, 2018 at 01:26:10PM -0400, Jerome Glisse wrote:
> 
> > Basicly i want a callback in __fd_install(), do_dup2(), dup_fd() and
> > add void * *private_data; to struct fdtable (also a default array to
> > struct files_struct). The callback would be part of struct file_operations.
> > and only call if it exist (os overhead is only for device driver that
> > care).
> 
> Hell, *NO*.  This is insane - you would need to maintain extra counts
> ("how many descriptors refer to this struct file... for this descriptor
> table").
> 
> Besides, _what_ private_data?  What would own and maintain it?  A specific
> driver?  What if more than one of them wants that thing?

I didn't have anything complex in mind (i.e. a lifetime tied to the struct
file and no refcounting changes). But anyway, I gave up on that idea and will
add what I need in mmu_notifier.

>  
> > Did i miss something fundamental ? copy_files() call dup_fd() so i
> > should be all set here.
> 
> That looks like an extremely misguided kludge for hell knows what purpose,
> almost certainly architecturally insane.  What are you actually trying to
> achieve?

I need a struct that links part of the device context to the mm struct of a
process. Most of the device context is tied to the struct file of the
device file (i.e. the process has the device file open through a file
descriptor).

GPU device drivers have part of their per-process context tied to the
process mm (the GPU accesses the process address space directly).
However, we cannot store this context information in the struct file's
private data because of clone() (the same struct file can end up shared
across different mms).

So today each driver has a hashtable in its global device structure to
look up the context information for a given mm. This is sub-optimal and
duplicates a lot of code among different drivers.

Hence I want something generic that allows a device driver to store a
context structure specific to an mm. I thought adding a new array
alongside the struct file array would be a good idea, but it has too
many kludges.

So I will do something inside mmu_notifier, with no tie to any fs aspect.
I expect only a handful of drivers to care about this, and on a given
platform you won't see that many devices, so you won't have that many
pointers to deal with.

Cheers,
Jérôme


Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events

2018-04-19 Thread Bart Van Assche
On Thu, 2018-04-19 at 12:24 -0700, Omar Sandoval wrote:
> We quiesce and freeze the queue before tearing it down, so it won't be
> NULL while we're still doing I/O. Cc'ing Bart in case I'm lying to you,
> though ;)

blk_cleanup_queue() waits until all I/O requests have finished. Since the
block layer tracing code is triggered from inside the code that processes a
request, it is safe to access the request pointer from inside the tracing
code. But I think the question was about the parent of the request queue kobj
rather than about the request queue pointer itself, that is, the device
structure embedded in struct gendisk. I think that parent can disappear at
any time: most block drivers call del_gendisk() before they call
blk_cleanup_queue(). Unless I'm overlooking something, the request queue will
need to obtain a reference to the gendisk device from inside
blk_register_queue() and drop that reference from inside blk_cleanup_queue()
to make it safe for the tracing code to access struct gendisk.

Bart.





Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Al Viro
On Thu, Apr 19, 2018 at 04:15:02PM -0400, Jerome Glisse wrote:

> Well like you pointed out what i really want is a 1:1 structure linking
> a device struct an a mm_struct. Given that this need to be cleanup when
> mm goes away hence tying this to mmu_notifier sounds like a better idea.
> 
> I am thinking of adding a hashtable to mmu_notifier_mm using file id for
> hash as this should be a good hash value for common cases. I only expect
> few drivers to need that (GPU drivers, RDMA). Today GPU drivers do have
> a hashtable inside their driver and they has on the mm struct pointer,
> i believe hash mmu_notifier_mm using file id will be better.

What _is_ "file id"?  If you are talking about file descriptors, you can
very well have several for the same opened file.  Moreover, you can
bloody well have it opened, then dup'ed, then original descriptor closed
and reused by another open...


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Al Viro
On Thu, Apr 19, 2018 at 01:25:13PM -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 04:15:02PM -0400, Jerome Glisse wrote:
> > On Thu, Apr 19, 2018 at 12:56:37PM -0700, Matthew Wilcox wrote:
> > > > Well scratch that whole idea, i would need to add a new array to task
> > > > struct which make it a lot less appealing. Hence a better solution is
> > > > to instead have this as part of mm (well indirectly).
> > > 
> > > It shouldn't be too bad to add a struct radix_tree to the fdtable.
> > > 
> > > I'm sure we could just not support weird cases like sharing the fdtable
> > > without sharing the mm.  Does anyone actually do that?
> > 
> > Well like you pointed out what i really want is a 1:1 structure linking
> > a device struct an a mm_struct. Given that this need to be cleanup when
> > mm goes away hence tying this to mmu_notifier sounds like a better idea.
> > 
> > I am thinking of adding a hashtable to mmu_notifier_mm using file id for
> > hash as this should be a good hash value for common cases. I only expect
> > few drivers to need that (GPU drivers, RDMA). Today GPU drivers do have
> > a hashtable inside their driver and they has on the mm struct pointer,
> > i believe hash mmu_notifier_mm using file id will be better.
> 
> file descriptors are small positive integers ...

... except when there's a lot of them.  Or when something uses dup2() in
interesting ways, but hey - we could "just not support" that, right?

> ideal for the radix tree.
> If you need to find your data based on the struct file address, then by
> all means a hashtable is the better data structure.

Perhaps it would be a good idea to describe whatever is being attempted?

FWIW, passing around descriptors is almost always a bloody bad idea.  There
are very few things really associated with those and just about every time
I'd seen internal APIs that work in terms of those "small positive numbers"
they had been badly racy and required massive redesign to get something even
remotely sane.


[LSF/MM] Ride sharing

2018-04-19 Thread Matthew Wilcox
I hate renting unnecessary cars, and the various transportation companies
offer a better deal if multiple people book at once.

I'm scheduled to arrive on Sunday at 3:18pm local time if anyone wants to
share transport.  Does anyone have a wiki we can use to coordinate this?



Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Al Viro
On Thu, Apr 19, 2018 at 01:26:10PM -0400, Jerome Glisse wrote:

> Basicly i want a callback in __fd_install(), do_dup2(), dup_fd() and
> add void * *private_data; to struct fdtable (also a default array to
> struct files_struct). The callback would be part of struct file_operations.
> and only call if it exist (os overhead is only for device driver that
> care).

Hell, *NO*.  This is insane - you would need to maintain extra counts
("how many descriptors refer to this struct file... for this descriptor
table").

Besides, _what_ private_data?  What would own and maintain it?  A specific
driver?  What if more than one of them wants that thing?
 
> Did i miss something fundamental ? copy_files() call dup_fd() so i
> should be all set here.

That looks like an extremely misguided kludge for hell knows what purpose,
almost certainly architecturally insane.  What are you actually trying to
achieve?


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Matthew Wilcox
On Thu, Apr 19, 2018 at 04:15:02PM -0400, Jerome Glisse wrote:
> On Thu, Apr 19, 2018 at 12:56:37PM -0700, Matthew Wilcox wrote:
> > > Well scratch that whole idea, i would need to add a new array to task
> > > struct which make it a lot less appealing. Hence a better solution is
> > > to instead have this as part of mm (well indirectly).
> > 
> > It shouldn't be too bad to add a struct radix_tree to the fdtable.
> > 
> > I'm sure we could just not support weird cases like sharing the fdtable
> > without sharing the mm.  Does anyone actually do that?
> 
> Well like you pointed out what i really want is a 1:1 structure linking
> a device struct an a mm_struct. Given that this need to be cleanup when
> mm goes away hence tying this to mmu_notifier sounds like a better idea.
> 
> I am thinking of adding a hashtable to mmu_notifier_mm using file id for
> hash as this should be a good hash value for common cases. I only expect
> few drivers to need that (GPU drivers, RDMA). Today GPU drivers do have
> a hashtable inside their driver and they has on the mm struct pointer,
> i believe hash mmu_notifier_mm using file id will be better.

file descriptors are small positive integers ... ideal for the radix tree.
If you need to find your data based on the struct file address, then by
all means a hashtable is the better data structure.


Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-19 Thread Omar Sandoval
On Thu, Apr 19, 2018 at 01:44:41PM -0600, Jens Axboe wrote:
> On 4/19/18 1:41 PM, Bart Van Assche wrote:
> > On Thu, 2018-04-19 at 12:13 -0700, Omar Sandoval wrote:
> >> On Thu, Apr 19, 2018 at 11:53:30AM -0700, Omar Sandoval wrote:
> >>> Thanks for the test! Applied.
> >>
> >> Side note, it's unfortunate that this test takes 180 seconds to run only
> >> because we have to wait for the command timeout. We should be able to
> >> export request_queue->rq_timeout writeable in sysfs. Would you be
> >> interested in doing that?
> > 
> > Hello Omar,
> > 
> > Is this perhaps what you are looking for?
> > # ls -l /sys/class/scsi_device/*/*/timeout 
> > -rw-r--r-- 1 root root 4096 Apr 19 08:52 
> > /sys/class/scsi_device/2:0:0:0/device/timeout
> > -rw-r--r-- 1 root root 4096 Apr 19 12:39 
> > /sys/class/scsi_device/8:0:0:1/device/timeout
> 
> We should have it generically available though, not just for SCSI. In
> retrospect, it should have been under queue/ from the start, now we'll
> end up with duplicate entries for SCSI.

For the sake of this test, I just decreased the timeout through SCSI.


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 12:56:37PM -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 03:31:08PM -0400, Jerome Glisse wrote:
> > > > Basicly i want a callback in __fd_install(), do_dup2(), dup_fd() and
> > > > add void * *private_data; to struct fdtable (also a default array to
> > > > struct files_struct). The callback would be part of struct 
> > > > file_operations.
> > > > and only call if it exist (os overhead is only for device driver that
> > > > care).
> > > > 
> > > > Did i miss something fundamental ? copy_files() call dup_fd() so i
> > > > should be all set here.
> > > > 
> > > > I will work on patches i was hoping this would not be too much work.
> > 
> > Well scratch that whole idea, i would need to add a new array to task
> > struct which make it a lot less appealing. Hence a better solution is
> > to instead have this as part of mm (well indirectly).
> 
> It shouldn't be too bad to add a struct radix_tree to the fdtable.
> 
> I'm sure we could just not support weird cases like sharing the fdtable
> without sharing the mm.  Does anyone actually do that?

Well, like you pointed out, what I really want is a 1:1 structure linking
a device struct and an mm_struct. Given that this needs to be cleaned up
when the mm goes away, tying it to mmu_notifier sounds like a better idea.

I am thinking of adding a hashtable to mmu_notifier_mm keyed by file id,
as this should be a good hash value for the common cases. I only expect a
few drivers to need this (GPU drivers, RDMA). Today GPU drivers have a
hashtable inside the driver, hashed on the mm struct pointer; I believe
hashing mmu_notifier_mm by file id will be better.

Jérôme


Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-19 Thread Jens Axboe
On 4/19/18 1:13 PM, Omar Sandoval wrote:
> On Thu, Apr 19, 2018 at 11:53:30AM -0700, Omar Sandoval wrote:
>> Thanks for the test! Applied.
> 
> Side note, it's unfortunate that this test takes 180 seconds to run only
> because we have to wait for the command timeout. We should be able to
> export request_queue->rq_timeout writeable in sysfs. Would you be
> interested in doing that?

Here's one, I even tested it.


diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d00d1b0ec109..d5d628c3c12d 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -496,6 +496,27 @@ static ssize_t queue_dax_show(struct request_queue *q, 
char *page)
return queue_var_show(blk_queue_dax(q), page);
 }
 
+static ssize_t queue_timeout_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%u\n", q->rq_timeout / HZ);
+}
+
+static ssize_t queue_timeout_store(struct request_queue *q, const char *page,
+  size_t count)
+{
+   unsigned long timeout;
+   ssize_t ret;
+
+   ret = queue_var_store(&timeout, page, count);
+   if (ret < 0)
+   return ret;
+   if (!timeout)
+   return -EINVAL;
+
+   blk_queue_rq_timeout(q, timeout * HZ);
+   return count;
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
@@ -678,6 +699,12 @@ static struct queue_sysfs_entry throtl_sample_time_entry = 
{
 };
 #endif
 
+static struct queue_sysfs_entry queue_timeout_entry = {
+   .attr = {.name = "timeout", .mode = S_IRUGO | S_IWUSR },
+   .show = queue_timeout_show,
+   .store = queue_timeout_store,
+};
+
 static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -714,6 +741,7 @@ static struct attribute *default_attrs[] = {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
&throtl_sample_time_entry.attr,
 #endif
+   &queue_timeout_entry.attr,
NULL,
 };
 

-- 
Jens Axboe



Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Matthew Wilcox
On Thu, Apr 19, 2018 at 03:31:08PM -0400, Jerome Glisse wrote:
> > > Basicly i want a callback in __fd_install(), do_dup2(), dup_fd() and
> > > add void * *private_data; to struct fdtable (also a default array to
> > > struct files_struct). The callback would be part of struct 
> > > file_operations.
> > > and only call if it exist (os overhead is only for device driver that
> > > care).
> > > 
> > > Did i miss something fundamental ? copy_files() call dup_fd() so i
> > > should be all set here.
> > > 
> > > I will work on patches i was hoping this would not be too much work.
> 
> Well scratch that whole idea, i would need to add a new array to task
> struct which make it a lot less appealing. Hence a better solution is
> to instead have this as part of mm (well indirectly).

It shouldn't be too bad to add a struct radix_tree to the fdtable.

I'm sure we could just not support weird cases like sharing the fdtable
without sharing the mm.  Does anyone actually do that?


Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-19 Thread Jens Axboe
On 4/19/18 1:41 PM, Bart Van Assche wrote:
> On Thu, 2018-04-19 at 12:13 -0700, Omar Sandoval wrote:
>> On Thu, Apr 19, 2018 at 11:53:30AM -0700, Omar Sandoval wrote:
>>> Thanks for the test! Applied.
>>
>> Side note, it's unfortunate that this test takes 180 seconds to run only
>> because we have to wait for the command timeout. We should be able to
>> export request_queue->rq_timeout writeable in sysfs. Would you be
>> interested in doing that?
> 
> Hello Omar,
> 
> Is this perhaps what you are looking for?
> # ls -l /sys/class/scsi_device/*/*/timeout 
> -rw-r--r-- 1 root root 4096 Apr 19 08:52 
> /sys/class/scsi_device/2:0:0:0/device/timeout
> -rw-r--r-- 1 root root 4096 Apr 19 12:39 
> /sys/class/scsi_device/8:0:0:1/device/timeout

We should have it generically available though, not just for SCSI. In
retrospect, it should have been under queue/ from the start; now we'll
end up with duplicate entries for SCSI.

-- 
Jens Axboe



Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-19 Thread Bart Van Assche
On Thu, 2018-04-19 at 12:13 -0700, Omar Sandoval wrote:
> On Thu, Apr 19, 2018 at 11:53:30AM -0700, Omar Sandoval wrote:
> > Thanks for the test! Applied.
> 
> Side note, it's unfortunate that this test takes 180 seconds to run only
> because we have to wait for the command timeout. We should be able to
> export request_queue->rq_timeout writeable in sysfs. Would you be
> interested in doing that?

Hello Omar,

Is this perhaps what you are looking for?
# ls -l /sys/class/scsi_device/*/*/timeout 
-rw-r--r-- 1 root root 4096 Apr 19 08:52 
/sys/class/scsi_device/2:0:0:0/device/timeout
-rw-r--r-- 1 root root 4096 Apr 19 12:39 
/sys/class/scsi_device/8:0:0:1/device/timeout

Thanks,

Bart.





Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 02:31:07PM -0400, Jeff Layton wrote:
> On Thu, 2018-04-19 at 13:26 -0400, Jerome Glisse wrote:
> > On Thu, Apr 19, 2018 at 12:58:39PM -0400, Jeff Layton wrote:
> > > On Thu, 2018-04-19 at 12:30 -0400, Jerome Glisse wrote:
> > > > On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> > > > > On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > > > > > Oh can i get one more small slot for fs ? I want to ask if they are
> > > > > > any people against having a callback everytime a struct file is 
> > > > > > added
> > > > > > to a task_struct and also having a secondary array so that special
> > > > > > file like device file can store something opaque per task_struct per
> > > > > > struct file.
> > > > > 
> > > > > Do you really want something per _thread_, and not per _mm_?
> > > > 
> > > > Well per mm would be fine but i do not see how to make that happen with
> > > > reasonable structure. So issue is that you can have multiple task with
> > > > same mm but different file descriptors (or am i wrong here ?) thus there
> > > > would be no easy way given a struct file to lookup the per mm struct.
> > > > 
> > > > So as a not perfect solution i see a new array in filedes which would
> > > > allow device driver to store a pointer to their per mm data structure.
> > > > To be fair usualy you will only have a single fd in a single task for
> > > > a given device.
> > > > 
> > > > If you see an easy way to get a per mm per inode pointer store somewhere
> > > > with easy lookup i am all ears :)
> > > > 
> > > 
> > > I may be misunderstanding, but to be clear: struct files don't get
> > > added to a thread, per-se.
> > > 
> > > When userland calls open() or similar, the struct file gets added to
> > > the files_struct. Those are generally shared with other threads within
> > > the same process. The files_struct can also be shared with other
> > > processes if you clone() with the right flags.
> > > 
> > > Doing something per-thread on every open may be rather difficult to do.
> > 
> > Basicly i want a callback in __fd_install(), do_dup2(), dup_fd() and
> > add void * *private_data; to struct fdtable (also a default array to
> > struct files_struct). The callback would be part of struct file_operations.
> > and only call if it exist (os overhead is only for device driver that
> > care).
> > 
> > Did i miss something fundamental ? copy_files() call dup_fd() so i
> > should be all set here.
> > 
> > I will work on patches i was hoping this would not be too much work.
> > 
> 
> No, I think I misunderstood. I was thinking you wanted to iterate over
> all of the threads that might be associated with a struct file, and
> that's rather non-trivial.
> 
> A callback when you add a file to the files_struct seems like it would
> probably be OK (in principle).

Well, scratch that whole idea; I would need to add a new array to
task_struct, which makes it a lot less appealing. Hence a better solution is
to have this as part of the mm instead (well, indirectly).

Thanks folks for chiming in. I will discuss this in my mmu_notifier
session.

Cheers,
Jérôme


Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events

2018-04-19 Thread Omar Sandoval
On Mon, Apr 16, 2018 at 06:33:27PM +0200, Steffen Maier wrote:
> Hi Greg,
> 
> On 04/15/2018 10:31 AM, Greg Kroah-Hartman wrote:
> > On Fri, Apr 13, 2018 at 03:07:18PM +0200, Steffen Maier wrote:
> > > diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> > > index a13613d27cee..cffedc26e8a3 100644
> > > --- a/include/trace/events/block.h
> > > +++ b/include/trace/events/block.h
> > > @@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
> > >   TP_ARGS(q),
> > >   TP_STRUCT__entry(
> > > + __field( dev_t, dev )
> > >   __array( char,  comm,   TASK_COMM_LEN   )
> > >   ),
> > >   TP_fast_assign(
> > > + __entry->dev = q->kobj.parent ?
> > > + container_of(q->kobj.parent, struct device, kobj)->devt : 0;
> > 
> > That really really really scares me.  It feels very fragile and messing
> > with parent pointers is ripe for things breaking in the future in odd
> > and unexplainable ways.
> > 
> > And how can the parent be NULL?
> 
> I'm hoping for help by block layer experts.
> 
> I suppose block device unplug/removal could be a case. But I don't know the
> details how this works and if object release is protected while I/O is
> pending and new I/O is rejected beforehand. That might make my approach safe
> as it would not call the trace functions while the parent pointer changes.

We quiesce and freeze the queue before tearing it down, so it won't be
NULL while we're still doing I/O. Cc'ing Bart in case I'm lying to you,
though ;)

> > I don't know the block layer but this feels very wrong to me.  Are you
> > sure there isn't some other way to get this info?
> 
> No, I'm not sure at all. But I'm no block layer expert either. This is just
> an idea I had which did work for my cases and I'm looking for confirmation
> or denial by the experts. So if it's not safe from a block layer point of
> view either, then I have to ditch it.

There's not a pretty way to do this, but using the proper helpers would
be preferred:

disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)))

The parent of a request_queue is always a gendisk, so this should always
work.

> -- 
> Mit freundlichen Grüßen / Kind regards
> Steffen Maier
> 
> Linux on z Systems Development
> 
> IBM Deutschland Research & Development GmbH
> Vorsitzende des Aufsichtsrats: Martina Koederitz
> Geschaeftsfuehrung: Dirk Wittkopp
> Sitz der Gesellschaft: Boeblingen
> Registergericht: Amtsgericht Stuttgart, HRB 243294
> 


Re: [PATCH] MAINTAINERS: Remove me as maintainer of bcache

2018-04-19 Thread Jens Axboe
On 4/19/18 1:11 PM, Michael Lyle wrote:
> On 04/19/2018 11:10 AM, Jens Axboe wrote:
>> On 4/19/18 11:59 AM, Michael Lyle wrote:
>>> Too much to do with other projects.  I've enjoyed working with everyone
>>> here, and hope to occasionally contribute on bcache.
> 
> Jens---
> 
>> It's been a pleasure having someone maintain it upstream, so this is
>> pretty sad.
> 
> I've really enjoyed working with everyone here.  The queue keeps
> stacking up and I feel pretty bad when people need to wait a couple
> weeks for reviews and assistance.

I don't blame you, and I think it is better to step down if the time
just isn't there.

>> Do you have anyone in mind for taking over the upstream part? The
>> community seems fairly active, and there are regular contributors.
>>
>> Coly and Tang Junhui come to mind.
> 
> I think Coly would be a great maintainer.

I concur. Coly?

-- 
Jens Axboe



Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-19 Thread Omar Sandoval
On Thu, Apr 19, 2018 at 11:53:30AM -0700, Omar Sandoval wrote:
> Thanks for the test! Applied.

Side note, it's unfortunate that this test takes 180 seconds to run only
because we have to wait for the command timeout. We should be able to
export request_queue->rq_timeout writeable in sysfs. Would you be
interested in doing that?


Re: [PATCH] MAINTAINERS: Remove me as maintainer of bcache

2018-04-19 Thread Michael Lyle
On 04/19/2018 11:10 AM, Jens Axboe wrote:
> On 4/19/18 11:59 AM, Michael Lyle wrote:
>> Too much to do with other projects.  I've enjoyed working with everyone
>> here, and hope to occasionally contribute on bcache.

Jens---

> It's been a pleasure having someone maintain it upstream, so this is
> pretty sad.

I've really enjoyed working with everyone here.  The queue keeps
stacking up and I feel pretty bad when people need to wait a couple
weeks for reviews and assistance.

> Do you have anyone in mind for taking over the upstream part? The
> community seems fairly active, and there are regular contributors.
> 
> Coly and Tang Junhui come to mind.

I think Coly would be a great maintainer.

Mike


Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-19 Thread Jens Axboe
On 4/19/18 10:43 AM, Bart Van Assche wrote:
> The blk-mq timeout handling code ignores completions that occur after
> blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> has reset rq->aborted_gstate. If a block driver timeout handler always
> returns BLK_EH_RESET_TIMER then the result will be that the request
> never terminates.
> 
> Fix this race as follows:
> - Use the deadline instead of the request generation to detect whether
>   or not a request timer fired after reinitialization of a request.
> - Store the request state in the lowest two bits of the deadline instead
>   of the lowest two bits of 'gstate'.
> - Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
>   enumeration member into a #define such that its type can be changed
>   into unsigned long. That allows to write & ~RQ_STATE_MASK instead of
>   ~(unsigned long)RQ_STATE_MASK.
> - Remove all request member variables that became superfluous due to
>   this change: gstate, gstate_seq and aborted_gstate_sync.
> - Remove the request state information that became superfluous due to this
>   patch, namely RQF_MQ_TIMEOUT_EXPIRED.
> - Remove the code that became superfluous due to this change, namely
>   the RCU lock and unlock statements in blk_mq_complete_request() and
>   also the synchronize_rcu() call in the timeout handler.

I like this approach, it's simpler and easier to understand. Net
reduction in code is nice too. I've run it through the testing and it
passes. Unless people holler loudly, I'd like to include this in 4.17.

-- 
Jens Axboe



Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-19 Thread Omar Sandoval
On Tue, Apr 17, 2018 at 11:03:37AM +0200, Steffen Maier wrote:
> Signed-off-by: Steffen Maier 
> ---
>  tests/scsi/004 |   59 
> 
>  tests/scsi/004.out |3 ++
>  2 files changed, 62 insertions(+), 0 deletions(-)
>  create mode 100755 tests/scsi/004
>  create mode 100644 tests/scsi/004.out
> 
> diff --git a/tests/scsi/004 b/tests/scsi/004
> new file mode 100755
> index 000..4852efc
> --- /dev/null
> +++ b/tests/scsi/004
> @@ -0,0 +1,59 @@
> +#!/bin/bash
> +#
> +# Ensure repeated SAM_STAT_TASK_SET_FULL results in EIO on timing out 
> command.
> +#
> +# Regression test for commit cbe095e2b584 ("Revert "scsi: core: return
> +# BLK_STS_OK for DID_OK in __scsi_error_from_host_byte()"")
> +#
> +# Found independently of corresponding commit mail threads while
> +# experimenting with storage mirroring. This test is a storage-independent
> +# reproducer for the error case I ran into.
> +#
> +# Copyright IBM Corp. 2018
> +# Author: Steffen Maier 
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation, either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +. common/scsi_debug
> +
> +DESCRIPTION="ensure repeated TASK SET FULL results in EIO on timing out 
> command"
> +
> +requires() {
> + _have_scsi_debug
> +}
> +
> +test() {
> + echo "Running ${TEST_NAME}"
> +
> + if ! _init_scsi_debug add_host=1 max_luns=1 statistics=1 every_nth=1; 
> then
> + return 1
> + fi
> + # every_nth RW with full queue gets SAM_STAT_TASK_SET_FULL
> + echo 0x800 > /sys/bus/pseudo/drivers/scsi_debug/opts
> + # delay to reduce response repetition: around 1..10sec depending on HZ
> + echo 1000 > /sys/bus/pseudo/drivers/scsi_debug/delay
> + # a single command fills device queue to satisfy 0x800 opts condition
> + echo 1 > "/sys/block/${SCSI_DEBUG_DEVICES[0]}/device/queue_depth"
> + dd if="/dev/${SCSI_DEBUG_DEVICES[0]}" iflag=direct of=/dev/null bs=512 
> count=1 |& grep -o "Input/output error"
> + # stop injection
> + echo 0 > /sys/bus/pseudo/drivers/scsi_debug/opts
> + # dd closing SCSI disk causes implicit TUR also being delayed once
> + while grep -q -F "in_use_bm BUSY:" 
> "/proc/scsi/scsi_debug/${SCSI_DEBUG_HOSTS[0]}"; do
> + sleep 1
> + done
> + echo 1 > /sys/bus/pseudo/drivers/scsi_debug/delay
> + _exit_scsi_debug
> +
> + echo "Test complete"
> +}
> diff --git a/tests/scsi/004.out b/tests/scsi/004.out
> new file mode 100644
> index 000..b1126fb
> --- /dev/null
> +++ b/tests/scsi/004.out
> @@ -0,0 +1,3 @@
> +Running scsi/004
> +Input/output error
> +Test complete
> -- 
> 1.7.1
> 

Thanks for the test! Applied.


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jeff Layton
On Thu, 2018-04-19 at 13:26 -0400, Jerome Glisse wrote:
> On Thu, Apr 19, 2018 at 12:58:39PM -0400, Jeff Layton wrote:
> > On Thu, 2018-04-19 at 12:30 -0400, Jerome Glisse wrote:
> > > On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> > > > On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > > > > Oh can i get one more small slot for fs ? I want to ask if they are
> > > > > any people against having a callback everytime a struct file is added
> > > > > to a task_struct and also having a secondary array so that special
> > > > > file like device file can store something opaque per task_struct per
> > > > > struct file.
> > > > 
> > > > Do you really want something per _thread_, and not per _mm_?
> > > 
> > > Well per mm would be fine but i do not see how to make that happen with
> > > reasonable structure. So issue is that you can have multiple task with
> > > same mm but different file descriptors (or am i wrong here ?) thus there
> > > would be no easy way given a struct file to lookup the per mm struct.
> > > 
> > > So as a not perfect solution i see a new array in filedes which would
> > > allow device driver to store a pointer to their per mm data structure.
> > > To be fair usualy you will only have a single fd in a single task for
> > > a given device.
> > > 
> > > If you see an easy way to get a per mm per inode pointer store somewhere
> > > with easy lookup i am all ears :)
> > > 
> > 
> > I may be misunderstanding, but to be clear: struct files don't get
> > added to a thread, per-se.
> > 
> > When userland calls open() or similar, the struct file gets added to
> > the files_struct. Those are generally shared with other threads within
> > the same process. The files_struct can also be shared with other
> > processes if you clone() with the right flags.
> > 
> > Doing something per-thread on every open may be rather difficult to do.
> 
> Basically, I want a callback in __fd_install(), do_dup2(), and dup_fd(), and
> to add void **private_data; to struct fdtable (plus a default array in
> struct files_struct). The callback would be part of struct file_operations
> and only called if it exists (so the overhead only affects device drivers
> that care).
> 
> Did I miss something fundamental? copy_files() calls dup_fd(), so I
> should be all set here.
> 
> I will work on patches; I was hoping this would not be too much work.
> 

No, I think I misunderstood. I was thinking you wanted to iterate over
all of the threads that might be associated with a struct file, and
that's rather non-trivial.

A callback when you add a file to the files_struct seems like it would
probably be OK (in principle).
-- 
Jeff Layton 


[LSF/MM TOPIC] Extended copy

2018-04-19 Thread Bart Van Assche
During a recent discussion about peer-to-peer PCIe transfers it was brought
up that it would be desirable to invoke this functionality from user space.
One possible approach is to add support for an "extended copy" operation to
the block layer and to make it possible to use that functionality from user
space. So far two different approaches have been proposed to add such
functionality to the block layer:
1. Implement extended copy as a single block layer operation. This is not
   compatible with the current device mapper design.
2. Implement extended copy as two separate block layer operations (read +
   write). With this approach there is a risk of starvation, namely if
   resources are low and only one of the two operations is propagated down
   to lower layers.

I propose to discuss this further during the LSF/MM, e.g. whether or not we
should modify the device mapper such that approach (1) can be followed.

More information is available at:
* Jake Edge, Copy Offload, LWN.net, 24 April 2013
  (https://lwn.net/Articles/548347/).
* Jake Edge, Copy Offload, LWN.net, 2 April 2014
  (https://lwn.net/Articles/592094/).
* Martin Petersen, Copy Offload, linux-scsi, 28 May 2014
  (https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg28998.html).
* Mikulas Patocka, ANNOUNCE: SCSI XCOPY support for the kernel and device
  mapper, 15 July 2014
  (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg686111.html).
* Mikulas Patocka, SCSI XCOPY support for the kernel and device mapper,
  linux-scsi mailing list, 22 October 2014
  (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg749113.html /
   https://lkml.org/lkml/2014/10/22/317 /
   http://thread.gmane.org/gmane.linux.kernel/1811151 /
   http://people.redhat.com/~mpatocka/patches/kernel/xcopy/current/).
* Jake Edge, Copy Offload, 25 March 2015 (https://lwn.net/Articles/637436/).
* Mikulas Patocka, [PATCH 0/15] copy offload patches, 10 December 2015
  (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1039890.html /
   https://lkml.org/lkml/2015/12/10/517 /
   https://www.redhat.com/archives/dm-devel/2015-December/msg00117.html).
* Logan Gunthorpe, [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory, 30
  March 2017 
(https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1367097.html).
* Logan Gunthorpe, [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI
  Memory, 28 February 2018
  (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1623231.html).
* Logan Gunthorpe, [PATCH v3 00/11] Copy Offload in NVMe Fabrics with P2P PCI
  Memory, 12 March 2018
  (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1634545.html).





Re: [PATCH] MAINTAINERS: Remove me as maintainer of bcache

2018-04-19 Thread Jens Axboe
On 4/19/18 11:59 AM, Michael Lyle wrote:
> Too much to do with other projects.  I've enjoyed working with everyone
> here, and hope to occasionally contribute to bcache.

It's been a pleasure having someone maintain it upstream, so this is
pretty sad.

Do you have anyone in mind for taking over the upstream part? The
community seems fairly active, and there are regular contributors.

Coly and Tang Junhui come to mind.

-- 
Jens Axboe



Re: extra part in bcache patch commit 539d39eb2708

2018-04-19 Thread Michael Lyle
Hi everyone--

On Wed, Apr 18, 2018 at 10:17 PM, Coly Li  wrote:
> Hi Michael and Jens
>
> While backporting bcache patches, I found that commit 539d39eb2708
> ("bcache: fix wrong return value in bch_debug_init()") has an extra part
> beyond the original patch which Tang Junhui posted.

Sorry, I must have messed up a rebase.  The patch it was combined with is here:
https://www.spinics.net/lists/linux-bcache/msg05464.html  and was
properly signed off and reviewed.

Mike


[PATCH] MAINTAINERS: Remove me as maintainer of bcache

2018-04-19 Thread Michael Lyle
Too much to do with other projects.  I've enjoyed working with everyone
here, and hope to occasionally contribute to bcache.

Signed-off-by: Michael Lyle 
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index f9bf4651db41..9fd861906832 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2617,7 +2617,6 @@ S:Maintained
 F: drivers/net/hamradio/baycom*
 
 BCACHE (BLOCK LAYER CACHE)
-M: Michael Lyle 
 M: Kent Overstreet 
 L: linux-bca...@vger.kernel.org
 W: http://bcache.evilpiepirate.org
-- 
2.17.0



Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 12:58:39PM -0400, Jeff Layton wrote:
> On Thu, 2018-04-19 at 12:30 -0400, Jerome Glisse wrote:
> > On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> > > On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > > > Oh, can I get one more small slot for fs? I want to ask if there are
> > > > any people against having a callback every time a struct file is added
> > > > to a task_struct, and also having a secondary array so that special
> > > > files like device files can store something opaque per task_struct per
> > > > struct file.
> > > 
> > > Do you really want something per _thread_, and not per _mm_?
> > 
> > Well, per mm would be fine, but I do not see how to make that happen with
> > a reasonable structure. The issue is that you can have multiple tasks with
> > the same mm but different file descriptors (or am I wrong here?), thus there
> > would be no easy way, given a struct file, to look up the per-mm struct.
> > 
> > So as an imperfect solution I see a new array in filedes which would
> > allow device drivers to store a pointer to their per-mm data structure.
> > To be fair, usually you will only have a single fd in a single task for
> > a given device.
> > 
> > If you see an easy way to get a per-mm per-inode pointer stored somewhere
> > with easy lookup, I am all ears :)
> > 
> 
> I may be misunderstanding, but to be clear: struct files don't get
> added to a thread, per se.
> 
> When userland calls open() or similar, the struct file gets added to
> the files_struct. Those are generally shared with other threads within
> the same process. The files_struct can also be shared with other
> processes if you clone() with the right flags.
> 
> Doing something per-thread on every open may be rather difficult to do.

Basically, I want a callback in __fd_install(), do_dup2(), and dup_fd(), and
to add void **private_data; to struct fdtable (plus a default array in
struct files_struct). The callback would be part of struct file_operations
and only called if it exists (so the overhead only affects device drivers
that care).

Did I miss something fundamental? copy_files() calls dup_fd(), so I
should be all set here.

I will work on patches; I was hoping this would not be too much work.

Cheers,
Jérôme


Re: [PATCH 1/2] blkcg: small fix on comment in blkcg_init_queue

2018-04-19 Thread Jens Axboe
On 4/19/18 11:10 AM, Randy Dunlap wrote:
> On 04/19/18 07:52, Jens Axboe wrote:
>> On 4/18/18 10:04 PM, Jiang Biao wrote:
>>> The comment before blkg_create() in blkcg_init_queue() was moved
>>> from blkcg_activate_policy() by commit ec13b1d6f0a0457312e615, but
>>> it no longer suits the new context.
>>
>> Applied - btw, in the future, if you send more than one patch in
>> a series, please include a cover letter.
> 
> Hi Jens,
> 
> Where does that cover letter requirement come from?  It's not documented
> anywhere AFAICT and shouldn't be needed for such a small patch series.

It's not a requirement; I just strongly prefer it for anything that is
more than one patch. Hence I tend to let people know about that
preference if they send patches more than once.

It makes it possible to send comments for the series as a whole.

-- 
Jens Axboe



Re: [PATCH 1/2] blkcg: small fix on comment in blkcg_init_queue

2018-04-19 Thread Randy Dunlap
On 04/19/18 07:52, Jens Axboe wrote:
> On 4/18/18 10:04 PM, Jiang Biao wrote:
>> The comment before blkg_create() in blkcg_init_queue() was moved
>> from blkcg_activate_policy() by commit ec13b1d6f0a0457312e615, but
>> it no longer suits the new context.
> 
> Applied - btw, in the future, if you send more than one patch in
> a series, please include a cover letter.

Hi Jens,

Where does that cover letter requirement come from?  It's not documented
anywhere AFAICT and shouldn't be needed for such a small patch series.

> I've applied 2/2 as well.

thanks,
-- 
~Randy


Re: [PATCH] bsg referencing bus driver module

2018-04-19 Thread Anatoliy Glagolev
+linux-block

On Tue, Apr 17, 2018 at 1:05 PM, Anatoliy Glagolev  wrote:
> Description: bsg_release may crash while decrementing a reference to the
> parent device, with the following stack:
>
> [16834.636216,07] Call Trace:
>  ...   scsi_proc_hostdir_rm
> [16834.641944,07]  [] scsi_host_dev_release+0x3f/0x130
> [16834.647740,07]  [] device_release+0x32/0xa0
> [16834.653423,07]  [] kobject_cleanup+0x77/0x190
> [16834.659002,07]  [] kobject_put+0x25/0x50
> [16834.664430,07]  [] put_device+0x17/0x20
> [16834.669740,07]  [] bsg_kref_release_function+0x24/0x30
> [16834.675007,07]  [] bsg_release+0x166/0x1d0
> [16834.680148,07]  [] __fput+0xcb/0x1d0
> [16834.685156,07]  [] fput+0xe/0x10
> [16834.690017,07]  [] task_work_run+0x86/0xb0
> [16834.694781,07]  [] exit_to_usermode_loop+0x6b/0x9a
> [16834.699466,07]  [] syscall_return_slowpath+0x55/0x60
> [16834.704110,07]  [] int_ret_from_sys_call+0x25/0x9f
>
> if the parent driver implementing the device has been unloaded. To address
> the problem, take a reference to the parent driver module.
>
> Note: this is a follow-up to earlier discussion "[PATCH] Waiting for
> scsi_host_template release".
>
> ---
>  block/bsg.c  | 31 +++
>  drivers/scsi/scsi_transport_fc.c |  3 ++-
>  include/linux/bsg.h  |  5 +
>  3 files changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/block/bsg.c b/block/bsg.c
> index b9a5361..0aa7d74 100644
> --- a/block/bsg.c
> +++ b/block/bsg.c
> @@ -798,7 +798,8 @@ found:
>   return bd;
>  }
>
> -static struct bsg_device *bsg_get_device(struct inode *inode, struct
> file *file)
> +static struct bsg_device *bsg_get_device(struct inode *inode, struct
> file *file,
> + struct bsg_class_device **pbcd)
>  {
>   struct bsg_device *bd;
>   struct bsg_class_device *bcd;
> @@ -814,6 +815,7 @@ static struct bsg_device *bsg_get_device(struct
> inode *inode, struct file *file)
>
>   if (!bcd)
>   return ERR_PTR(-ENODEV);
> + *pbcd = bcd;
>
>   bd = __bsg_get_device(iminor(inode), bcd->queue);
>   if (bd)
> @@ -829,22 +831,34 @@ static struct bsg_device *bsg_get_device(struct
> inode *inode, struct file *file)
>  static int bsg_open(struct inode *inode, struct file *file)
>  {
>   struct bsg_device *bd;
> + struct bsg_class_device *bcd;
>
> - bd = bsg_get_device(inode, file);
> + bd = bsg_get_device(inode, file, &bcd);
>
>   if (IS_ERR(bd))
>   return PTR_ERR(bd);
>
>   file->private_data = bd;
> + if (bcd->parent_module) {
> + if (!try_module_get(bcd->parent_module)) {
> + bsg_put_device(bd);
> + return -ENODEV;
> + }
> + }
>   return 0;
>  }
>
>  static int bsg_release(struct inode *inode, struct file *file)
>  {
> + int ret;
>   struct bsg_device *bd = file->private_data;
> + struct module *parent_module = bd->queue->bsg_dev.parent_module;
>
>   file->private_data = NULL;
> - return bsg_put_device(bd);
> + ret = bsg_put_device(bd);
> + if (parent_module)
> + module_put(parent_module);
> + return ret;
>  }
>
>  static unsigned int bsg_poll(struct file *file, poll_table *wait)
> @@ -977,6 +991,14 @@ EXPORT_SYMBOL_GPL(bsg_unregister_queue);
>  int bsg_register_queue(struct request_queue *q, struct device *parent,
> const char *name, void (*release)(struct device *))
>  {
> + return bsg_register_queue_ex(q, parent, name, release, NULL);
> +}
> +EXPORT_SYMBOL_GPL(bsg_register_queue);
> +
> +int bsg_register_queue_ex(struct request_queue *q, struct device *parent,
> +  const char *name, void (*release)(struct device *),
> +  struct module *parent_module)
> +{
>   struct bsg_class_device *bcd;
>   dev_t dev;
>   int ret;
> @@ -1012,6 +1034,7 @@ int bsg_register_queue(struct request_queue *q,
> struct device *parent,
>   bcd->queue = q;
>   bcd->parent = get_device(parent);
>   bcd->release = release;
> + bcd->parent_module = parent_module;
>   kref_init(&bcd->ref);
>   dev = MKDEV(bsg_major, bcd->minor);
>   class_dev = device_create(bsg_class, parent, dev, NULL, "%s", devname);
> @@ -1039,7 +1062,7 @@ unlock:
>   mutex_unlock(&bsg_mutex);
>   return ret;
>  }
> -EXPORT_SYMBOL_GPL(bsg_register_queue);
> +EXPORT_SYMBOL_GPL(bsg_register_queue_ex);
>
>  static struct cdev bsg_cdev;
>
> diff --git a/drivers/scsi/scsi_transport_fc.c 
> b/drivers/scsi/scsi_transport_fc.c
> index 24eaaf6..c153f80 100644
> --- a/drivers/scsi/scsi_transport_fc.c
> +++ b/drivers/scsi/scsi_transport_fc.c
> @@ -4064,7 +4064,8 @@ fc_bsg_hostadd(struct Scsi_Host *shost, struct
> fc_host_attrs *fc_host)
>   blk_queue_rq_timed_out(q, fc_bsg_job_timeout);
>   blk_queue_rq_timeout(q, FC_DEFAULT_BSG_TIMEOUT);
>
> - err = bsg_register_queue(q, dev, bsg_name, NULL);
> + err = bsg_register_queue_ex(q, dev, bsg_name, NULL,
> + shost->hostt->module);
>   if (err) {
>   printk(KERN_ERR "fc_host%d: bsg interface failed to "
>   "initialize - register queue\n",
> diff --git a/include/linux/bsg.h b/include/linux/bsg.h
> index 7173f6e..fe41e83 100644
> --- a/include/linux/bsg.h
> +++ b/include

WARNING in __device_add_disk

2018-04-19 Thread syzbot

Hello,

syzbot hit the following crash on upstream commit
87ef12027b9b1dd0e0b12cf311fbcb19f9d92539 (Wed Apr 18 19:48:17 2018 +)
Merge tag 'ceph-for-4.17-rc2' of git://github.com/ceph/ceph-client
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=3337db851ace689ceb50


Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5356500264419328
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=1808800213120130118

compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+3337db851ace689ce...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.

RAX: ffda RBX: 7fee9fefc6d4 RCX: 00455329
RDX: 2000 RSI: 4c80 RDI: 0014
RBP: 0072bea0 R08:  R09: 
R10:  R11: 0246 R12: 0015
R13: 02a2 R14: 006f6fd0 R15: 0036
WARNING: CPU: 1 PID: 30128 at block/genhd.c:685  
__device_add_disk+0x104a/0x1340 block/genhd.c:685

Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 30128 Comm: syz-executor3 Not tainted 4.17.0-rc1+ #8
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 panic+0x22f/0x4de kernel/panic.c:184
 __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
 report_bug+0x252/0x2d0 lib/bug.c:186
 fixup_bug arch/x86/kernel/traps.c:178 [inline]
 do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:__device_add_disk+0x104a/0x1340 block/genhd.c:685
RSP: 0018:8801b320f8a8 EFLAGS: 00010246
RAX: 0004 RBX: 8801d7fbe880 RCX: c90007714000
RDX: 0004 RSI: 83434d6a RDI: 0005
RBP: 8801b320fa78 R08: 8801a9364400 R09: ed003b5e46c2
R10: ed003b5e46c2 R11: 8801daf23613 R12: fff4
R13: 8801b320fa50 R14: 8801d7fbede0 R15: 8801d7fbe884
 device_add_disk+0x22/0x30 block/genhd.c:705
 add_disk include/linux/genhd.h:397 [inline]
 loop_add+0x70b/0x9c0 drivers/block/loop.c:1873
 loop_control_ioctl+0x178/0x500 drivers/block/loop.c:1970
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1cf/0x16a0 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 __do_sys_ioctl fs/ioctl.c:708 [inline]
 __se_sys_ioctl fs/ioctl.c:706 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455329
RSP: 002b:7fee9fefbc68 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7fee9fefc6d4 RCX: 00455329
RDX: 2000 RSI: 4c80 RDI: 0014
RBP: 0072bea0 R08:  R09: 
R10:  R11: 0246 R12: 0015
R13: 02a2 R14: 006f6fd0 R15: 0036
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is
merged into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.

Note: all commands must start from beginning of the line in the email body.


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jeff Layton
On Thu, 2018-04-19 at 12:30 -0400, Jerome Glisse wrote:
> On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > > Oh, can I get one more small slot for fs? I want to ask if there are
> > > any people against having a callback every time a struct file is added
> > > to a task_struct, and also having a secondary array so that special
> > > files like device files can store something opaque per task_struct per
> > > struct file.
> > 
> > Do you really want something per _thread_, and not per _mm_?
> 
> Well, per mm would be fine, but I do not see how to make that happen with
> a reasonable structure. The issue is that you can have multiple tasks with
> the same mm but different file descriptors (or am I wrong here?), thus there
> would be no easy way, given a struct file, to look up the per-mm struct.
> 
> So as an imperfect solution I see a new array in filedes which would
> allow device drivers to store a pointer to their per-mm data structure.
> To be fair, usually you will only have a single fd in a single task for
> a given device.
> 
> If you see an easy way to get a per-mm per-inode pointer stored somewhere
> with easy lookup, I am all ears :)
> 

I may be misunderstanding, but to be clear: struct files don't get
added to a thread, per se.

When userland calls open() or similar, the struct file gets added to
the files_struct. Those are generally shared with other threads within
the same process. The files_struct can also be shared with other
processes if you clone() with the right flags.

Doing something per-thread on every open may be rather difficult to do.
-- 
Jeff Layton 


[PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-19 Thread Bart Van Assche
The blk-mq timeout handling code ignores completions that occur after
blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
has reset rq->aborted_gstate. If a block driver timeout handler always
returns BLK_EH_RESET_TIMER then the result will be that the request
never terminates.

Fix this race as follows:
- Use the deadline instead of the request generation to detect whether
  or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
  of the lowest two bits of 'gstate'.
- Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
  enumeration member into a #define such that its type can be changed
  into unsigned long. That allows writing & ~RQ_STATE_MASK instead of
  ~(unsigned long)RQ_STATE_MASK.
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to this
  patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
Cc:  # v4.16
---

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to avoid that timeout handling
  starts while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also 
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.


 block/blk-core.c   |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 158 ++---
 block/blk-mq.h |  85 +-
 block/blk-timeout.c|  89 
 block/blk.h|  13 ++--
 include/linux/blkdev.h |  29 +++--
 7 files changed, 154 insertions(+), 227 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index de90ecab61cd..730a8e3be7ce 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -200,12 +200,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
-   seqcount_init(&rq->gstate_seq);
-   u64_stats_init(&rq->aborted_gstate_sync);
-   /*
-* See comment of blk_mq_init_request
-*/
-   WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index adb8d6f00098..529383841b3b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -346,7 +346,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bb7f59d319fa..6f20845827f4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -481,7 +481,8 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+   WARN_ON_ONCE(true);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
@@ -527,8 +528,7 @@ static void __blk_mq_complete_req

[PATCH 0/3] Rework write error handling in pblk

2018-04-19 Thread Hans Holmberg
From: Hans Holmberg 

This patch series fixes the (currently incomplete) write error handling
in pblk by:

 * queuing and re-submitting failed writes in the write buffer
 * evacuating valid data in lines with write failures, so the
   chunk(s) with write failures can be reset to a known state by the fw

Lines with failures in smeta are put back on the free list.
Failed chunks will be reset on the next use.

If a write fails in emeta, the lba list is cached so the line can be
garbage collected without scanning the out-of-band area.

Hans Holmberg (3):
  lightnvm: pblk: rework write error recovery path
  lightnvm: pblk: garbage collect lines with failed writes
  lightnvm: pblk: fix smeta write error path

 drivers/lightnvm/pblk-core.c |  50 +++-
 drivers/lightnvm/pblk-gc.c   |  79 -
 drivers/lightnvm/pblk-init.c |  41 +--
 drivers/lightnvm/pblk-recovery.c |  91 ---
 drivers/lightnvm/pblk-rl.c   |  29 -
 drivers/lightnvm/pblk-sysfs.c|  15 ++-
 drivers/lightnvm/pblk-write.c| 243 ---
 drivers/lightnvm/pblk.h  |  33 --
 8 files changed, 362 insertions(+), 219 deletions(-)

-- 
2.7.4



[PATCH 2/3] lightnvm: pblk: garbage collect lines with failed writes

2018-04-19 Thread Hans Holmberg
From: Hans Holmberg 

Write failures should not happen under normal circumstances,
so in order to bring the chunk back into a known state as soon
as possible, evacuate all the valid data out of the line and let the
fw judge if the block can be written to in the next reset cycle.

Do this by introducing a new gc list for lines with failed writes,
and ensure that the rate limiter allocates a small portion of
the write bandwidth to get the job done.

The lba list is saved in memory for use during gc as we
cannot guarantee that the emeta data is readable if a write
error occurred.

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-core.c  | 43 +--
 drivers/lightnvm/pblk-gc.c| 79 +++
 drivers/lightnvm/pblk-init.c  | 39 ++---
 drivers/lightnvm/pblk-rl.c| 29 +---
 drivers/lightnvm/pblk-sysfs.c | 15 ++--
 drivers/lightnvm/pblk-write.c |  2 ++
 drivers/lightnvm/pblk.h   | 25 +++---
 7 files changed, 178 insertions(+), 54 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 7762e89..f6135e4 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -373,7 +373,13 @@ struct list_head *pblk_line_gc_list(struct pblk *pblk, 
struct pblk_line *line)
 
lockdep_assert_held(&line->lock);
 
-   if (!vsc) {
+   if (line->w_err_gc->has_write_err) {
+   if (line->gc_group != PBLK_LINEGC_WERR) {
+   line->gc_group = PBLK_LINEGC_WERR;
+   move_list = &l_mg->gc_werr_list;
+   pblk_rl_werr_line_in(&pblk->rl);
+   }
+   } else if (!vsc) {
if (line->gc_group != PBLK_LINEGC_FULL) {
line->gc_group = PBLK_LINEGC_FULL;
move_list = &l_mg->gc_full_list;
@@ -1603,8 +1609,13 @@ static void __pblk_line_put(struct pblk *pblk, struct 
pblk_line *line)
line->state = PBLK_LINESTATE_FREE;
line->gc_group = PBLK_LINEGC_NONE;
pblk_line_free(line);
-   spin_unlock(&line->lock);
 
+   if (line->w_err_gc->has_write_err) {
+   pblk_rl_werr_line_out(&pblk->rl);
+   line->w_err_gc->has_write_err = 0;
+   }
+
+   spin_unlock(&line->lock);
atomic_dec(&gc->pipeline_gc);
 
spin_lock(&l_mg->free_lock);
@@ -1767,11 +1778,32 @@ void pblk_line_close_meta(struct pblk *pblk, struct 
pblk_line *line)
 
spin_lock(&l_mg->close_lock);
spin_lock(&line->lock);
+
+   /* Update the in-memory start address for emeta, in case it has
+* shifted due to write errors
+*/
+   if (line->emeta_ssec != line->cur_sec)
+   line->emeta_ssec = line->cur_sec;
+
list_add_tail(&line->list, &l_mg->emeta_list);
spin_unlock(&line->lock);
spin_unlock(&l_mg->close_lock);
 
pblk_line_should_sync_meta(pblk);
+
+
+}
+
+static void pblk_save_lba_list(struct pblk *pblk, struct pblk_line *line)
+{
+   struct pblk_line_meta *lm = &pblk->lm;
+   unsigned int lba_list_size = lm->emeta_len[2];
+   struct pblk_w_err_gc *w_err_gc = line->w_err_gc;
+   struct pblk_emeta *emeta = line->emeta;
+
+   w_err_gc->lba_list = kmalloc(lba_list_size, GFP_KERNEL);
+   memcpy(w_err_gc->lba_list, emeta_to_lbas(pblk, emeta->buf),
+   lba_list_size);
 }
 
 void pblk_line_close_ws(struct work_struct *work)
@@ -1780,6 +1812,13 @@ void pblk_line_close_ws(struct work_struct *work)
ws);
struct pblk *pblk = line_ws->pblk;
struct pblk_line *line = line_ws->line;
+   struct pblk_w_err_gc *w_err_gc = line->w_err_gc;
+
+   /* Write errors make the emeta start address stored in smeta invalid,
+* so keep a copy of the lba list until we've gc'd the line
+*/
+   if (w_err_gc->has_write_err)
+   pblk_save_lba_list(pblk, line);
 
pblk_line_close(pblk, line);
mempool_free(line_ws, pblk->gen_ws_pool);
diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index b0cc277..62f0548 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -138,10 +138,10 @@ static void pblk_gc_line_prepare_ws(struct work_struct 
*work)
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_gc *gc = &pblk->gc;
-   struct line_emeta *emeta_buf;
+   struct line_emeta *emeta_buf = NULL;
struct pblk_line_ws *gc_rq_ws;
struct pblk_gc_rq *gc_rq;
-   __le64 *lba_list;
+   __le64 *lba_list = NULL;
unsigned long *invalid_bitmap;
int sec_left, nr_secs, bit;
int ret;
@@ -150,34 +150,42 @@ static void pblk_gc_line_prepare_ws(struct work_struct 
*work)
if (!invalid_bitmap)
goto fail_free_ws;
 
-   emeta_b

[PATCH 3/3] lightnvm: pblk: fix smeta write error path

2018-04-19 Thread Hans Holmberg
From: Hans Holmberg 

Smeta write errors were previously ignored. Skip these
lines instead and throw them back on the free
list, so the chunks will go through a reset cycle
before we attempt to use the line again.

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-core.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index f6135e4..485fe8c 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -849,9 +849,10 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, 
struct pblk_line *line,
atomic_dec(&pblk->inflight_io);
 
if (rqd.error) {
-   if (dir == PBLK_WRITE)
+   if (dir == PBLK_WRITE) {
pblk_log_write_err(pblk, &rqd);
-   else if (dir == PBLK_READ)
+   ret = 1;
+   } else if (dir == PBLK_READ)
pblk_log_read_err(pblk, &rqd);
}
 
@@ -1120,7 +1121,7 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
 
if (init && pblk_line_submit_smeta_io(pblk, line, off, PBLK_WRITE)) {
pr_debug("pblk: line smeta I/O failed. Retry\n");
-   return 1;
+   return 0;
}
 
bitmap_copy(line->invalid_bitmap, line->map_bitmap, lm->sec_per_line);
-- 
2.7.4



[PATCH 1/3] lightnvm: pblk: rework write error recovery path

2018-04-19 Thread Hans Holmberg
From: Hans Holmberg 

The write error recovery path is incomplete, so rework
the write error recovery handling to do resubmits directly
from the write buffer.

When a write error occurs, the remaining sectors in the chunk are
mapped out and invalidated and the request inserted in a resubmit list.

The writer thread checks if there are any requests to resubmit,
scans and invalidates any lbas that have been overwritten by later
writes and resubmits the failed entries.

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-init.c |   2 +
 drivers/lightnvm/pblk-recovery.c |  91 ---
 drivers/lightnvm/pblk-write.c| 241 ---
 drivers/lightnvm/pblk.h  |   8 +-
 4 files changed, 180 insertions(+), 162 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index bfc488d..6f06727 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -426,6 +426,7 @@ static int pblk_core_init(struct pblk *pblk)
goto free_r_end_wq;
 
INIT_LIST_HEAD(&pblk->compl_list);
+   INIT_LIST_HEAD(&pblk->resubmit_list);
 
return 0;
 
@@ -1185,6 +1186,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
pblk->state = PBLK_STATE_RUNNING;
pblk->gc.gc_enabled = 0;
 
+   spin_lock_init(&pblk->resubmit_lock);
spin_lock_init(&pblk->trans_lock);
spin_lock_init(&pblk->lock);
 
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 9cb6d5d..5983428 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -16,97 +16,6 @@
 
 #include "pblk.h"
 
-void pblk_submit_rec(struct work_struct *work)
-{
-   struct pblk_rec_ctx *recovery =
-   container_of(work, struct pblk_rec_ctx, ws_rec);
-   struct pblk *pblk = recovery->pblk;
-   struct nvm_rq *rqd = recovery->rqd;
-   struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
-   struct bio *bio;
-   unsigned int nr_rec_secs;
-   unsigned int pgs_read;
-   int ret;
-
-   nr_rec_secs = bitmap_weight((unsigned long int *)&rqd->ppa_status,
-   NVM_MAX_VLBA);
-
-   bio = bio_alloc(GFP_KERNEL, nr_rec_secs);
-
-   bio->bi_iter.bi_sector = 0;
-   bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
-   rqd->bio = bio;
-   rqd->nr_ppas = nr_rec_secs;
-
-   pgs_read = pblk_rb_read_to_bio_list(&pblk->rwb, bio, &recovery->failed,
-   nr_rec_secs);
-   if (pgs_read != nr_rec_secs) {
-   pr_err("pblk: could not read recovery entries\n");
-   goto err;
-   }
-
-   if (pblk_setup_w_rec_rq(pblk, rqd, c_ctx)) {
-   pr_err("pblk: could not setup recovery request\n");
-   goto err;
-   }
-
-#ifdef CONFIG_NVM_DEBUG
-   atomic_long_add(nr_rec_secs, &pblk->recov_writes);
-#endif
-
-   ret = pblk_submit_io(pblk, rqd);
-   if (ret) {
-   pr_err("pblk: I/O submission failed: %d\n", ret);
-   goto err;
-   }
-
-   mempool_free(recovery, pblk->rec_pool);
-   return;
-
-err:
-   bio_put(bio);
-   pblk_free_rqd(pblk, rqd, PBLK_WRITE);
-}
-
-int pblk_recov_setup_rq(struct pblk *pblk, struct pblk_c_ctx *c_ctx,
-   struct pblk_rec_ctx *recovery, u64 *comp_bits,
-   unsigned int comp)
-{
-   struct nvm_rq *rec_rqd;
-   struct pblk_c_ctx *rec_ctx;
-   int nr_entries = c_ctx->nr_valid + c_ctx->nr_padded;
-
-   rec_rqd = pblk_alloc_rqd(pblk, PBLK_WRITE);
-   rec_ctx = nvm_rq_to_pdu(rec_rqd);
-
-   /* Copy completion bitmap, but exclude the first X completed entries */
-   bitmap_shift_right((unsigned long int *)&rec_rqd->ppa_status,
-   (unsigned long int *)comp_bits,
-   comp, NVM_MAX_VLBA);
-
-   /* Save the context for the entries that need to be re-written and
-* update current context with the completed entries.
-*/
-   rec_ctx->sentry = pblk_rb_wrap_pos(&pblk->rwb, c_ctx->sentry + comp);
-   if (comp >= c_ctx->nr_valid) {
-   rec_ctx->nr_valid = 0;
-   rec_ctx->nr_padded = nr_entries - comp;
-
-   c_ctx->nr_padded = comp - c_ctx->nr_valid;
-   } else {
-   rec_ctx->nr_valid = c_ctx->nr_valid - comp;
-   rec_ctx->nr_padded = c_ctx->nr_padded;
-
-   c_ctx->nr_valid = comp;
-   c_ctx->nr_padded = 0;
-   }
-
-   recovery->rqd = rec_rqd;
-   recovery->pblk = pblk;
-
-   return 0;
-}
-
 int pblk_recov_check_emeta(struct pblk *pblk, struct line_emeta *emeta_buf)
 {
u32 crc;
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 3e6f1eb..ab45157 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/driv

Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > Oh can i get one more small slot for fs ? I want to ask if they are
> > any people against having a callback everytime a struct file is added
> > to a task_struct and also having a secondary array so that special
> > file like device file can store something opaque per task_struct per
> > struct file.
> 
> Do you really want something per _thread_, and not per _mm_?

Well, per mm would be fine, but I do not see how to make that happen with
a reasonable structure. The issue is that you can have multiple tasks with
the same mm but different file descriptors (or am I wrong here?), thus
there would be no easy way, given a struct file, to look up the per-mm
struct.

So, as a not-perfect solution, I see a new array in filedes which would
allow a device driver to store a pointer to its per-mm data structure.
To be fair, usually you will only have a single fd in a single task for
a given device.

If you see an easy way to get a per-mm, per-inode pointer stored somewhere
with easy lookup, I am all ears :)

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Martin K. Petersen

Chris,

>> I'd like to propose that we compact the fs sessions so that we get a
>> 3-slot session reserved for "Individual filesystem discussions" one
>> afternoon. That way we've got time in the schedule for the all the
>> ext4/btrfs/XFS/NFS/CIFS devs to get together with each other and
>> talk about things of interest only to their own fileystems.
>>
>> That means we all don't have to find time outside the schedule to do
>> this, and think this wold be time very well spent for most fs people
>> at the conf
>
> I'd love this as well.

Based on feedback last year we explicitly added a third day to LSF/MM to
facilitate hack time and project meetings.

As usual the schedule is fluid and will be adjusted on the fly.
Depending on track, I am hoping we'll be done with the scheduled topics
either at the end of Tuesday or Wednesday morning.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 1/2] blkcg: small fix on comment in blkcg_init_queue

2018-04-19 Thread Jens Axboe
On 4/18/18 10:04 PM, Jiang Biao wrote:
> The comment before blkg_create() in blkcg_init_queue() was moved
> from blkcg_activate_policy() by commit ec13b1d6f0a0457312e615, but
> it does not suit for the new context.

Applied - btw, in the future, if you send more than one patch in
a series, please include a cover letter.

I've applied 2/2 as well.

-- 
Jens Axboe



Re: [RFC PATCH 00/79] Generic page write protection and a solution to page waitqueue

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 12:32:50PM +0200, Jan Kara wrote:
> On Wed 18-04-18 11:54:30, Jerome Glisse wrote:

[...]

> > I am affraid truely generic write protection for metadata pages is bit
> > out of scope of what i am doing. However the mechanism i am proposing
> > can be extended for that too. Issue is that all place that want to write
> > to those page need to be converted to something where write happens
> > between write_begin and write_end section (mmap and CPU pte does give
> > this implicitly through page fault, so does write syscall). Basicly
> > there is a need to make sure that write and write protection can be
> > ordered against one another without complex locking.
> 
> I understand metadata pages are not interesting for your use case. However
> from mm point of view these are page cache pages as any other. So maybe my
> question should have been: How do we make sure this mechanism will not be
> used for pages for which it cannot work?

Oh, that one is easy: the API takes vma + addr, or rather mm struct + addr
(i.e. like KSM today, kind of). I will change the wording in v1 to "almost
generic write protection" :) or "process page write protection" (but this
would not work for special pfn/vma, so it is not generic there either).

> > > > A write protected page has page->mapping pointing to a structure like
> > > > struct rmap_item for KSM. So this structure has a list for each unique
> > > > combination:
> > > > struct write_protect {
> > > > struct list_head *mappings; /* write_protect_mapping list */
> > > > ...
> > > > };
> > > > 
> > > > struct write_protect_mapping {
> > > > struct list_head list
> > > > struct address_space *mapping;
> > > > unsigned long offset;
> > > > unsigned long private;
> > > > ...
> > > > };
> > > 
> > > Auch, the fact that we could share a page as data storage for several
> > > inode+offset combinations that are not sharing underlying storage just
> > > looks viciously twisted ;) But is it really that useful to warrant
> > > complications? In particular I'm afraid that filesystems expect 
> > > consistency
> > > between their internal state (attached to page->private) and page state
> > > (e.g. page->flags) and when there are multiple internal states attached to
> > > the same page this could go easily wrong...
> > 
> > So at first i want to limit to write protect (not KSM) thus page->flags
> > will stay consistent (ie page is only ever associated with a single
> > mapping). For KSM yes the page->flags can be problematic, however here
> > we can assume that page is clean (and uptodate) and not under write
> > back. So problematic flags for KSM:
> >   - private (page_has_buffers() or PagePrivate (nfs, metadata, ...))
> >   - private_2 (FsCache)
> >   - mappedtodisk
> >   - swapcache
> >   - error
> > 
> > Idea again would be to PageFlagsWithMapping(page, mapping) so that for
> > non KSM write protected page you test the usual page->flags and for
> > write protected page you find the flag value using mapping as lookup
> > index. Usualy those flag are seldomly changed/accessed. Again the
> > overhead (ignoring code size) would only be for page which are KSM.
> > So maybe KSM will not make sense because perf overhead it has with
> > page->flags access (i don't think so but i haven't tested this).
> 
> Yeah, sure, page->flags could be dealt with in a similar way but at this
> point I don't think it's worth it. And without page->flags I don't think
> abstracting page->private makes much sense - or am I missing something why
> you need page->private depend on the mapping? So what I wanted to suggest
> is that we leave page->private as is currently and just concentrate on
> page->mapping hacks...

Well, I wanted to go up to KSM, or at least as close as possible to KSM
for file-backed pages. But I can focus on page->mapping first, do write
protection with that, and also do the per-page wait queue for the page
lock, which I believe are both nice features. This will also make the
patchset smaller and easier to review (less scary).

KSM can be done on top of that later, and I will be happy to help. I
have a bunch of coccinelle patches for page->private and page->index,
and I can do some for page->flags.

Cheers,
Jérôme


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Chris Mason
On 18 Apr 2018, at 21:55, Dave Chinner wrote:

> On Wed, Apr 18, 2018 at 05:19:39PM -0400, Jerome Glisse wrote:
>> Just wanted to suggest to push HMM status down one slot in the
>> agenda to avoid having FS and MM first going into their own
>> room and then merging back for GUP and DAX, and re-splitting
>> after. More over HMM and NUMA talks will be good to have back
>> to back as they deal with same kind of thing mostly.
>
> So while we are talking about schedule suggestions, we see that
> there's lots of empty slots in the FS track. We (xfs guys) were just
> chatting on #xfs about whether we'd have time to have a "XFS devel
> meeting" at some point during LSF/MM as we are rarely in the same
> place at the same time.
>
> I'd like to propose that we compact the fs sessions so that we get a
> 3-slot session reserved for "Individual filesystem discussions" one
> afternoon. That way we've got time in the schedule for the all the
> ext4/btrfs/XFS/NFS/CIFS devs to get together with each other and
> talk about things of interest only to their own fileystems.
>
> That means we all don't have to find time outside the schedule to do
> this, and think this wold be time very well spent for most fs people
> at the conf

I'd love this as well.

-chris


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Matthew Wilcox
On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> Oh can i get one more small slot for fs ? I want to ask if they are
> any people against having a callback everytime a struct file is added
> to a task_struct and also having a secondary array so that special
> file like device file can store something opaque per task_struct per
> struct file.

Do you really want something per _thread_, and not per _mm_?


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jerome Glisse
On Thu, Apr 19, 2018 at 11:55:08AM +1000, Dave Chinner wrote:
> On Wed, Apr 18, 2018 at 05:19:39PM -0400, Jerome Glisse wrote:
> > Just wanted to suggest to push HMM status down one slot in the
> > agenda to avoid having FS and MM first going into their own
> > room and then merging back for GUP and DAX, and re-splitting
> > after. More over HMM and NUMA talks will be good to have back
> > to back as they deal with same kind of thing mostly.
> 
> So while we are talking about schedule suggestions, we see that
> there's lots of empty slots in the FS track. We (xfs guys) were just
> chatting on #xfs about whether we'd have time to have a "XFS devel
> meeting" at some point during LSF/MM as we are rarely in the same
> place at the same time.
> 
> I'd like to propose that we compact the fs sessions so that we get a
> 3-slot session reserved for "Individual filesystem discussions" one
> afternoon. That way we've got time in the schedule for the all the
> ext4/btrfs/XFS/NFS/CIFS devs to get together with each other and
> talk about things of interest only to their own fileystems.
> 
> That means we all don't have to find time outside the schedule to do
> this, and think this wold be time very well spent for most fs people
> at the conf

Oh, can I get one more small slot for fs? I want to ask if there are
any people against having a callback every time a struct file is added
to a task_struct, and also having a secondary array so that special
files like device files can store something opaque per task_struct per
struct file.

I will try to stitch together a patchset for that tomorrow. A lot of
device drivers would like to have this.

Cheers,
Jérôme


lockdep splats with blk-mq drivers

2018-04-19 Thread Sebastian Ott
Since commit 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a
RCU and generation based scheme") I observe lockdep complaints (full
message attached):

[   21.763369]CPU0CPU1
[   21.763370]
[   21.763371]   lock(&rq->gstate_seq);
[   21.763374]local_irq_disable();
[   21.763375]lock(&(&dq->lock)->rlock);
[   21.763377]lock(&rq->gstate_seq);
[   21.763379]   <Interrupt>
[   21.763380] lock(&(&dq->lock)->rlock);
[   21.763382] 
*** DEADLOCK ***


This only happens in combination with DASD (s390 specific) and another
blk-mq driver (scsi, null_blk). The distinctive behavior of DASD is that
it calls blk_mq_start_request with irqs disabled.

This is a false positive since gstate_seq on CPU0 is from a different
request queue / block driver than gstate_seq on CPU1.

Possible fixes are to use local_irq_save/restore in blk_mq_start_request.
Or, since it's a false positive, teach lockdep that these are different
locks - like below:

-->8
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..d1a2c2fa015d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2040,7 +2040,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
return ret;
}
 
-   seqcount_init(&rq->gstate_seq);
+   __seqcount_init(&rq->gstate_seq, "&rq->gstate_seq", set->rq_seqcount_key);
u64_stats_init(&rq->aborted_gstate_sync);
return 0;
 }
@@ -2774,7 +2774,7 @@ static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
  * requested depth down, if if it too large. In that case, the set
  * value will be stored in set->queue_depth.
  */
-int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
+int __blk_mq_alloc_tag_set(struct blk_mq_tag_set *set, struct lock_class_key *key)
 {
int ret;
 
@@ -2825,6 +2825,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
if (!set->mq_map)
goto out_free_tags;
 
+   set->rq_seqcount_key = key;
ret = blk_mq_update_queue_map(set);
if (ret)
goto out_free_mq_map;
@@ -2846,7 +2847,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
set->tags = NULL;
return ret;
 }
-EXPORT_SYMBOL(blk_mq_alloc_tag_set);
+EXPORT_SYMBOL(__blk_mq_alloc_tag_set);
 
 void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3986f4b3461..1fb8bc1fdb07 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -85,6 +85,8 @@ struct blk_mq_tag_set {
 
struct mutextag_list_lock;
struct list_headtag_list;
+
+   struct lock_class_key   *rq_seqcount_key;
 };
 
 struct blk_mq_queue_data {
@@ -201,7 +203,15 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 int blk_mq_register_dev(struct device *, struct request_queue *);
 void blk_mq_unregister_dev(struct device *, struct request_queue *);
 
-int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set);
+int __blk_mq_alloc_tag_set(struct blk_mq_tag_set *set, struct lock_class_key *key);
+
+static inline int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
+{
+   static struct lock_class_key _seqcount_key;
+
+   return __blk_mq_alloc_tag_set(set, &_seqcount_key);
+}
+
 void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
 
 void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);

[   74.182295] WARNING: possible irq lock inversion dependency detected
[   74.182296] 4.16.0-11766-ge241e3f2bf97 #166 Not tainted
[   74.182297] 
[   74.182298] kworker/u64:7/175 just changed the state of lock:
[   74.182301] 1b8db3f6 (&rq->gstate_seq){+.+.}, at: scsi_queue_rq+0x460/0x578
[   74.182308] but this lock was taken by another, SOFTIRQ-safe lock in the past:
[   74.182310]  (&(&dq->lock)->rlock){..-.}
[   74.182311] 
   
   and interrupts could create inverse lock ordering between them.

[   74.182315] 
   other info that might help us debug this:
[   74.182316]  Possible interrupt unsafe locking scenario:

[   74.182318]CPU0CPU1
[   74.182319]
[   74.182320]   lock(&rq->gstate_seq);
[   74.182340]local_irq_disable();
[   74.182341]lock(&(&dq->lock)->rlock);
[   74.182343]lock(&rq->gstate_seq);
[   74.182345]   <Interrupt>
[   74.182346] lock(&(&dq->lock)->rlock);
[   74.182348] 
*** DEADLOCK ***

[   74.182350] 4 locks held by kworker/u64:7/175:
[   74.182351]  #0: 1751423f ((wq_completion)"%s"shost->work_q_name){+.+.}, at: process_one_work+0x1a0/0x6f0
[   74.182359]  #1: 1c880b65 ((work_completion)(&r

Re: [RFC PATCH 00/79] Generic page write protection and a solution to page waitqueue

2018-04-19 Thread Jan Kara
On Wed 18-04-18 11:54:30, Jerome Glisse wrote:
> > Overall I think you'd need to make a good benchmarking comparison showing
> > how much this helps some real workloads (your motivation) and also how
> > other loads on lower end machines are affected.
> 
> Do you have any specific benchmark you would like to see ? My list was:
>   https://github.com/01org/lkp-tests
>   https://github.com/gormanm/mmtests

So e.g. mmtests has a *lot* of different tests, so it's probably not
realistic for you to run them all. I'd look at bonnie++ (file & dir tests),
dbench, and reaim - these are crappy IO benchmarks because they mostly fit
into the page cache, but for your purposes that is exactly what you want
in order to see differences in CPU overhead :).

> > > --
> > > The What ?
> > > 
> > > Aim of this patch serie is to introduce generic page write protection
> > > for any kind of regular page in a process (private anonymous or back
> > > by regular file). This feature already exist, in one form, for private
> > > anonymous page, as part of KSM (Kernel Share Memory).
> > > 
> > > So this patch serie is two fold. First it factors out the page write
> > > protection of KSM into a generic write protection mechanim which KSM
> > > becomes the first user of. Then it add support for regular file back
> > > page memory (regular file or share memory aka shmem). To achieve this
> > > i need to cut the dependency lot of code have on page->mapping so i
> > > can set page->mapping to point to special structure when write
> > > protected.
> > 
> > So I'm interested in this write protection mechanism but I didn't find much
> > about it in the series. How does it work? I can see KSM writeprotects pages
> > in page tables so that works for userspace mappings but what about
> > in-kernel users modifying pages - e.g. pages in page cache carrying
> > filesystem metadata do get modified a lot like this.
> 
> So i only care about page which are mmaped into a process address space.
> At first i only want to intercept CPU write access through mmap of file
> but i also intend to extend write syscall to also "fault" on the write
> protection ie it will call a callback to unprotect the page allowing the
> write protector to take proper action while write syscall is happening.
> 
> I am affraid truely generic write protection for metadata pages is bit
> out of scope of what i am doing. However the mechanism i am proposing
> can be extended for that too. Issue is that all place that want to write
> to those page need to be converted to something where write happens
> between write_begin and write_end section (mmap and CPU pte does give
> this implicitly through page fault, so does write syscall). Basicly
> there is a need to make sure that write and write protection can be
> ordered against one another without complex locking.

I understand metadata pages are not interesting for your use case. However,
from the mm point of view these are page cache pages like any other. So maybe
my question should have been: how do we make sure this mechanism will not be
used for pages for which it cannot work?

> > > A write protected page has page->mapping pointing to a structure like
> > > struct rmap_item for KSM. So this structure has a list for each unique
> > > combination:
> > > struct write_protect {
> > > struct list_head *mappings; /* write_protect_mapping list */
> > > ...
> > > };
> > > 
> > > struct write_protect_mapping {
> > > struct list_head list
> > > struct address_space *mapping;
> > > unsigned long offset;
> > > unsigned long private;
> > > ...
> > > };
> > 
> > Auch, the fact that we could share a page as data storage for several
> > inode+offset combinations that are not sharing underlying storage just
> > looks viciously twisted ;) But is it really that useful to warrant
> > complications? In particular I'm afraid that filesystems expect consistency
> > between their internal state (attached to page->private) and page state
> > (e.g. page->flags) and when there are multiple internal states attached to
> > the same page this could go easily wrong...
> 
> So at first i want to limit to write protect (not KSM) thus page->flags
> will stay consistent (ie page is only ever associated with a single
> mapping). For KSM yes the page->flags can be problematic, however here
> we can assume that page is clean (and uptodate) and not under write
> back. So problematic flags for KSM:
>   - private (page_has_buffers() or PagePrivate (nfs, metadata, ...))
>   - private_2 (FsCache)
>   - mappedtodisk
>   - swapcache
>   - error
> 
> Idea again would be to PageFlagsWithMapping(page, mapping) so that for
> non KSM write protected page you test the usual page->flags and for
> write protected page you find the flag value using mapping as lookup
> index. Usualy those flag are seldomly changed/accessed. Again the
> overhead (ignoring code size) wou

Re: usercopy whitelist woe in scsi_sense_cache

2018-04-19 Thread Paolo Valente


> On 18 Apr 2018, at 16:30, Jens Axboe  wrote:
> 
> On 4/18/18 3:08 AM, Paolo Valente wrote:
>> 
>> 
>>> On 18 Apr 2018, at 00:57, Jens Axboe  wrote:
>>> 
>>> On 4/17/18 3:48 PM, Jens Axboe wrote:
 On 4/17/18 3:47 PM, Kees Cook wrote:
> On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe  wrote:
>> On 4/17/18 3:25 PM, Kees Cook wrote:
>>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook  
>>> wrote:
 I see elv.priv[1] assignments made in a few places -- is it possible
 there is some kind of uninitialized-but-not-NULL state that can leak
 in there?
>>> 
>>> Got it. This fixes it for me:
>>> 
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 0dc9e341c2a7..859df3160303 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>>> 
>>>   rq = blk_mq_rq_ctx_init(data, tag, op);
>>>   if (!op_is_flush(op)) {
>>> -   rq->elv.icq = NULL;
>>> +   memset(&rq->elv, 0, sizeof(rq->elv));
>>>   if (e && e->type->ops.mq.prepare_request) {
>>>   if (e->type->icq_cache && rq_ioc(bio))
>>>   blk_mq_sched_assign_ioc(rq, bio);
>>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq)
>>>   e->type->ops.mq.finish_request(rq);
>>>   if (rq->elv.icq) {
>>>   put_io_context(rq->elv.icq->ioc);
>>> -   rq->elv.icq = NULL;
>>> +   memset(&rq->elv, 0, sizeof(rq->elv));
>>>   }
>>>   }
>> 
>> This looks like a BFQ problem, this should not be necessary. Paolo,
>> you're calling your own prepare request handler from the insert
>> as well, and your prepare request does nothing if rq->elv.icq == NULL.
> 
> I sent the patch anyway, since it's kind of a robustness improvement,
> I'd hope. If you fix BFQ also, please add:
 
 It's also a memset() in the hot path, would prefer to avoid that...
 The issue here is really the convoluted bfq usage of insert/prepare,
 I'm sure Paolo can take it from here.
>>> 
>> 
>> Hi,
>> I'm very sorry for tuning in very late, but, at the same time, very
>> glad to find the problem probably already solved ;) (in this respect, I 
>> swear,
>> my delay was not intentional)
>> 
>>> Does this fix it?
>>> 
>>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>>> index f0ecd98509d8..d883469a1582 100644
>>> --- a/block/bfq-iosched.c
>>> +++ b/block/bfq-iosched.c
>>> @@ -4934,8 +4934,11 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio)
>>> bool new_queue = false;
>>> bool bfqq_already_existing = false, split = false;
>>> 
>>> -   if (!rq->elv.icq)
>>> +   if (!rq->elv.icq) {
>>> +   rq->elv.priv[0] = rq->elv.priv[1] = NULL;
>>> return;
>>> +   }
>>> +
>> 
>> This does solve the problem at hand.  But it also arouses a question,
>> related to a possible subtle bug.
>> 
>> For BFQ, !rq->elv.icq basically means "this request is not for me, as
>> I am an icq-based scheduler".  But, IIUC the main points in this
>> thread, then this assumption is false.  If it is actually false, then
>> I hope that all requests with !rq->elv.icq that are sent to BFQ do
>> verify the condition (at_head || blk_rq_is_passthrough(rq)).  In fact,
>> requests that do not verify that condition are those that BFQ must put
>> in a bfq_queue.  So, even if this patch makes the crash disappear, we
>> drive BFQ completely crazy (and we may expect other strange failures)
>> if we send BFQ a request with !((at_head || blk_rq_is_passthrough(rq))
>> and !rq->elv.icq.  BFQ has to put that rq into a bfq_queue, but simply
>> cannot.
>> 
>> Jens, or any other, could you please shed a light on this, and explain
>> how things are exactly?
> 

First, thanks for summing up the problem.

> Your assumption is correct, however you set ->priv[0] and ->priv[1] for
> requests, but only for ->elv.icq != NULL. So let's assume you get a
> request and assign those two, request completes. Later on, you get
> the same request, bypass insert it. BFQ doesn't clear the bic/bfqq
> pointers in the request, since ->elv.icq == NULL.

I'm missing something here.  When the request gets completed in the
first place, the hook bfq_finish_requeue_request gets called, and that
hook clears both ->elv.priv elements (as the request has a non-null
elv.icq).  So, when bfq gets the same request again, those elements
must be NULL.  What am I getting wrong?

I have some more concerns on this point, but I'll stick to this for the
moment, so as not to create more confusion.

Thanks,
Paolo

> It gets inserted
> into the dispatch list.
> 
> Then when __bfq_dispatch_request() is called, you do:
> 
> bfqq = RQ_BFQQ(rq);
> if (bfqq

Re: [PATCH] block: Avoid executing a report or reset zones while a queue is frozen

2018-04-19 Thread h...@lst.de
On Tue, Apr 17, 2018 at 05:35:07PM +, Bart Van Assche wrote:
> > Hmm.  I think we need to avoid clearing that data and update it using
> > RCU instead.  Calling blk_queue_enter before submitting bios is
> > something that would make zone reporting very different from any
> > other block layer user.
> 
> Hello Christoph,
> 
> Please have a look at generic_make_request(). I think that function shows
> that blk_queue_enter() is called anyway before .make_request_fn() is called.

Yes, that is the point.  We already call blk_queue_enter in
generic_make_request, which the zone report and reset ops pass through.