[PATCH] block: Make blk_dequeue_request() static

2017-08-28 Thread Damien Le Moal
blk_dequeue_request() is only called by blk_start_request() in the same
file, so declare it static. Fix the blk_start_request() description accordingly.

Signed-off-by: Damien Le Moal 
---
 block/blk-core.c | 5 +----
 block/blk.h  | 1 -
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc1af9097dff..d709c0e3a2ac 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2615,7 +2615,7 @@ struct request *blk_peek_request(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_peek_request);
 
-void blk_dequeue_request(struct request *rq)
+static void blk_dequeue_request(struct request *rq)
 {
struct request_queue *q = rq->q;
 
@@ -2642,9 +2642,6 @@ void blk_dequeue_request(struct request *rq)
  * Description:
  * Dequeue @req and start timeout timer on it.  This hands off the
  * request to the driver.
- *
- * Block internal functions which don't want to start timer should
- * call blk_dequeue_request().
  */
 void blk_start_request(struct request *req)
 {
diff --git a/block/blk.h b/block/blk.h
index fde8b351c166..fcb9775b997d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -64,7 +64,6 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
 void blk_queue_bypass_end(struct request_queue *q);
-void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
-- 
2.13.5



Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-28 Thread Guan Junxiong
Hi, Christoph

On 2017/8/28 22:30, Christoph Hellwig wrote:
> On Mon, Aug 28, 2017 at 08:41:23PM +0800, Guan Junxiong wrote:
>> If a namespace can be accessed by another subsystem,  the above line
>> will ignore such namespace.
> And that's intentional.
>

As for the __nvme_find_ns_head function, can it look up the namespace
globally rather than only within the current subsystem? Take a hypermetro
scenario, for example: two namespaces that should be viewed as the same
namespace by the database application, but which live in two different
cities. Some vendors may assign those two namespaces the same UUID.

In addition, could you add a switch to select between finding namespaces
at the subsystem level and finding them globally?

>> Or does the NVMe/NVMf specification constrain that any namespace
>> can only be accessed by a subsystem?
> Yes, inside the NVMe spec a Namespace is contained inside a Subsystem.
> That doesn't preclude other ways to access the LBAs, but they are outside
> the specification (e.g. also exporting them as SCSI).

Can a namespace be shared between two subsystems?  If not, then for the
hypermetro scenario we need to keep more information of the storage array
in city A synchronized with the one in city B.




Re: I/O hangs after resuming from suspend-to-ram

2017-08-28 Thread Ming Lei
On Mon, Aug 28, 2017 at 08:22:26PM +0200, Oleksandr Natalenko wrote:
> Hi.
> 
> On pondělí 28. srpna 2017 14:58:28 CEST Ming Lei wrote:
> > Could you verify if the following patch fixes your issue?
> > …SNIP…
> 
> I've applied it to v4.12.9 and rechecked — the issue is still there, 
> unfortunately. Stacktrace is the same as before.
> 
> Were you able to reproduce it in a VM?

Yes, I can.

> 
> Should I re-check it with v4.13-rc7?
> 
> Any other suggestions?

Please test it with v4.13-rc7 first.

-- 
Ming


Re: [PATCH 2/6] nbd: make device_attribute const

2017-08-28 Thread Jens Axboe
On 08/21/2017 05:43 AM, Bhumika Goyal wrote:
> Make this const as it is only passed as an argument to the functions
> device_create_file and device_remove_file, and the corresponding
> arguments are of type const.
> Done using Coccinelle.

Added for 4.14, thanks.

-- 
Jens Axboe



Re: [PATCH] block/nullb: delete unnecessary memory free

2017-08-28 Thread Jens Axboe
On 08/28/2017 03:04 PM, Shaohua Li wrote:
> On Mon, Aug 28, 2017 at 02:55:59PM -0600, Jens Axboe wrote:
>> On 08/28/2017 02:49 PM, Shaohua Li wrote:
>>> Commit 2984c86(nullb: factor disk parameters) has a typo. The
>>> nullb_device allocation/free is done outside of null_add_dev. The commit
>>> accidentally frees the nullb_device in error code path.
>>
>> Is the code in nullb_device_power_store() correct after this?
> 
> Yes, for runtime disk configuration, we allocate the device in
> nullb_group_make_item and free the device in nullb_device_release.
> nullb_device_power_store only does null_add_dev/null_del_dev; it does
> not allocate or free the dev.

OK good, just wanted to confirm.

>> Also, looks like the second else if in there should be using
>> 'dev' instead of to_nullb_device(item)->power.
> 
> Ah, yes, it should be 'dev->power'. dev == to_nullb_device(item). Do you want
> me to send a separate patch to do the cleanup?

No worries, I'll just do it myself.

-- 
Jens Axboe



Re: [PATCH RFC] Block/blk-wbt: do not let background writes block sync writes

2017-08-28 Thread Jens Axboe
On 08/28/2017 11:45 AM, Liu Bo wrote:
> On Sun, Aug 27, 2017 at 11:14:20AM -0600, Jens Axboe wrote:
>> On 08/25/2017 06:14 PM, Liu Bo wrote:
>>> While using blk-wbt, sometimes sync writes are blocked by background
>>> writes, for example,
>>>
>>> a) a background write reaches the (background) limit returned by
>>> get_limit(), so it's added into the rqw->wait (non kswapd's rqw in
>>> this case) and goes to sleep.
>>>
>>> b) then a sync write gets queued and goes to sleep when finding that
>>> waitqueue_active() returns true and someone else is already on the
>>> waiting list.
>>>
>>> Thus, the sync write will get its rq after the background write
>>> getting rq.
>>>
>>> With this, only background writes will check waitqueue's status and
>>> sync writes will only be throttled by the (max) limit returned by
>>> get_limit().
>>
>> Curious how you ran into this. Is this just speculation, or is it
>> observed behavior?
>>
> 
> I got that from a simple test,
> 
> xfs_io -f -c "falloc 0 4K" /mnt/btrfs/foobar
> xfs_io -f -c "falloc 0 5G" /mnt/btrfs/dummy
> xfs_io -f -c "pwrite -b 128M 0 5G" /mnt/btrfs/dummy &
> 
> off=0
> while true; do
> date
>   let $((off = off + 1))
> xfs_io -s -c "pwrite $off 1"  /mnt/btrfs/foobar > /dev/null
> sleep 1
> done
> 
> ---
> 
> So a lot of writeback is happening in background, and a write with
> O_SYNC is issued every 1 sec.
> 
> My ftrace observation shows that, write with O_SYNC is waiting in
> may_queue() because of the running writeback writes.
> 
>kworker/u16:2-137   [003]   3104.193996: block_bio_queue: 8,64 W 
> 8751312 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194003: may_queue.part.12: inflight 
> 23 limit 24 wait c97cb500
>kworker/u16:2-137   [003]   3104.194003: block_getrq: 8,64 W 8751312 + 
> 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194004: blk_mq_flush_plug_list 
> <-blk_flush_plug_list
>kworker/u16:2-137   [003]   3104.194004: block_unplug: [kworker/u16:2] 
> 1
>kworker/u16:2-137   [003]   3104.194005: block_rq_insert: 8,64 W 
> 1048576 () 8749264 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194006: block_rq_issue: 8,64 W 
> 1048576 () 8749264 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194017: block_plug: [kworker/u16:2]
>kworker/u16:2-137   [003] d.h.  3104.194454: block_rq_complete: 8,64 W () 
> 8704208 + 2048 [0]
>kworker/u16:2-137   [003]   3104.194978: block_bio_queue: 8,64 W 
> 8753360 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194985: may_queue.part.12: inflight 
> 23 limit 24 wait c97cb500
>kworker/u16:2-137   [003]   3104.194985: block_getrq: 8,64 W 8753360 + 
> 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194986: blk_mq_flush_plug_list 
> <-blk_flush_plug_list
>kworker/u16:2-137   [003]   3104.194987: block_unplug: [kworker/u16:2] 
> 1
>kworker/u16:2-137   [003]   3104.194987: block_rq_insert: 8,64 W 
> 1048576 () 8751312 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194990: block_rq_issue: 8,64 W 
> 1048576 () 8751312 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.195001: block_plug: [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.195654: block_bio_queue: 8,64 W 
> 8755408 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.195661: may_queue.part.12: inflight 
> 24 limit 24 wait c97cb500
>kworker/u16:2-137   [003]   3104.195662: may_queue.part.12: inflight 
> 24 limit 24 wait c97cb500
>
> ^^^
>  (wb write 
> puts itself on rqw->wait when reaching the limit)
>  
>kworker/u16:2-137   [003]   3104.195662: io_schedule <-wbt_wait
>kworker/u16:2-137   [003]   3104.195662: blk_mq_flush_plug_list 
> <-blk_flush_plug_list
>kworker/u16:2-137   [003]   3104.195663: block_unplug: [kworker/u16:2] 
> 1
>kworker/u16:2-137   [003]   3104.195663: block_rq_insert: 8,64 W 
> 1048576 () 8753360 + 2048 [kworker/u16:2]
> kworker/3:1H-162   [003]   3104.195680: block_rq_issue: 8,64 W 
> 1048576 () 8753360 + 2048 [kworker/3:1H]
>   -0 [003] d.h.  3104.200533: block_rq_complete: 8,64 W () 
> 8706256 + 2048 [0]
>   -0 [003] d.h.  3104.205473: block_rq_complete: 8,64 W () 
> 8708304 + 2048 [0]
>   -0 [003] d.h.  3104.211451: block_rq_complete: 8,64 W () 
> 8710352 + 2048 [0]
>   -0 [003] d.h.  3104.216163: block_rq_complete: 8,64 W () 
> 8712400 + 2048 [0]
>   -0 [003] d.h.  3104.222072: block_rq_complete: 8,64 W () 
> 8714448 + 2048 [0]
>   -0 [003] d.h.  3104.226931: block_rq_complete: 8,64 W () 
> 8716496 + 2048 [0]
>   -0 [003] d.h.  

Re: [PATCH 1/6] ACPI: make device_attribute const

2017-08-28 Thread Rafael J. Wysocki
On Monday, August 21, 2017 1:43:07 PM CEST Bhumika Goyal wrote:
> Make these const as they are only passed as arguments to the functions
> device_create_file and device_remove_file, and the corresponding
> arguments are of type const.
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal 
> ---
>  drivers/acpi/battery.c | 2 +-
>  drivers/acpi/sbs.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
> index 1cbb88d..13e7b56 100644
> --- a/drivers/acpi/battery.c
> +++ b/drivers/acpi/battery.c
> @@ -620,7 +620,7 @@ static ssize_t acpi_battery_alarm_store(struct device 
> *dev,
>   return count;
>  }
>  
> -static struct device_attribute alarm_attr = {
> +static const struct device_attribute alarm_attr = {
>   .attr = {.name = "alarm", .mode = 0644},
>   .show = acpi_battery_alarm_show,
>   .store = acpi_battery_alarm_store,
> diff --git a/drivers/acpi/sbs.c b/drivers/acpi/sbs.c
> index a184637..a2428e9 100644
> --- a/drivers/acpi/sbs.c
> +++ b/drivers/acpi/sbs.c
> @@ -474,7 +474,7 @@ static ssize_t acpi_battery_alarm_store(struct device 
> *dev,
>   return count;
>  }
>  
> -static struct device_attribute alarm_attr = {
> +static const struct device_attribute alarm_attr = {
>   .attr = {.name = "alarm", .mode = 0644},
>   .show = acpi_battery_alarm_show,
>   .store = acpi_battery_alarm_store,
> 

Applied, thanks!




Re: [PATCH] block/nullb: delete unnecessary memory free

2017-08-28 Thread Shaohua Li
On Mon, Aug 28, 2017 at 02:55:59PM -0600, Jens Axboe wrote:
> On 08/28/2017 02:49 PM, Shaohua Li wrote:
> > Commit 2984c86(nullb: factor disk parameters) has a typo. The
> > nullb_device allocation/free is done outside of null_add_dev. The commit
> > accidentally frees the nullb_device in error code path.
> 
> Is the code in nullb_device_power_store() correct after this?

Yes, for runtime disk configuration, we allocate the device in
nullb_group_make_item and free the device in nullb_device_release.
nullb_device_power_store only does null_add_dev/null_del_dev; it does
not allocate or free the dev.

> Also, looks like the second else if in there should be using
> 'dev' instead of to_nullb_device(item)->power.

Ah, yes, it should be 'dev->power'. dev == to_nullb_device(item). Do you want
me to send a separate patch to do the cleanup?

Thanks,
Shaohua


Re: [PATCH] block/nullb: delete unnecessary memory free

2017-08-28 Thread Jens Axboe
On 08/28/2017 02:49 PM, Shaohua Li wrote:
> Commit 2984c86(nullb: factor disk parameters) has a typo. The
> nullb_device allocation/free is done outside of null_add_dev. The commit
> accidentally frees the nullb_device in error code path.

Is the code in nullb_device_power_store() correct after this?
Also, looks like the second else if in there should be using
'dev' instead of to_nullb_device(item)->power.

-- 
Jens Axboe



[PATCH] block/nullb: delete unnecessary memory free

2017-08-28 Thread Shaohua Li
Commit 2984c86(nullb: factor disk parameters) has a typo. The
nullb_device allocation/free is done outside of null_add_dev. The commit
accidentally frees the nullb_device in error code path.

Reported-by: Dan Carpenter 
Signed-off-by: Shaohua Li 
---
 drivers/block/null_blk.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 4d328e3..18f2cb3 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1909,7 +1909,6 @@ static int null_add_dev(struct nullb_device *dev)
 out_free_nullb:
kfree(nullb);
 out:
-   null_free_dev(dev);
return rv;
 }
 
-- 
2.9.5



Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-28 Thread J Freyensee
On Mon, 2017-08-28 at 09:51 +0300, Sagi Grimberg wrote:
> 
> On 23/08/17 20:58, Christoph Hellwig wrote:
> > Introduce a new struct nvme_ns_head [1] that holds information about
> > an actual namespace, unlike struct nvme_ns, which only holds the
> > per-controller namespace information.  For private namespaces there
> > is a 1:1 relation of the two, but for shared namespaces this lets us
> > discover all the paths to it.  For now only the identifiers are moved
> > to the new structure, but most of the information in struct nvme_ns
> > should eventually move over.
> > 
> > To allow lockless path lookup the list of nvme_ns structures per
> > nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> > structure through call_srcu.
> 
> I haven't read the later patches yet, but what requires sleep in the
> path selection?
> 
> > 
> > [1] comments welcome if you have a better name for it, the current one
> > is
> >  horrible.  One idea would be to rename the current struct nvme_ns
> >  to struct nvme_ns_link or similar and use the nvme_ns name for the
> >  new structure.  But that would involve a lot of churn.
> 
> maybe nvme_ns_primary?

Since it looks like it holds all unique identifier values and should hold
other namespace characteristics later, maybe:

nvme_ns_item?
Or nvme_ns_entry?
Or nvme_ns_element?
Or nvme_ns_unit?
Or nvme_ns_entity?
Or nvme_ns_container?

> > +/*
> > + * Anchor structure for namespaces.  There is one for each namespace in a
> > + * NVMe subsystem that any of our controllers can see, and the namespace
> > + * structure for each controller is chained off it.  For private namespaces
> > + * there is a 1:1 relation to our namespace structures, that is ->list
> > + * only ever has a single entry for private namespaces.
> > + */
> > +struct nvme_ns_head {
> > +   struct list_headlist;
> 
> Maybe siblings is a better name than list,
> and the nvme_ns list_head should be called
> sibling_entry (or just sibling)?


I think that sounds good too.


> 
> > +   struct srcu_struct  srcu;
> > +   unsignedns_id;
> > +   u8  eui64[8];
> > +   u8  nguid[16];
> > +   uuid_t  uuid;
> > +   struct list_headentry;
> > +   struct kref ref;
> > +};
> > +
> >   struct nvme_ns {
> >     struct list_head list;
> >   
> >     struct nvme_ctrl *ctrl;
> >     struct request_queue *queue;
> >     struct gendisk *disk;
> > +   struct list_head siblings;
> >     struct nvm_dev *ndev;
> >     struct kref kref;
> > +   struct nvme_ns_head *head;
> >     int instance;
> >   
> > -   u8 eui[8];
> > -   u8 nguid[16];
> > -   uuid_t uuid;
> > -
> > -   unsigned ns_id;
> >     int lba_shift;
> >     u16 ms;
> >     u16 sgs;
> > 
> 
> 


Re: [GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Sagi Grimberg

Meh, I don't think that'll work out - the rdma refactoring from
Sagi deeply depends on the uninit_ctrl split..


I can do a for-4.14/block-postmerge branch, which is for-4.14/block
with -rc7 pulled in. Should apply on top of that.

Did that branch, doesn't apply. But should be easier for you to
rebase on top of that branch. Pushed it out.


I've pushed out a nvme-4.14-rebase branch with the rebase, but I didn't
have time for non-trivial testing yet, that'll have to wait for
tomorrow.

Sagi: can you double check the resolutions around blk_mq_reinit_tagset?


Thanks for doing it Christoph,

It looks good to me, there is one glitch in patch:
"nvme-rdma: reuse configure/destroy_admin_queue"

The "nvme_rdma_configure_admin_queue" block needs to
be squashed into "nvme-rdma: don't free tagset on resets".

I'll do that now.


Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-28 Thread Keith Busch
Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate

2017-08-28 Thread Keith Busch
Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 04/10] nvme: remove nvme_revalidate_ns

2017-08-28 Thread Keith Busch
Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH RFC] Block/blk-wbt: do not let background writes block sync writes

2017-08-28 Thread Liu Bo
On Mon, Aug 28, 2017 at 11:45:31AM -0600, Liu Bo wrote:
> On Sun, Aug 27, 2017 at 11:14:20AM -0600, Jens Axboe wrote:
> > On 08/25/2017 06:14 PM, Liu Bo wrote:
> > > While using blk-wbt, sometimes sync writes are blocked by background
> > > writes, for example,
> > > 
> > > a) a background write reaches the (background) limit returned by
> > > get_limit(), so it's added into the rqw->wait (non kswapd's rqw in
> > > this case) and goes to sleep.
> > > 
> > > b) then a sync write gets queued and goes to sleep when finding that
> > > waitqueue_active() returns true and someone else is already on the
> > > waiting list.
> > > 
> > > Thus, the sync write will get its rq after the background write
> > > getting rq.
> > > 
> > > With this, only background writes will check waitqueue's status and
> > > sync writes will only be throttled by the (max) limit returned by
> > > get_limit().
> > 
> > Curious how you ran into this. Is this just speculation, or is it
> > observed behavior?
> >
> 
> I got that from a simple test,
> 
> xfs_io -f -c "falloc 0 4K" /mnt/btrfs/foobar
> xfs_io -f -c "falloc 0 5G" /mnt/btrfs/dummy
> xfs_io -f -c "pwrite -b 128M 0 5G" /mnt/btrfs/dummy &
> 
> off=0
> while true; do
> date
>   let $((off = off + 1))
> xfs_io -s -c "pwrite $off 1"  /mnt/btrfs/foobar > /dev/null
> sleep 1
> done
> 
> ---
> 
> So a lot of writeback is happening in background, and a write with
> O_SYNC is issued every 1 sec.
> 
> My ftrace observation shows that, write with O_SYNC is waiting in
> may_queue() because of the running writeback writes.
> 
>kworker/u16:2-137   [003]   3104.193996: block_bio_queue: 8,64 W 
> 8751312 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194003: may_queue.part.12: inflight 
> 23 limit 24 wait c97cb500
>kworker/u16:2-137   [003]   3104.194003: block_getrq: 8,64 W 8751312 + 
> 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194004: blk_mq_flush_plug_list 
> <-blk_flush_plug_list
>kworker/u16:2-137   [003]   3104.194004: block_unplug: [kworker/u16:2] 
> 1
>kworker/u16:2-137   [003]   3104.194005: block_rq_insert: 8,64 W 
> 1048576 () 8749264 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194006: block_rq_issue: 8,64 W 
> 1048576 () 8749264 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194017: block_plug: [kworker/u16:2]
>kworker/u16:2-137   [003] d.h.  3104.194454: block_rq_complete: 8,64 W () 
> 8704208 + 2048 [0]
>kworker/u16:2-137   [003]   3104.194978: block_bio_queue: 8,64 W 
> 8753360 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194985: may_queue.part.12: inflight 
> 23 limit 24 wait c97cb500
>kworker/u16:2-137   [003]   3104.194985: block_getrq: 8,64 W 8753360 + 
> 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194986: blk_mq_flush_plug_list 
> <-blk_flush_plug_list
>kworker/u16:2-137   [003]   3104.194987: block_unplug: [kworker/u16:2] 
> 1
>kworker/u16:2-137   [003]   3104.194987: block_rq_insert: 8,64 W 
> 1048576 () 8751312 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.194990: block_rq_issue: 8,64 W 
> 1048576 () 8751312 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.195001: block_plug: [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.195654: block_bio_queue: 8,64 W 
> 8755408 + 2048 [kworker/u16:2]
>kworker/u16:2-137   [003]   3104.195661: may_queue.part.12: inflight 
> 24 limit 24 wait c97cb500
>kworker/u16:2-137   [003]   3104.195662: may_queue.part.12: inflight 
> 24 limit 24 wait c97cb500
>
> ^^^
>  (wb write 
> puts itself on rqw->wait when reaching the limit)
>  
>kworker/u16:2-137   [003]   3104.195662: io_schedule <-wbt_wait
>kworker/u16:2-137   [003]   3104.195662: blk_mq_flush_plug_list 
> <-blk_flush_plug_list
>kworker/u16:2-137   [003]   3104.195663: block_unplug: [kworker/u16:2] 
> 1
>kworker/u16:2-137   [003]   3104.195663: block_rq_insert: 8,64 W 
> 1048576 () 8753360 + 2048 [kworker/u16:2]
> kworker/3:1H-162   [003]   3104.195680: block_rq_issue: 8,64 W 
> 1048576 () 8753360 + 2048 [kworker/3:1H]
>   -0 [003] d.h.  3104.200533: block_rq_complete: 8,64 W () 
> 8706256 + 2048 [0]
>   -0 [003] d.h.  3104.205473: block_rq_complete: 8,64 W () 
> 8708304 + 2048 [0]
>   -0 [003] d.h.  3104.211451: block_rq_complete: 8,64 W () 
> 8710352 + 2048 [0]
>   -0 [003] d.h.  3104.216163: block_rq_complete: 8,64 W () 
> 8712400 + 2048 [0]
>   -0 [003] d.h.  3104.222072: block_rq_complete: 8,64 W () 
> 8714448 + 2048 [0]
>   -0 [003] d.h.  3104.226931: block_rq_complete: 

Re: [PATCH 03/10] nvme: remove unused struct nvme_ns fields

2017-08-28 Thread Keith Busch
Looks good.

Reviewed-by: Keith Busch 


Re: [GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Jens Axboe
On 08/28/2017 12:40 PM, Christoph Hellwig wrote:
> Meh, I don't think that'll work out - the rdma refactoring from
> Sagi deeply depends on the uninit_ctrl split..

I can do a for-4.14/block-postmerge branch, which is for-4.14/block
with -rc7 pulled in. Should apply on top of that.

Did that branch, doesn't apply. But should be easier for you to
rebase on top of that branch. Pushed it out.

-- 
Jens Axboe



[bug report] nullb: add interface to power on disk

2017-08-28 Thread Dan Carpenter
Hello Shaohua Li,

The patch cedcafad8277: "nullb: add interface to power on disk" from
Aug 14, 2017, leads to the following static checker warning:

drivers/block/null_blk.c:372 nullb_device_power_store()
warn: 'dev' was already freed.

drivers/block/null_blk.c:1993 null_init()
error: double free of 'dev'

drivers/block/null_blk.c
   356  
   357  static ssize_t nullb_device_power_store(struct config_item *item,
   358   const char *page, size_t count)
   359  {
   360  struct nullb_device *dev = to_nullb_device(item);
   361  bool newp = false;
   362  ssize_t ret;
   363  
   364  ret = nullb_device_bool_attr_store(&newp, page, count);
   365  if (ret < 0)
   366  return ret;
   367  
   368  if (!dev->power && newp) {
   369  if (test_and_set_bit(NULLB_DEV_FL_UP, &dev->flags))
   370  return count;
   371  if (null_add_dev(dev)) {
^
null_add_dev() frees dev on failure
   372  clear_bit(NULLB_DEV_FL_UP, &dev->flags);
^^
so this is a use after free.  The other time null_add_dev() is called
is buggy as well.  I feel like it shouldn't be freeing "dev" because
that's a layering violation.

   373  return -ENOMEM;
   374  }
   375  
   376  set_bit(NULLB_DEV_FL_CONFIGURED, &dev->flags);
   377  dev->power = newp;
   378  } else if (to_nullb_device(item)->power && !newp) {
   379  mutex_lock(&lock);
   380  dev->power = newp;
   381  null_del_dev(dev->nullb);
   382  mutex_unlock(&lock);
   383  clear_bit(NULLB_DEV_FL_UP, &dev->flags);
   384  }
   385  
   386  return count;
   387  }

regards,
dan carpenter


Re: [PATCH RFC] Block/blk-wbt: do not let background writes block sync writes

2017-08-28 Thread Liu Bo
On Sun, Aug 27, 2017 at 11:14:20AM -0600, Jens Axboe wrote:
> On 08/25/2017 06:14 PM, Liu Bo wrote:
> > While using blk-wbt, sometimes sync writes are blocked by background
> > writes, for example,
> > 
> > a) a background write reaches the (background) limit returned by
> > get_limit(), so it's added into the rqw->wait (non kswapd's rqw in
> > this case) and goes to sleep.
> > 
> > b) then a sync write gets queued and goes to sleep when finding that
> > waitqueue_active() returns true and someone else is already on the
> > waiting list.
> > 
> > Thus, the sync write will get its rq after the background write
> > getting rq.
> > 
> > With this, only background writes will check waitqueue's status and
> > sync writes will only be throttled by the (max) limit returned by
> > get_limit().
> 
> Curious how you ran into this. Is this just speculation, or is it
> observed behavior?
>

I got that from a simple test,

xfs_io -f -c "falloc 0 4K" /mnt/btrfs/foobar
xfs_io -f -c "falloc 0 5G" /mnt/btrfs/dummy
xfs_io -f -c "pwrite -b 128M 0 5G" /mnt/btrfs/dummy &

off=0
while true; do
date
let $((off = off + 1))
xfs_io -s -c "pwrite $off 1"  /mnt/btrfs/foobar > /dev/null
sleep 1
done

---

So a lot of writeback is happening in background, and a write with
O_SYNC is issued every 1 sec.

My ftrace observation shows that, write with O_SYNC is waiting in
may_queue() because of the running writeback writes.

   kworker/u16:2-137   [003]   3104.193996: block_bio_queue: 8,64 W 8751312 + 2048 [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.194003: may_queue.part.12: inflight 23 limit 24 wait c97cb500
   kworker/u16:2-137   [003]   3104.194003: block_getrq: 8,64 W 8751312 + 2048 [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.194004: blk_mq_flush_plug_list <-blk_flush_plug_list
   kworker/u16:2-137   [003]   3104.194004: block_unplug: [kworker/u16:2] 1
   kworker/u16:2-137   [003]   3104.194005: block_rq_insert: 8,64 W 1048576 () 8749264 + 2048 [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.194006: block_rq_issue: 8,64 W 1048576 () 8749264 + 2048 [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.194017: block_plug: [kworker/u16:2]
   kworker/u16:2-137   [003] d.h.  3104.194454: block_rq_complete: 8,64 W () 8704208 + 2048 [0]
   kworker/u16:2-137   [003]   3104.194978: block_bio_queue: 8,64 W 8753360 + 2048 [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.194985: may_queue.part.12: inflight 23 limit 24 wait c97cb500
   kworker/u16:2-137   [003]   3104.194985: block_getrq: 8,64 W 8753360 + 2048 [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.194986: blk_mq_flush_plug_list <-blk_flush_plug_list
   kworker/u16:2-137   [003]   3104.194987: block_unplug: [kworker/u16:2] 1
   kworker/u16:2-137   [003]   3104.194987: block_rq_insert: 8,64 W 1048576 () 8751312 + 2048 [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.194990: block_rq_issue: 8,64 W 1048576 () 8751312 + 2048 [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.195001: block_plug: [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.195654: block_bio_queue: 8,64 W 8755408 + 2048 [kworker/u16:2]
   kworker/u16:2-137   [003]   3104.195661: may_queue.part.12: inflight 24 limit 24 wait c97cb500
   kworker/u16:2-137   [003]   3104.195662: may_queue.part.12: inflight 24 limit 24 wait c97cb500
   ^^^ (wb write puts itself on rqw->wait when reaching the limit)
   kworker/u16:2-137   [003]   3104.195662: io_schedule <-wbt_wait
   kworker/u16:2-137   [003]   3104.195662: blk_mq_flush_plug_list <-blk_flush_plug_list
   kworker/u16:2-137   [003]   3104.195663: block_unplug: [kworker/u16:2] 1
   kworker/u16:2-137   [003]   3104.195663: block_rq_insert: 8,64 W 1048576 () 8753360 + 2048 [kworker/u16:2]
    kworker/3:1H-162   [003]   3104.195680: block_rq_issue: 8,64 W 1048576 () 8753360 + 2048 [kworker/3:1H]
          <idle>-0     [003] d.h.  3104.200533: block_rq_complete: 8,64 W () 8706256 + 2048 [0]
          <idle>-0     [003] d.h.  3104.205473: block_rq_complete: 8,64 W () 8708304 + 2048 [0]
          <idle>-0     [003] d.h.  3104.211451: block_rq_complete: 8,64 W () 8710352 + 2048 [0]
          <idle>-0     [003] d.h.  3104.216163: block_rq_complete: 8,64 W () 8712400 + 2048 [0]
          <idle>-0     [003] d.h.  3104.222072: block_rq_complete: 8,64 W () 8714448 + 2048 [0]
          <idle>-0     [003] d.h.  3104.226931: block_rq_complete: 8,64 W () 8716496 + 2048 [0]
          <idle>-0     [003] d.h.  3104.232915: block_rq_complete: 8,64 W () 8718544 + 2048 [0]
          <idle>-0     [003] d.h.  3104.238851: block_rq_complete: 8,64 W () 8720592 + 2048 [0]
          <idle>-0     [003] d.h.  3104.243593: block_rq_complete: 
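
For readers following the trace, the admission check in question looks
roughly like the sketch below -- simplified from the 4.13-era blk-wbt
may_queue()/atomic_inc_below() pair, with the rwb bookkeeping details
trimmed; illustrative only, not the RFC patch:

#include <linux/atomic.h>
#include <linux/wait.h>

struct rq_wait {
        wait_queue_head_t wait;
        atomic_t inflight;
};

static bool atomic_inc_below(atomic_t *v, int below)
{
        int cur = atomic_read(v);

        for (;;) {
                int old;

                if (cur >= below)
                        return false;
                old = atomic_cmpxchg(v, cur, cur + 1);
                if (old == cur)
                        return true;
                cur = old;
        }
}

/*
 * A writer is admitted only if nobody is already parked on rqw->wait
 * *and* the inflight count is still below the limit for its write class,
 * which is how a sync write ends up queued behind a background write
 * that hit the lower background limit.
 */
static bool may_queue_sketch(struct rq_wait *rqw, unsigned int limit)
{
        if (waitqueue_active(&rqw->wait))
                return false;

        return atomic_inc_below(&rqw->inflight, limit);
}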

Re: [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context

2017-08-28 Thread Keith Busch
Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 01/10] nvme: report more detailed status codes to the block layer

2017-08-28 Thread Keith Busch
Looks good. 

Reviewed-by: Keith Busch 


Re: [GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Christoph Hellwig
Meh, I don't think that'll work out - the rdma refactoring from
Sagi deeply depends on the uninit_ctrl split..


Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests

2017-08-28 Thread Bart Van Assche
On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> I still disagree that we should have an indirect function call like this
> in the fast path.
> 
> All that can be done by clearing or setting a flag on the first call to
> ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
> that, but for SCSI we probably can't use that given that it has more
> meaning for the old request path.  But how about just adding a new
> RQD_DRV_INITIALIZED or similar flag?

Hello Christoph,

Sorry but I'm not enthusiastic about the proposal to introduce a
RQD_DRV_INITIALIZED or similar flag. That approach involves an annoying
behavior difference, namely that .initialize_rq_fn() would be called from
inside blk_get_request() for pass-through requests and from inside the prep
function for filesystem requests. Another disadvantage of that approach is
that the block layer core never clears request.atomic_flags completely but
only sets and clears individual flags. The SCSI core would have to follow
that model and hence code for clearing RQD_DRV_INITIALIZED would have to be
added to all request completion paths in the SCSI core.
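
For reference, the flag-based alternative being discussed would look roughly
like the sketch below. RQD_DRV_INITIALIZED and both helpers are hypothetical
names, not existing kernel symbols; the second function is the clearing that
would have to be added to every completion path:

#include <linux/blkdev.h>

/* hypothetical flag, picked past the existing RQF_* bits */
#define RQD_DRV_INITIALIZED     ((__force req_flags_t)(1 << 23))

static void scsi_initialize_rq_sketch(struct request *rq)
{
        /* one-time per-request setup, e.g. clearing driver-private fields */
}

static void scsi_queue_rq_sketch(struct request *rq)
{
        /* lazy init on the first ->queue_rq()/->queuecommand() invocation */
        if (!(rq->rq_flags & RQD_DRV_INITIALIZED)) {
                scsi_initialize_rq_sketch(rq);
                rq->rq_flags |= RQD_DRV_INITIALIZED;
        }

        /* ... normal dispatch ... */
}

static void scsi_end_request_sketch(struct request *rq)
{
        /* must be cleared on every completion path so a reused request
         * is initialized again -- the cost pointed out above */
        rq->rq_flags &= ~RQD_DRV_INITIALIZED;

        /* ... normal completion ... */
}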

Have you noticed that Ming Lei's patch series introduces several new atomic
operations in the hot path? I'm referring here to the BLK_MQ_S_DISPATCH_BUSY
manipulations. Have you noticed that for SCSI drivers these patches introduce
an overhead between 0.1 and 1.0 microseconds per I/O request in the hot path?
I have derived these numbers from the random write SRP performance numbers
as follows: 1/142460 - 1/142990 = 2.6 microseconds. That number has to be
multiplied with the number of I/O requests processed in parallel. The number
of jobs in Ming Lei's test was 64 but that's probably way higher than the
actual I/O parallelism.

Have you noticed that my patch did not add any atomic instructions to the hot
path but only a read of a function pointer that should already be cache hot?
As you know modern CPUs are good at predicting branches. Are you sure that my
patch will have a measurable performance impact?

Bart.

Re: I/O hangs after resuming from suspend-to-ram

2017-08-28 Thread Oleksandr Natalenko
Hi.

On pondělí 28. srpna 2017 14:58:28 CEST Ming Lei wrote:
> Could you verify if the following patch fixes your issue?
> …SNIP…

I've applied it to v4.12.9 and rechecked — the issue is still there, 
unfortunately. Stacktrace is the same as before.

Were you able to reproduce it in a VM?

Should I re-check it with v4.13-rc7?

Any other suggestions?

Thanks.


Re: [GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Jens Axboe
On 08/28/2017 11:57 AM, Christoph Hellwig wrote:
> On Mon, Aug 28, 2017 at 10:03:39AM -0600, Jens Axboe wrote:
>> But one rock is shittier than the other. If you based it against
>> 4.14, then at least I only have to resolve the problem once.
>>
>> The real problem is that the patch that went into master had
>> cleanups too. You should push back against that, since it will
>> inevitably cause problems by diverging 4.13 and 4.14 more than
>> absolutely necessary. Having conflicts in nvme for the block tree
>> is not isolated to this series, unfortunately.
> 
> Given that the update doesn't appear to be in your for-next branch
> yet: do you want me to rebase it and resend?

Yes please.

-- 
Jens Axboe



ALPSS: reminder and submission deadline

2017-08-28 Thread Christoph Hellwig
Just one more month to go until the Alpine Linux Persistence and Storage
Summit (ALPSS) will be held from September 27-29 at the Lizumerhuette
in Austria!

If you want to submit a 30-minute talk, please do so by September 1st, as we
plan to finalize our schedule.  BOFs and team meetings will be scheduled
ad-hoc in the available meeting rooms or outside with a beautiful mountain
panorama.

If you only want to attend, you can register until the last minute, as long
as space doesn't run out.

To submit a talk or request attendance please send a mail to:

alpss...@lists.infradead.org

More details are available on our website:

http://www.alpss.at/

Thank you on behalf of the program committee:

Stephen Bates
Sagi Grimberg
Christoph Hellwig
Johannes Thumshirn
Richard Weinberger


Re: [PATCH] block: Fix warning when I/O elevator is changed as request_queue is being removed

2017-08-28 Thread Jens Axboe
On 08/27/2017 07:36 PM, Ming Lei wrote:
> On Wed, Aug 23, 2017 at 11:34 AM, Ming Lei  wrote:
>> On Wed, Aug 9, 2017 at 9:44 AM, Ming Lei  wrote:
>>> Hi David,
>>>
>>> On Wed, Aug 9, 2017 at 2:13 AM, David Jeffery  wrote:
 On 08/07/2017 07:53 PM, Ming Lei wrote:
> On Tue, Aug 8, 2017 at 3:38 AM, David Jeffery  wrote:

>>
>> Signed-off-by: David Jeffery 
>> ---
>>  block/blk-sysfs.c |2 ++
>>  block/elevator.c  |4 
>>  2 files changed, 6 insertions(+)
>>
>>
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 27aceab..b8362c0 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -931,7 +931,9 @@ void blk_unregister_queue(struct gendisk *disk)
>> if (WARN_ON(!q))
>> return;
>>
>> +   mutex_lock(&q->sysfs_lock);
>> queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
>> +   mutex_unlock(&q->sysfs_lock);
>
> Could you share why the lock of 'q->sysfs_lock' is needed here?

 As the elevator change is initiated through a sysfs attr file in the
 queue directory, the task doing the elevator change already acquires the
 q->sysfs_lock before it can try and change the elevator.  Adding the
>>>
>>> It is e->sysfs_lock which is held in elv_attr_store(), instead of 
>>> q->sysfs_lock.
>>
>> Looks like I was wrong, and the store is from queue_attr_store() instead of
>> elv_attr_store().
>>
>> I can reproduce the issue and this patch can fix the issue in my test
>> on scsi_debug,
>> so:
>>
>> Tested-by: Ming Lei 
>>
>> And it is a typical race between removing the queue kobject and adding
>> children of that kobject. We can't acquire q->sysfs_lock in
>> blk_unregister_queue() because of a sysfs limitation (it may cause a
>> deadlock), so a state flag has to be used for this purpose, just like the
>> hctx kobject register/unregister code does, and it should be fine to use
>> QUEUE_FLAG_REGISTERED here. More changes would be required if we used
>> e->registered, so this patch looks fine:
>>
>>Reviewed-by: Ming Lei 
> 
> 
> Hi Jens,
> 
> Could you consider this patch for v4.13 or v4.14?

Yep, added for 4.14, thanks.

-- 
Jens Axboe



Re: [PATCH] block, scheduler: convert xxx_var_store to void

2017-08-28 Thread weiping zhang
On Mon, Aug 28, 2017 at 10:00:46AM -0600, Jens Axboe wrote:
> On 08/28/2017 06:22 AM, weiping zhang wrote:
> > On Fri, Aug 25, 2017 at 01:11:33AM +0800, weiping zhang wrote:
> >> The last parameter "count" is never used in xxx_var_store,
> >> convert these functions to void.
> >>
> >
> > Would you please look at this misc patch at your convenience?
> 
> Looks fine. But please don't send reminders so shortly after sending a
> patch, especially when it's just a cleanup. This was received Thursday.
> 
> -- 
> Jens Axboe
> 

Got it, Thanks! ^_^


Re: [GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Jens Axboe
On 08/28/2017 09:59 AM, Christoph Hellwig wrote:
> On Mon, Aug 28, 2017 at 09:55:55AM -0600, Jens Axboe wrote:
>>> It was based on the nvme-3.13 tree from about two weeks ago,
>>> which was based on your for-linus at about that time.
>>
>> Changes for 4.14 should be against the 4.14 branch. for-linus is
>> for 4.13. In any case, there will be a conflict against master...
> 
> We would have otherwise had conflicts against changes in Linus'
> tree for 4.13, so we were between a rock and a hard place here.

But one rock is shittier than the other. If you based it against
4.14, then at least I only have to resolve the problem once.

The real problem is that the patch that went into master had
cleanups too. You should push back against that, since it will
inevitably cause problems by diverging 4.13 and 4.14 more than
absolutely necessary. Having conflicts in nvme for the block tree
is not isolated to this series, unfortunately.

-- 
Jens Axboe



Re: [PATCH] block, scheduler: convert xxx_var_store to void

2017-08-28 Thread Jens Axboe
On 08/28/2017 06:22 AM, weiping zhang wrote:
> On Fri, Aug 25, 2017 at 01:11:33AM +0800, weiping zhang wrote:
>> The last parameter "count" is never used in xxx_var_store,
>> convert these functions to void.
>>
>
> Would you please look at this misc patch at your convenience?

Looks fine. But please don't send reminders so shortly after sending a
patch, especially when it's just a cleanup. This was received Thursday.

-- 
Jens Axboe



Re: [GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 09:55:55AM -0600, Jens Axboe wrote:
> > It was based on the nvme-3.13 tree from about two weeks ago,
> > which was based on your for-linus at about that time.
> 
> Changes for 4.14 should be against the 4.14 branch. for-linus is
> for 4.13. In any case, there will be a conflict against master...

We would have otherwise had conflicts against changes in Linus'
tree for 4.13, so we were between a rock and a hard place here.


Re: [GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Jens Axboe
On 08/28/2017 09:53 AM, Christoph Hellwig wrote:
> On Mon, Aug 28, 2017 at 09:51:38AM -0600, Jens Axboe wrote:
>> What is this against, exactly? It doesn't merge cleanly with my 4.14 branch.
>> From a quick look, looks like it's conflicting with:
> 
> It was based on the nvme-3.13 tree from about two weeks ago,
> which was based on your for-linus at about that time.

Changes for 4.14 should be against the 4.14 branch. for-linus is
for 4.13. In any case, there will be a conflict against master...

-- 
Jens Axboe



Re: [GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 09:51:38AM -0600, Jens Axboe wrote:
> What is this against, exactly? It doesn't merge cleanly with my 4.14 branch.
> From a quick look, looks like it's conflicting with:

It was based on the nvme-3.13 tree from about two weeks ago,
which was based on your for-linus at about that time.


Re: [GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Jens Axboe
On 08/28/2017 08:33 AM, Christoph Hellwig wrote:
> Hi Jens,
> 
> below is the current set of NVMe updates for Linux 4.14.
> 
> The biggest bit comes from Sagi and refactors the RDMA driver to prepare
> for more code sharing in the setup and teardown path.  But we have various
> features and bug fixes from a lot of people as well.
> 
> The following changes since commit 16a5a480f067f945fd27bf91ffdce3f959b0d4b6:
> 
>   nvmet-fc: correct use after free on list teardown (2017-08-16 10:06:18 
> +0200)
> 
> are available in the git repository at:
> 
>   git://git.infradead.org/nvme.git nvme-4.14
> 
> for you to fetch changes up to 16edd3d0272368eee5cac13356787c1f4d4eb255:
> 
>   nvme: honor RTD3 Entry Latency for shutdowns (2017-08-28 16:29:29 +0200)

What is this against, exactly? It doesn't merge cleanly with my 4.14 branch.
From a quick look, looks like it's conflicting with:

commit d09f2b45f346f0a9e5e1b5fcea531b1b393671dc
Author: Sagi Grimberg 
Date:   Sun Jul 2 10:56:43 2017 +0300

nvme: split nvme_uninit_ctrl into stop and uninit

which went into master after for-4.14/block was forked off. I can fix it
up, but would be best if you did it.

-- 
Jens Axboe



FUA, PREFLUSH, VWC redux

2017-08-28 Thread John Groves
I'm hoping somebody can clarify the correct PREFLUSH and FUA behavior from a
block stack client.  I'm starting to suspect that some of my devices are
failing to inform the block stack correctly of their capabilities - leading
to expensive and unnecessary flushes.

My understanding is this:

* If I require confirmation that a single write is durable upon
completion, attach REQ_FUA to the bio.
* If I require that the current write plus all previously completed
writes be durable as of completion of the current write, attach
REQ_FUA and REQ_PREFLUSH to the bio.
* Both REQ_FUA and REQ_PREFLUSH should be NOPs on a device with a
non-volatile write cache.

But I have a small problem - I don't know which write is last, until I
receive a Commit call.  That means that at Commit time, I should send
a "FLUSH".

I am calling a function based on blkdev_issue_flush(), but this causes a
big performance penalty even though my NVMe SSD has VWC==0.

Does this make sense?

Perhaps blkdev_issue_flush() is intended to flush caches even if they
are non-volatile?

What is the "correct" way to get a "REQ_PREFLUSH" with a zero length
write, if not via blkdev_issue_flush()?
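
In case it helps, an explicit flush without blkdev_issue_flush() is normally
just a zero-length write bio with REQ_PREFLUSH set. A minimal sketch against
the 4.13-era bio API (error handling trimmed, illustrative only):

#include <linux/bio.h>
#include <linux/blkdev.h>

static int issue_flush_sketch(struct block_device *bdev)
{
        struct bio *bio;
        int ret;

        bio = bio_alloc(GFP_KERNEL, 0);         /* no data pages: pure flush */
        bio->bi_bdev = bdev;
        bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;

        ret = submit_bio_wait(bio);             /* returns once the flush completes */
        bio_put(bio);
        return ret;
}

Note that the flush state machine is expected to short-circuit
REQ_PREFLUSH/REQ_FUA when the queue does not advertise a volatile write
cache, so a measurable penalty with VWC==0 may be worth chasing into how
the queue's write-cache flag ends up being set for the device.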

Thanks for any tips,
John Groves


[GIT PULL] nvme update for Linux 4.14

2017-08-28 Thread Christoph Hellwig
Hi Jens,

below is the current set of NVMe updates for Linux 4.14.

The biggest bit comes from Sagi and refactors the RDMA driver to prepare
for more code sharing in the setup and teardown path.  But we have various
features and bug fixes from a lot of people as well.

The following changes since commit 16a5a480f067f945fd27bf91ffdce3f959b0d4b6:

  nvmet-fc: correct use after free on list teardown (2017-08-16 10:06:18 +0200)

are available in the git repository at:

  git://git.infradead.org/nvme.git nvme-4.14

for you to fetch changes up to 16edd3d0272368eee5cac13356787c1f4d4eb255:

  nvme: honor RTD3 Entry Latency for shutdowns (2017-08-28 16:29:29 +0200)


Arnav Dawn (2):
  nvme: add support for FW activation without reset
  nvme: define NVME_NSID_ALL

Christoph Hellwig (1):
  nvmet: use NVME_NSID_ALL

Guan Junxiong (2):
  nvmet: fix the return error code of target if host is not allowed
  nvme-fabrics: log a warning if hostid is invalid

James Smart (2):
  nvme-fc: Reattach to localports on re-registration
  nvmet-fc: simplify sg list handling

Jan H. Schönherr (1):
  nvme: fix uninitialized prp2 value on small transfers

Johannes Thumshirn (2):
  nvmet-fcloop: remove ALL_OPTS define
  nvme-rdma: remove NVME_RDMA_MAX_SEGMENT_SIZE

Jon Derrick (1):
  nvme: add support for NVMe 1.3 Timestamp Feature

Martin K. Petersen (1):
  nvme: honor RTD3 Entry Latency for shutdowns

Martin Wilck (2):
  string.h: add memcpy_and_pad()
  nvmet: use memcpy_and_pad for identify sn/fr

Max Gurtovoy (3):
  nvme: add symbolic constants for CC identifiers
  nvme: rename AMS symbolic constants to fit specification
  nvme-rdma: Use unlikely macro in the fast path

Sagi Grimberg (13):
  nvme-rdma: move nvme_rdma_configure_admin_queue code location
  nvme: Add admin_tagset pointer to nvme_ctrl
  nvme-rdma: move tagset allocation to a dedicated routine
  nvme-rdma: disable the controller on resets
  nvme-rdma: don't free tagset on resets
  nvme-rdma: reuse configure/destroy_admin_queue
  nvme-rdma: introduce configure/destroy io queues
  nvme-rdma: stop queues instead of simply flipping their state
  nvme-rdma: rename nvme_rdma_init_queue to nvme_rdma_alloc_queue
  nvme-rdma: introduce nvme_rdma_start_queue
  nvme-rdma: cleanup error path in controller reset
  nvme-rdma: call ops->reg_read64 instead of nvmf_reg_read64
  nvme: fix identify namespace logging

 drivers/nvme/host/core.c| 120 -
 drivers/nvme/host/fabrics.c |   1 +
 drivers/nvme/host/fc.c  | 145 ---
 drivers/nvme/host/nvme.h|   4 +
 drivers/nvme/host/pci.c |   5 +-
 drivers/nvme/host/rdma.c| 559 
 drivers/nvme/target/admin-cmd.c |  16 +-
 drivers/nvme/target/configfs.c  |   2 +-
 drivers/nvme/target/core.c  |  15 +-
 drivers/nvme/target/fc.c|  48 +---
 drivers/nvme/target/fcloop.c|   3 -
 drivers/nvme/target/loop.c  |   1 +
 include/linux/nvme-fc-driver.h  |   2 +-
 include/linux/nvme.h|  37 ++-
 include/linux/string.h  |  30 +++
 15 files changed, 600 insertions(+), 388 deletions(-)


Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-28 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 08:41:23PM +0800, Guan Junxiong wrote:
> If a namespace can be accessed by another subsystem,  the above line
> will ignore such namespace.

And that's intentional.

> Or does the NVMe/NVMf specification constrain that any namespace
> can only be accessed by a subsystem?

Yes, inside the NVMe spec a Namespace is contained inside a Subsystem.
That doesn't preclude other ways to access the LBAs, but they are outside
the specification (e.g. also exporting them as SCSI).


Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-28 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 04:40:43PM +0300, Sagi Grimberg wrote:
> I thought your multipathing code would really live in the block
> layer and only require something really basic from nvme (which could
> easily be applied on other drivers). But I do understand it might
> create a lot of churn.

The earlier versions did a lot more in common code, but I gradually
moved away from that:

 - first I didn't have a separate queue, but just bounced I/O between
   sibling queues.  So there was no new make_request based queue,
   and we had to track the relations in the block layer, with a
   callback to check the path status
 - I got rid of the non-trivial path selector

So in the end very little block layer code remained.  But then again
very little nvme code remained either..

> btw, why are partial completions something that can't be done
> without cloning the bio? is it possible to clone the bio once from the
> completion flow when you see that you got a partial completion?

The problem with partial completions is that blk_update_request
completes bios as soon as it gets enough bytes to finish them.

This should not be an unsolvable problem, but it will be a bit messy
at least.  But then again I hope that no new protocols will be designed
with partial completions - SCSI is pretty special in that regard.


Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-28 Thread Sagi Grimberg



the make_request


That basically just does a lookup (in a data structure we need for
tracking the siblings inside nvme anyway) and then submits the
I/O again.  All the generic code is in generic_make_request_fast.
There are two more things which could move into common code, one
reasonable, the other not:

  (1) the whole requeue_list list logic.  I thought about moving this
  to the block layer as I'd also have some other use cases for it,
  but decided to wait until those materialize to see if I'd really
  need it.
  (2) turning the logic inside out, e.g. providing a generic make_request
  and supplying a find_path callback to it.  This would require (1)
  but even then we'd save basically no code, add an indirect call
  in the fast path and make things harder to read.  This actually
  is how I started out and didn't like it.
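
To make option (2) above concrete, a sketch of a generic make_request driven
by a find_path callback follows; every name here is hypothetical (none of
this exists in the kernel), and the ops->find_path() call is exactly the
indirect call in the fast path being objected to:

#include <linux/bio.h>
#include <linux/blkdev.h>

struct mpath_ops {
        struct block_device *(*find_path)(void *head);  /* the indirect call */
};

struct mpath_data {
        const struct mpath_ops  *ops;
        void                    *head;          /* driver's sibling-tracking structure */
        struct bio_list         requeue_list;   /* item (1) above; locking omitted */
};

static blk_qc_t mpath_make_request_sketch(struct request_queue *q,
                                          struct bio *bio)
{
        struct mpath_data *md = q->queuedata;
        struct block_device *path = md->ops->find_path(md->head);

        if (!path) {
                /* no usable path: park the bio for later resubmission */
                bio_list_add(&md->requeue_list, bio);
                return BLK_QC_T_NONE;
        }

        /* retarget the bio at the chosen path and send it back down */
        bio->bi_bdev = path;
        return generic_make_request(bio);
}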


and failover logic


The big thing about the failover logic is looking a the detailed error
codes and changing the controller state, all of which is fundamentally
driver specific.  The only thing that could reasonably be common is
the requeue_list handling as mentioned above.  Note that this will
become even more integrated with the nvme driver once ANA support
lands.


Not arguing with you at all, you obviously gave it a lot of thought.

I thought your multipathing code would really live in the block
layer and only require something really basic from nvme (which could
easily be applied on other drivers). But I do understand it might
create a lot of churn.

btw, why are partial completions something that can't be done
without cloning the bio? is it possible to clone the bio once from the
completion flow when you see that you got a partial completion?


Re: I/O hangs after resuming from suspend-to-ram

2017-08-28 Thread Ming Lei
On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote:
> Ming Lei - 28.08.17, 20:58:
> > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote:
> > > Hi.
> > > 
> > > Here is disk setup for QEMU VM:
> > > 
> > > ===
> > > [root@archmq ~]# smartctl -i /dev/sda
> > > …
> > > Device Model: QEMU HARDDISK
> > > Serial Number:QM1
> > > Firmware Version: 2.5+
> > > User Capacity:4,294,967,296 bytes [4.29 GB]
> > > Sector Size:  512 bytes logical/physical
> > > Device is:Not in smartctl database [for details use: -P showall]
> > > ATA Version is:   ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000
> > > Local Time is:Sun Aug 27 09:31:54 2017 CEST
> > > SMART support is: Available - device has SMART capability.
> > > SMART support is: Enabled
> > > 
> > > [root@archmq ~]# lsblk
> > > NAMEMAJ:MIN RM  SIZE RO TYPE   MOUNTPOINT
> > > sda   8:004G  0 disk
> > > `-sda18:104G  0 part
> > > 
> > >   `-md0   9:004G  0 raid10
> > >   
> > > `-system253:004G  0 crypt
> > > 
> > >   |-system-boot 253:10  512M  0 lvm/boot
> > >   |-system-swap 253:20  512M  0 lvm[SWAP]
> > >   
> > >   `-system-root 253:303G  0 lvm/
> > > 
> > > sdb   8:16   04G  0 disk
> > > `-sdb18:17   04G  0 part
> > > 
> > >   `-md0   9:004G  0 raid10
> > >   
> > > `-system253:004G  0 crypt
> > > 
> > >   |-system-boot 253:10  512M  0 lvm/boot
> > >   |-system-swap 253:20  512M  0 lvm[SWAP]
> > >   
> > >   `-system-root 253:303G  0 lvm/
> > > 
> > > sr0  11:01 1024M  0 rom
> > > 
> > > [root@archmq ~]# mdadm --misc --detail /dev/md0
> > > 
> > > /dev/md0:
> > > Version : 1.2
> > >   
> > >   Creation Time : Sat Jul 29 16:37:05 2017
> > >   
> > >  Raid Level : raid10
> > >  Array Size : 4191232 (4.00 GiB 4.29 GB)
> > >   
> > >   Used Dev Size : 4191232 (4.00 GiB 4.29 GB)
> > >   
> > >Raid Devices : 2
> > >   
> > >   Total Devices : 2
> > >   
> > > Persistence : Superblock is persistent
> > > 
> > > Update Time : Sun Aug 27 09:30:33 2017
> > > 
> > >   State : clean
> > >  
> > >  Active Devices : 2
> > > 
> > > Working Devices : 2
> > > 
> > >  Failed Devices : 0
> > >  
> > >   Spare Devices : 0
> > >   
> > >  Layout : far=2
> > >  
> > >  Chunk Size : 512K
> > >  
> > >Name : archiso:0
> > >UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e
> > >  
> > >  Events : 485
> > > 
> > > Number   Major   Minor   RaidDevice State
> > > 
> > >0   810  active sync   /dev/sda1
> > >1   8   171  active sync   /dev/sdb1
> > > 
> > > ===
> > > 
> > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it,
> > > then
> > > LVM, then ext4 for boot, swap and btrfs for /.
> > > 
> > > I couldn't reproduce the issue with single disk without RAID.
> > 
> > Could you verify if the following patch fixes your issue?
> 
> Could this also apply to non MD RAID systems? I am using BTRFS RAID
> 1 with two SSDs. So far with CFQ it runs stable.

It is for fixing Oleksandr's issue wrt. blk-mq, and doesn't look relevant to your case.

-- 
Ming


Re: I/O hangs after resuming from suspend-to-ram

2017-08-28 Thread Martin Steigerwald
Ming Lei - 28.08.17, 20:58:
> On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote:
> > Hi.
> > 
> > Here is disk setup for QEMU VM:
> > 
> > ===
> > [root@archmq ~]# smartctl -i /dev/sda
> > …
> > Device Model: QEMU HARDDISK
> > Serial Number:QM1
> > Firmware Version: 2.5+
> > User Capacity:4,294,967,296 bytes [4.29 GB]
> > Sector Size:  512 bytes logical/physical
> > Device is:Not in smartctl database [for details use: -P showall]
> > ATA Version is:   ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000
> > Local Time is:Sun Aug 27 09:31:54 2017 CEST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> > 
> > [root@archmq ~]# lsblk
> > NAMEMAJ:MIN RM  SIZE RO TYPE   MOUNTPOINT
> > sda   8:004G  0 disk
> > `-sda18:104G  0 part
> > 
> >   `-md0   9:004G  0 raid10
> >   
> > `-system253:004G  0 crypt
> > 
> >   |-system-boot 253:10  512M  0 lvm/boot
> >   |-system-swap 253:20  512M  0 lvm[SWAP]
> >   
> >   `-system-root 253:303G  0 lvm/
> > 
> > sdb   8:16   04G  0 disk
> > `-sdb18:17   04G  0 part
> > 
> >   `-md0   9:004G  0 raid10
> >   
> > `-system253:004G  0 crypt
> > 
> >   |-system-boot 253:10  512M  0 lvm/boot
> >   |-system-swap 253:20  512M  0 lvm[SWAP]
> >   
> >   `-system-root 253:303G  0 lvm/
> > 
> > sr0  11:01 1024M  0 rom
> > 
> > [root@archmq ~]# mdadm --misc --detail /dev/md0
> > 
> > /dev/md0:
> > Version : 1.2
> >   
> >   Creation Time : Sat Jul 29 16:37:05 2017
> >   
> >  Raid Level : raid10
> >  Array Size : 4191232 (4.00 GiB 4.29 GB)
> >   
> >   Used Dev Size : 4191232 (4.00 GiB 4.29 GB)
> >   
> >Raid Devices : 2
> >   
> >   Total Devices : 2
> >   
> > Persistence : Superblock is persistent
> > 
> > Update Time : Sun Aug 27 09:30:33 2017
> > 
> >   State : clean
> >  
> >  Active Devices : 2
> > 
> > Working Devices : 2
> > 
> >  Failed Devices : 0
> >  
> >   Spare Devices : 0
> >   
> >  Layout : far=2
> >  
> >  Chunk Size : 512K
> >  
> >Name : archiso:0
> >UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e
> >  
> >  Events : 485
> > 
> > Number   Major   Minor   RaidDevice State
> > 
> >0   810  active sync   /dev/sda1
> >1   8   171  active sync   /dev/sdb1
> > 
> > ===
> > 
> > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it,
> > then
> > LVM, then ext4 for boot, swap and btrfs for /.
> > 
> > I couldn't reproduce the issue with single disk without RAID.
> 
> Could you verify if the following patch fixes your issue?

Could this also apply to non MD RAID systems? I am using BTRFS RAID
1 with two SSDs. So far with CFQ it runs stable.

Thanks,
Martin

> From 9fa53d708ebc1d5b87e62e542dc54272529da244 Mon Sep 17 00:00:00 2001
> From: Ming Lei 
> Date: Mon, 28 Aug 2017 19:59:08 +0800
> Subject: [PATCH] blk-mq: align to legacy path for implementing
> blk_execute_rq
> 
> In legacy path, when one request is run via blk_execute_rq(),
> it is added to q->queue_head directly, and I/O scheduler's
> queue is bypassed because either merging or sorting isn't
> needed.
> 
> When SCSI device is put into quiece state, such as during
> system suspend, we need to add the request of RQF_PM into
> head of the queue.
> 
> This patch fixes I/O hang after system resume.
> 
> Reported-by: Oleksandr Natalenko 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-core.c |  2 +-
>  block/blk-exec.c |  2 +-
>  block/blk-flush.c|  2 +-
>  block/blk-mq-sched.c | 58
>  block/blk-mq-sched.h |  2 ++
>  5 files changed, 63 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index dbecbf4a64e0..fb75bc646ebc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2330,7 +2330,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>   if (q->mq_ops) {
>   if (blk_queue_io_stat(q))
>   blk_account_io_start(rq, true);
> - blk_mq_sched_insert_request(rq, false, true, false, false);
> + blk_mq_sched_insert_request_bypass(rq, false, true, false, false);
>   return BLK_STS_OK;
>   }
> 
> diff --git a/block/blk-exec.c b/block/blk-exec.c
> index 5c0f3dc446dc..4565aa6bb624 100644
> --- a/block/blk-exec.c
> +++ b/block/blk-exec.c
> @@ -61,7 +61,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
>  * be reused after dying flag is set
>  */
>

Re: I/O hangs after resuming from suspend-to-ram

2017-08-28 Thread Ming Lei
On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote:
> Hi.
> 
> Here is disk setup for QEMU VM:
> 
> ===
> [root@archmq ~]# smartctl -i /dev/sda
> …
> Device Model: QEMU HARDDISK
> Serial Number:QM1
> Firmware Version: 2.5+
> User Capacity:4,294,967,296 bytes [4.29 GB]
> Sector Size:  512 bytes logical/physical
> Device is:Not in smartctl database [for details use: -P showall]
> ATA Version is:   ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000
> Local Time is:Sun Aug 27 09:31:54 2017 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> [root@archmq ~]# lsblk
> NAMEMAJ:MIN RM  SIZE RO TYPE   MOUNTPOINT
> sda   8:004G  0 disk   
> `-sda18:104G  0 part   
>   `-md0   9:004G  0 raid10 
> `-system253:004G  0 crypt  
>   |-system-boot 253:10  512M  0 lvm/boot
>   |-system-swap 253:20  512M  0 lvm[SWAP]
>   `-system-root 253:303G  0 lvm/
> sdb   8:16   04G  0 disk   
> `-sdb18:17   04G  0 part   
>   `-md0   9:004G  0 raid10 
> `-system253:004G  0 crypt  
>   |-system-boot 253:10  512M  0 lvm/boot
>   |-system-swap 253:20  512M  0 lvm[SWAP]
>   `-system-root 253:303G  0 lvm/
> sr0  11:01 1024M  0 rom
> 
> [root@archmq ~]# mdadm --misc --detail /dev/md0
> /dev/md0:
> Version : 1.2
>   Creation Time : Sat Jul 29 16:37:05 2017
>  Raid Level : raid10
>  Array Size : 4191232 (4.00 GiB 4.29 GB)
>   Used Dev Size : 4191232 (4.00 GiB 4.29 GB)
>Raid Devices : 2
>   Total Devices : 2
> Persistence : Superblock is persistent
> 
> Update Time : Sun Aug 27 09:30:33 2017
>   State : clean 
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
> 
>  Layout : far=2
>  Chunk Size : 512K
> 
>Name : archiso:0
>UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e
>  Events : 485
> 
> Number   Major   Minor   RaidDevice State
>0   810  active sync   /dev/sda1
>1   8   171  active sync   /dev/sdb1
> ===
> 
> In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it, then 
> LVM, then ext4 for boot, swap and btrfs for /.
> 
> I couldn't reproduce the issue with single disk without RAID.

Could you verify if the following patch fixes your issue?


From 9fa53d708ebc1d5b87e62e542dc54272529da244 Mon Sep 17 00:00:00 2001
From: Ming Lei 
Date: Mon, 28 Aug 2017 19:59:08 +0800
Subject: [PATCH] blk-mq: align to legacy path for implementing blk_execute_rq

In the legacy path, when a request is run via blk_execute_rq(),
it is added to q->queue_head directly, and the I/O scheduler's
queue is bypassed because neither merging nor sorting is needed.

When a SCSI device is put into the quiesce state, such as during
system suspend, we need to add requests with RQF_PM set to the
head of the queue.

This patch fixes an I/O hang after system resume.

Reported-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/blk-core.c |  2 +-
 block/blk-exec.c |  2 +-
 block/blk-flush.c|  2 +-
 block/blk-mq-sched.c | 58 
 block/blk-mq-sched.h |  2 ++
 5 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index dbecbf4a64e0..fb75bc646ebc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2330,7 +2330,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
if (q->mq_ops) {
if (blk_queue_io_stat(q))
blk_account_io_start(rq, true);
-   blk_mq_sched_insert_request(rq, false, true, false, false);
+   blk_mq_sched_insert_request_bypass(rq, false, true, false, 
false);
return BLK_STS_OK;
}
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 5c0f3dc446dc..4565aa6bb624 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -61,7 +61,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct 
gendisk *bd_disk,
 * be reused after dying flag is set
 */
if (q->mq_ops) {
-   blk_mq_sched_insert_request(rq, at_head, true, false, false);
+   blk_mq_sched_insert_request_bypass(rq, at_head, true, false, 
false);
return;
}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index ed5fe322abba..51e89e5c525a 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq)
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if 

Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-28 Thread Guan Junxiong


On 2017/8/24 1:58, Christoph Hellwig wrote:
> +static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
> + struct nvme_id_ns *id)
> +{
> + struct nvme_ctrl *ctrl = ns->ctrl;
> + bool is_shared = id->nmic & (1 << 0);
> + struct nvme_ns_head *head = NULL;
> + int ret = 0;
> +
> + mutex_lock(&ctrl->subsys->lock);
> + if (is_shared)
> + head = __nvme_find_ns_head(ctrl->subsys, nsid);

If a namespace can be accessed by another subsystem, the above line
will ignore such a namespace.

Or does the NVMe/NVMf specification constrain that any namespace
can only be accessed by a single subsystem?

More comments after testing will be sent later.

Thanks
Guan Junxiong




Re: [PATCH 08/10] block: provide a generic_make_request_fast helper

2017-08-28 Thread Sagi Grimberg



That being said ->make_request basically doesn't care about actual
limits at all, it mostly cares about support features (e.g. discard, fua,
etc).  So I think a lot of these limits could probably be lifted,
but I'd need to add the check to generic_make_request_checks back.


Different virt_boundary capabilities will trigger bio splits which
can make make_request block (due to lack of tags).


All the bio splitting is done in blk_queue_split, and other things
related to limits are done even later in blk_mq_make_request when
building the request.  For normal make_request based stacking drivers
nothing of this matters, although a few drivers do call blk_queue_split
manually from their make_request method.


Maybe I misunderstood the changelog comment, but didn't it say that
the caller is not allowed to submit a bio that might split?


Re: [PATCH] block, scheduler: convert xxx_var_store to void

2017-08-28 Thread weiping zhang
On Fri, Aug 25, 2017 at 01:11:33AM +0800, weiping zhang wrote:
> The last parameter "count" is never used in xxx_var_store(),
> so convert these functions to void.
> 
> Signed-off-by: weiping zhang 
> ---
>  block/bfq-iosched.c  | 33 +
>  block/cfq-iosched.c  | 13 ++---
>  block/deadline-iosched.c |  9 -
>  block/mq-deadline.c  |  9 -
>  4 files changed, 31 insertions(+), 33 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 436b6ca..7a4085d 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4787,16 +4787,13 @@ static ssize_t bfq_var_show(unsigned int var, char 
> *page)
>   return sprintf(page, "%u\n", var);
>  }
>  
> -static ssize_t bfq_var_store(unsigned long *var, const char *page,
> -  size_t count)
> +static void bfq_var_store(unsigned long *var, const char *page)
>  {
>   unsigned long new_val;
>   int ret = kstrtoul(page, 10, &new_val);
>  
>   if (ret == 0)
>   *var = new_val;
> -
> - return count;
>  }
>  
>  #define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
> @@ -4838,7 +4835,7 @@ __FUNC(struct elevator_queue *e, const char *page, 
> size_t count)\
>  {\
>   struct bfq_data *bfqd = e->elevator_data;   \
>   unsigned long uninitialized_var(__data);\
> - int ret = bfq_var_store(&__data, (page), count);\
> + bfq_var_store(&__data, (page)); \
>   if (__data < (MIN)) \
>   __data = (MIN); \
>   else if (__data > (MAX))\
> @@ -4849,7 +4846,7 @@ __FUNC(struct elevator_queue *e, const char *page, 
> size_t count)\
>   *(__PTR) = (u64)__data * NSEC_PER_MSEC; \
>   else\
>   *(__PTR) = __data;  \
> - return ret; \
> + return count;   \
>  }


Hi Jens,

Would you please take a look at this misc patch at your convenience?

Thanks



Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-28 Thread javigon
> On 23 Aug 2017, at 19.58, Christoph Hellwig  wrote:
> 
> Introduce a new struct nvme_ns_head [1] that holds information about
> an actual namespace, unlike struct nvme_ns, which only holds the
> per-controller namespace information.  For private namespaces there
> is a 1:1 relation of the two, but for shared namespaces this lets us
> discover all the paths to it.  For now only the identifiers are moved
> to the new structure, but most of the information in struct nvme_ns
> should eventually move over.
> 
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.
> 
> [1] comments welcome if you have a better name for it, the current one is
>horrible.  One idea would be to rename the current struct nvme_ns
>to struct nvme_ns_link or similar and use the nvme_ns name for the
>new structure.  But that would involve a lot of churn.
> 
> Signed-off-by: Christoph Hellwig 
> ---
> drivers/nvme/host/core.c | 218 +++
> drivers/nvme/host/lightnvm.c |  14 +--

Nothing big here. Looks good.

Reviewed-by: Javier González 




Re: [PATCH 08/10] block: provide a generic_make_request_fast helper

2017-08-28 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 02:01:16PM +0300, Sagi Grimberg wrote:
>> That being said ->make_request basically doesn't care about actual
>> limits at all, it mostly cares about support features (e.g. discard, fua,
>> etc).  So I think a lot of these limits could probably be lifted,
>> but I'd need to add the check to generic_make_request_checks back.
>
> Different virt_boundary capabilities will trigger bio splits which
> can make make_request blocking (due to lack of tags).

All the bio splitting is done in blk_queue_split, and other things
related to limits are done even later in blk_mq_make_request when
building the request.  For normal make_request based stacking drivers
nothing of this matters, although a few drivers do call blk_queue_split
manually from their make_request method.
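
For a driver that does this itself, the pattern looks roughly like the
following (a sketch only; mydrv_make_request() and mydrv_lower_bdev() are
made-up names, not code from this series):

static blk_qc_t mydrv_make_request(struct request_queue *q, struct bio *bio)
{
	/* split the bio on this queue's limits (max sectors, virt_boundary, ...) */
	blk_queue_split(q, &bio);

	/* remap and resubmit the (possibly split) bio to the lower device */
	bio->bi_bdev = mydrv_lower_bdev(q);	/* hypothetical helper */
	generic_make_request(bio);
	return BLK_QC_T_NONE;
}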


Re: [PATCH 08/10] block: provide a generic_make_request_fast helper

2017-08-28 Thread Sagi Grimberg



This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues, and it does not provide any recursion preventions,
which requires the caller to not split the bio.


Isn't the same limits constraint too restrictive?

Say I have two paths to the same namespace via two different HBAs, each
with its own virt_boundary capability for example? That would require us
to split failover bio wouldn't it?


Uh oh - different transports for the same subsystem will be interesting.
For one it's not specified anywhere so I'd like to kick off a discussion
on the working group mailing list about it.


Indeed that would be interesting, but I wasn't referring to different
transports, even the same transport can have different capabilities
(for example CX4 and CX4 devices on the same host).


That being said ->make_request basically doesn't care about actual
limits at all, it mostly cares about support features (e.g. discard, fua,
etc).  So I think a lot of these limits could probably be lifted,
but I'd need to add the check to generic_make_request_checks back.


Different virt_boundary capabilities will trigger bio splits which
can make make_request block (due to lack of tags).


Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-28 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 10:23:33AM +0300, Sagi Grimberg wrote:
> This is really taking a lot into the nvme driver.

This patch adds about 100 lines of code, out of which 1/4th is the
parsing of status codes.  I don't think it's all that much..

> I'm not sure if
> this approach will be used in other block drivers, but would it
> make sense to place the block_device node creation,

I really want to reduce the amount of boilerplate code for creating
a block queue and I have some ideas for that.  But that's not in
any way related to this multi-path code.

> the make_request

That basically just does a lookup (in a data structure we need for
tracking the siblings inside nvme anyway) and then submits the
I/O again.  All the generic code is in generic_make_request_fast.
There are two more things which could move into common code, one
reasonable, the other not:

 (1) the whole requeue_list list logic.  I thought about moving this
 to the block layer as I'd also have some other use cases for it,
 but decided to wait until those materialize to see if I'd really
 need it.
 (2) turning the logic inside out, e.g. providing a generic make_request
 and supplying a find_path callback to it.  This would require (1)
 but even then we'd save basically no code, add an indirect call
 in the fast path and make things harder to read.  This actually
 is how I started out and didn't like it; a rough sketch of that shape is below.
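
Illustrative only: struct blk_mpath_head, blk_mpath_ops and all field names
here are made up; only generic_make_request_fast comes from this series.

struct blk_mpath_ops {
	struct block_device *(*find_path)(void *private);
};

struct blk_mpath_head {				/* hypothetical */
	const struct blk_mpath_ops *ops;
	void *private;
};

static blk_qc_t blk_mpath_make_request(struct request_queue *q, struct bio *bio)
{
	struct blk_mpath_head *head = q->queuedata;
	struct block_device *bdev;

	/* the extra indirect call in the fast path mentioned above */
	bdev = head->ops->find_path(head->private);
	if (!bdev) {
		bio_io_error(bio);
		return BLK_QC_T_NONE;
	}

	bio->bi_bdev = bdev;
	generic_make_request_fast(bio);
	return BLK_QC_T_NONE;
}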

> and failover logic

The big thing about the failover logic is looking at the detailed error
codes and changing the controller state, all of which is fundamentally
driver specific.  The only thing that could reasonably be common is
the requeue_list handling as mentioned above.  Note that this will
become even more integrated with the nvme driver once ANA support
lands.

> and maybe the path selection in the block layer,
> leaving just the construction of the path mappings in nvme?

The point is that path selection fundamentally depends on your
actual protocol.  E.g. for NVMe we now look at the controller state
as that's what our keep alive works on, and what reflects that status
in PCIe. We could try to communicate this up, but it would just lead
to more data structures that get out of sync.  And this will become
much worse with ANA where we have even more fine grained state.
In the end path selection right now is less than 20 lines of code.

Path selection is complicated when you want non-trivial path selector
algorithms like round robin or service time, but I'd really avoid having
those on nvme - we already have multiple queues to go beyond the limits
of a single queue and use blk-mq for that, and once we're beyond the
limits of a single transport path for performance reasons I'd much
rather rely on ANA, numa nodes or static partitioning of namespaces
than trying to do dynamic decisions in the fast path.  But if I don't
get what I want there and we'll need more complicated algorithms they
absolutely should be in common code.  In fact one of my earlier prototypes
moved the dm-mpath path selector algorithms to core block code and
tried to use them - it just turned out enabling them was pretty much
pointless.
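
For scale, the current first-available-path policy amounts to something like
this (a simplified sketch, not the literal code from the series; the siblings
member name follows the renaming discussed elsewhere in this thread):

static struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
{
	struct nvme_ns *ns;

	/*
	 * Caller holds srcu_read_lock(&head->srcu); entries are removed with
	 * list_del_rcu() and freed only after an SRCU grace period.
	 */
	list_for_each_entry_rcu(ns, &head->list, siblings)
		if (ns->ctrl->state == NVME_CTRL_LIVE)
			return ns;
	return NULL;
}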


Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests

2017-08-28 Thread Christoph Hellwig
I still disagree that we should have an indirect function call like this
in the fast path.

All that can be done by clearing or setting a flag on the first call to
->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
that, but for SCSI we probably can't use that given that it has more
meaning for the old request path.  But how about just adding a new
RQD_DRV_INITIALIZED or similar flag?
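
I.e. something along these lines (a sketch only; RQD_DRV_INITIALIZED is the
flag suggested above and does not exist today, and the one-time work is just
represented by the driver's existing init helper):

static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct request *req = bd->rq;

	if (!(req->rq_flags & RQD_DRV_INITIALIZED)) {
		/* whatever ->initialize_rq_fn() would have done */
		scsi_initialize_rq(req);
		req->rq_flags |= RQD_DRV_INITIALIZED;
	}

	/* ... the existing request setup and dispatch ... */
	return BLK_STS_OK;
}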


Re: [PATCH 08/10] block: provide a generic_make_request_fast helper

2017-08-28 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 10:00:35AM +0300, Sagi Grimberg wrote:
>> This helper allows reinserting a bio into a new queue without much
>> overhead, but requires all queue limits to be the same for the upper
>> and lower queues, and it does not provide any recursion preventions,
>> which requires the caller to not split the bio.
>
> Isn't the same limits constraint too restrictive?
>
> Say I have two paths to the same namespace via two different HBAs, each
> with its own virt_boundary capability for example? That would require us
> to split failover bio wouldn't it?

Uh oh - different transports for the same subsystem will be interesting.
For one it's not specified anywhere so I'd like to kick off a discussion
on the working group mailing list about it.

That being said ->make_request basically doesn't care about actual
limits at all, it mostly cares about support features (e.g. discard, fua,
etc).  So I think a lot of these limits could probably be lifted,
but I'd need to add the check to generic_make_request_checks back.

>> +/*
>> + * Fast-path version of generic_make_request.
>
> generic_make_request is also called in the fast-path, maybe reword it
> to: "Fast version of generic_make_request"

Yeah.  Maybe generic_make_request_direct or direct_make_request
is a better name, as it describes the recursion-avoiding bypass
a little better.


Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-28 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 09:51:14AM +0300, Sagi Grimberg wrote:
>> To allow lockless path lookup the list of nvme_ns structures per
>> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
>> structure through call_srcu.
>
> I haven't read the later patches yet, but what requires sleep in the
> path selection?

->make_request is allowed to sleep, and often will.
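
Plain RCU would not cover a reader that sleeps in ->make_request, which is
what pushes the teardown to call_srcu().  Roughly (a simplified sketch; the
siblings and rcu_head member names are assumptions, not the posted code):

static void nvme_free_ns_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct nvme_ns, rcu_head));
}

static void nvme_remove_path(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;

	/* caller holds the subsystem lock protecting the sibling list */
	list_del_rcu(&ns->siblings);

	/*
	 * Only free once all SRCU readers that might still see the entry
	 * (and may be sleeping in ->make_request) have finished.
	 */
	call_srcu(&head->srcu, &ns->rcu_head, nvme_free_ns_rcu);
}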

>> +} else {
>> +u8 eui64[8] = { 0 }, nguid[16] = { 0 };
>> +uuid_t uuid = uuid_null;
>> +
>> +nvme_report_ns_ids(ctrl, nsid, id, eui64, nguid, &uuid);
>> +if (!uuid_equal(&head->uuid, &uuid) ||
>> +memcmp(&head->nguid, &nguid, sizeof(head->nguid)) ||
>> +memcmp(&head->eui64, &eui64, sizeof(head->eui64))) {
>
> Suggestion, given that this matching pattern returns in several places
> would it be better to move it to nvme_ns_match_id()?

I'll look into it.  Maybe we'll need a nvme_ns_ids structure to avoid
having tons of parameters, though.
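
For illustration, such a structure could be as small as this (a sketch of the
idea only; nvme_ns_ids_equal() is a made-up helper):

struct nvme_ns_ids {
	u8	eui64[8];
	u8	nguid[16];
	uuid_t	uuid;
};

static bool nvme_ns_ids_equal(const struct nvme_ns_ids *a,
			      const struct nvme_ns_ids *b)
{
	return uuid_equal(&a->uuid, &b->uuid) &&
		!memcmp(a->nguid, b->nguid, sizeof(a->nguid)) &&
		!memcmp(a->eui64, b->eui64, sizeof(a->eui64));
}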

>>   +/*
>> + * Anchor structure for namespaces.  There is one for each namespace in a
>> + * NVMe subsystem that any of our controllers can see, and the namespace
>> + * structure for each controller is chained of it.  For private namespaces
>> + * there is a 1:1 relation to our namespace structures, that is ->list
>> + * only ever has a single entry for private namespaces.
>> + */
>> +struct nvme_ns_head {
>> +struct list_headlist;
>
> Maybe siblings is a better name than list,
> and the nvme_ns list_head should be called
> sibling_entry (or just sibling)?

Yeah.


Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-28 Thread Sagi Grimberg



This patch adds initial multipath support to the nvme driver.  For each
namespace we create a new block device node, which can be used to access
that namespace through any of the controllers that refer to it.

Currently we will always send I/O to the first available path, this will
be changed once the NVMe Asymmetric Namespace Access (ANA) TP is
ratified and implemented, at which point we will look at the ANA state
for each namespace.  Another possibility that was prototyped is to
use the path that is closest to the submitting NUMA node, which will be
mostly interesting for PCI, but might also be useful for RDMA or FC
transports in the future.  There is no plan to implement round robin
or I/O service time path selectors, as those are not scalable with
the performance rates provided by NVMe.

The multipath device will go away once all paths to it disappear;
any delay to keep it alive needs to be implemented at the controller
level.

TODO: implement sysfs interfaces for the new subsystem and
subsystem-namespace object.  Unless we can come up with something
better than sysfs here..

Signed-off-by: Christoph Hellwig 


Christoph,

This is really taking a lot into the nvme driver. I'm not sure if
this approach will be used in other block drivers, but would it
make sense to place the block_device node creation, the make_request
and failover logic, and maybe the path selection in the block layer,
leaving just the construction of the path mappings in nvme?


Re: [PATCH 09/10] blk-mq: add a blk_steal_bios helper

2017-08-28 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 08/10] block: provide a generic_make_request_fast helper

2017-08-28 Thread Sagi Grimberg

This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues, and it does not provide any recursion preventions,
which requires the caller to not split the bio.


Isn't the same limits constraint too restrictive?

Say I have two paths to the same namespace via two different HBAs, each
with its own virt_boundary capability for example? That would require us
to split failover bio wouldn't it?


+/*
+ * Fast-path version of generic_make_request.


generic_make_request is also called in the fast-path, maybe reword it
to: "Fast version of generic_make_request"


Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-28 Thread Sagi Grimberg



On 23/08/17 20:58, Christoph Hellwig wrote:

Introduce a new struct nvme_ns_head [1] that holds information about
an actual namespace, unlike struct nvme_ns, which only holds the
per-controller namespace information.  For private namespaces there
is a 1:1 relation of the two, but for shared namespaces this lets us
discover all the paths to it.  For now only the identifiers are moved
to the new structure, but most of the information in struct nvme_ns
should eventually move over.

To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.


I haven't read the later patches yet, but what requires sleep in the
path selection?



[1] comments welcome if you have a better name for it, the current one is
 horrible.  One idea would be to rename the current struct nvme_ns
 to struct nvme_ns_link or similar and use the nvme_ns name for the
 new structure.  But that would involve a lot of churn.


maybe nvme_ns_primary?



Signed-off-by: Christoph Hellwig 
---
  drivers/nvme/host/core.c | 218 +++
  drivers/nvme/host/lightnvm.c |  14 +--
  drivers/nvme/host/nvme.h |  26 +-
  3 files changed, 208 insertions(+), 50 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8884000dfbdd..abc5911a8a66 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -249,10 +249,28 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
  }
  EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
  
+static void nvme_destroy_ns_head(struct kref *ref)

+{
+   struct nvme_ns_head *head =
+   container_of(ref, struct nvme_ns_head, ref);
+
+   list_del_init(&head->entry);
+   cleanup_srcu_struct(&head->srcu);
+   kfree(head);
+}
+
+static void nvme_put_ns_head(struct nvme_ns_head *head)
+{
+   kref_put(&head->ref, nvme_destroy_ns_head);
+}
+
  static void nvme_free_ns(struct kref *kref)
  {
struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
  
+	if (ns->head)

+   nvme_put_ns_head(ns->head);
+
if (ns->ndev)
nvme_nvm_unregister(ns);
  
@@ -422,7 +440,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,

  {
memset(cmnd, 0, sizeof(*cmnd));
cmnd->common.opcode = nvme_cmd_flush;
-   cmnd->common.nsid = cpu_to_le32(ns->ns_id);
+   cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
  }
  
  static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,

@@ -453,7 +471,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, 
struct request *req,
  
  	memset(cmnd, 0, sizeof(*cmnd));

cmnd->dsm.opcode = nvme_cmd_dsm;
-   cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
+   cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id);
cmnd->dsm.nr = cpu_to_le32(segments - 1);
cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
  
@@ -492,7 +510,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
  
  	memset(cmnd, 0, sizeof(*cmnd));

cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
-   cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+   cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
  
@@ -977,7 +995,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)

	memset(&c, 0, sizeof(c));
c.rw.opcode = io.opcode;
c.rw.flags = io.flags;
-   c.rw.nsid = cpu_to_le32(ns->ns_id);
+   c.rw.nsid = cpu_to_le32(ns->head->ns_id);
c.rw.slba = cpu_to_le64(io.slba);
c.rw.length = cpu_to_le16(io.nblocks);
c.rw.control = cpu_to_le16(io.control);
@@ -1041,7 +1059,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t 
mode,
switch (cmd) {
case NVME_IOCTL_ID:
force_successful_syscall_return();
-   return ns->ns_id;
+   return ns->head->ns_id;
case NVME_IOCTL_ADMIN_CMD:
return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
case NVME_IOCTL_IO_CMD:
@@ -1248,7 +1266,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
return -ENODEV;
}
  
-	id = nvme_identify_ns(ctrl, ns->ns_id);

+   id = nvme_identify_ns(ctrl, ns->head->ns_id);
if (!id)
return -ENODEV;
  
@@ -1257,12 +1275,12 @@ static int nvme_revalidate_disk(struct gendisk *disk)

goto out;
}
  
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);

-   if (!uuid_equal(&ns->uuid, &uuid) ||
-   memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
-   memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
+   nvme_report_ns_ids(ctrl, ns->head->ns_id, id, eui64, nguid, &uuid);
+   if (!uuid_equal(&ns->head->uuid, &uuid) ||
+   

Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate

2017-08-28 Thread Sagi Grimberg



Shouldn't uuid,nguid,eui64 record the previous values prior to calling
nvme_report_ns_ids?


The previous values are still in the namespace structure, and we use
the on-stack variables for the current ones.


Of course, misread that one, thanks.


Re: [PATCH 06/10] nvme: track subsystems

2017-08-28 Thread Sagi Grimberg

This looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate

2017-08-28 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 09:17:44AM +0300, Sagi Grimberg wrote:
>> +u8 eui64[8] = { 0 }, nguid[16] = { 0 };
>> +uuid_t uuid = uuid_null;
>>  int ret = 0;
>>  if (test_bit(NVME_NS_DEAD, &ns->flags)) {
>> @@ -1252,7 +1254,15 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>>  goto out;
>>  }
>>   -  nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
>> +nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
>> +if (!uuid_equal(&ns->uuid, &uuid) ||
>> +memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
>> +memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
>
> Shouldn't uuid,nguid,eui64 record the previous values prior to calling
> nvme_report_ns_ids?

The previous values are still in the namespace structure, and we use
the on-stack variables for the current ones.


Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate

2017-08-28 Thread Sagi Grimberg

Instead validate that these identifiers do not change, as that is
prohibited by the specification.

Signed-off-by: Christoph Hellwig 
---
  drivers/nvme/host/core.c | 12 +++-
  1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 157dbb7b328d..179ade01745b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1236,6 +1236,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
struct nvme_ns *ns = disk->private_data;
struct nvme_ctrl *ctrl = ns->ctrl;
struct nvme_id_ns *id;
+   u8 eui64[8] = { 0 }, nguid[16] = { 0 };
+   uuid_t uuid = uuid_null;
int ret = 0;
  
	if (test_bit(NVME_NS_DEAD, &ns->flags)) {

@@ -1252,7 +1254,15 @@ static int nvme_revalidate_disk(struct gendisk *disk)
goto out;
}
  
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);

+   nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
+   if (!uuid_equal(&ns->uuid, &uuid) ||
+   memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
+   memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {


Shouldn't uuid,nguid,eui64 record the previous values prior to calling
nvme_report_ns_ids?


Re: [PATCH 04/10] nvme: remove nvme_revalidate_ns

2017-08-28 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 03/10] nvme: remove unused struct nvme_ns fields

2017-08-28 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context

2017-08-28 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 01/10] nvme: report more detailed status codes to the block layer

2017-08-28 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg