[PATCH] block: Make blk_dequeue_request() static
The only caller of this function is blk_start_request() in the same
file. Fix the blk_start_request() description accordingly.

Signed-off-by: Damien Le Moal
---
 block/blk-core.c | 5 +----
 block/blk.h      | 1 -
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc1af9097dff..d709c0e3a2ac 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2615,7 +2615,7 @@ struct request *blk_peek_request(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_peek_request);
 
-void blk_dequeue_request(struct request *rq)
+static void blk_dequeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
@@ -2642,9 +2642,6 @@ void blk_dequeue_request(struct request *rq)
  * Description:
  *     Dequeue @req and start timeout timer on it. This hands off the
  *     request to the driver.
- *
- * Block internal functions which don't want to start timer should
- * call blk_dequeue_request().
  */
 void blk_start_request(struct request *req)
 {
diff --git a/block/blk.h b/block/blk.h
index fde8b351c166..fcb9775b997d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -64,7 +64,6 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
 void blk_queue_bypass_end(struct request_queue *q);
-void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
-- 
2.13.5
Re: [PATCH 07/10] nvme: track shared namespaces
Hi, Christoph

On 2017/8/28 22:30, Christoph Hellwig wrote:
> On Mon, Aug 28, 2017 at 08:41:23PM +0800, Guan Junxiong wrote:
> > If a namespace can be accessed by another subsystem, the above line
> > will ignore such namespace.
>
> And that's intentional.

As for the __nvme_find_ns_head function, can it look up the namespace
globally rather than only in the current subsystem? Take a hypermetro
scenario for example: two namespaces that should be viewed as the same
namespace by a database application, but that exist in two different
cities. Some vendors may specify those two namespaces with the same
UUID.

In addition, could you add a switch to choose between finding namespaces
at the subsystem-wide level and globally?

> > Or does the NVMe/NVMf specification constrain that any namespace
> > can only be accessed by a subsystem?
>
> Yes, inside the NVMe spec a Namespace is contained inside a Subsystem.
> That doesn't preclude other ways to access the LBAs, but they are outside
> the specification (e.g. also exporting them as SCSI).

Can a namespace be shared between two subsystems? If not, for the
hypermetro scenario we need to keep more information of a storage array
in city A synchronized with the other in city B.
Re: I/O hangs after resuming from suspend-to-ram
On Mon, Aug 28, 2017 at 08:22:26PM +0200, Oleksandr Natalenko wrote:
> Hi.
>
> On pondělí 28. srpna 2017 14:58:28 CEST Ming Lei wrote:
> > Could you verify if the following patch fixes your issue?
> > …SNIP…
>
> I've applied it to v4.12.9 and rechecked — the issue is still there,
> unfortunately. Stacktrace is the same as before.
>
> Were you able to reproduce it in a VM?

Yes, I can.

> Should I re-check it with v4.13-rc7?
>
> Any other suggestions?

Please test it with v4.13-rc7 first.

-- 
Ming
Re: [PATCH 2/6] nbd: make device_attribute const
On 08/21/2017 05:43 AM, Bhumika Goyal wrote:
> Make this const as it is only passed as an argument to the
> function device_create_file and device_remove_file and the corresponding
> arguments are of type const.
>
> Done using Coccinelle.

Added for 4.14, thanks.

-- 
Jens Axboe
Re: [PATCH] block/nullb: delete unnecessary memory free
On 08/28/2017 03:04 PM, Shaohua Li wrote:
> On Mon, Aug 28, 2017 at 02:55:59PM -0600, Jens Axboe wrote:
>> On 08/28/2017 02:49 PM, Shaohua Li wrote:
>>> Commit 2984c86 ("nullb: factor disk parameters") has a typo. The
>>> nullb_device allocation/free is done outside of null_add_dev. The commit
>>> accidentally frees the nullb_device in the error code path.
>>
>> Is the code in nullb_device_power_store() correct after this?
>
> Yes, for runtime disk configuration, we allocate the device in
> nullb_group_make_item and free the device in nullb_device_release.
> nullb_device_power_store only does null_add_dev/null_del_dev, not
> allocate/free dev.

OK good, just wanted to confirm.

>> Also, looks like the second else if in there should be using
>> 'dev' instead of to_nullb_device(item)->power.
>
> Ah, yes, it should be 'dev->power'. dev == to_nullb_device(item). Do you
> want me to send a separate patch to do the cleanup?

No worries, I'll just do it myself.

-- 
Jens Axboe
Re: [PATCH RFC] Block/blk-wbt: do not let background writes block sync writes
On 08/28/2017 11:45 AM, Liu Bo wrote:
> On Sun, Aug 27, 2017 at 11:14:20AM -0600, Jens Axboe wrote:
> > On 08/25/2017 06:14 PM, Liu Bo wrote:
> > > While using blk-wbt, sometimes sync writes are blocked by background
> > > writes, for example,
> > >
> > > a) a background write reaches the (background) limit returned by
> > > get_limit(), so it's added into the rqw->wait (non kswapd's rqw in
> > > this case) and goes to sleep.
> > >
> > > b) then a sync write gets queued and goes to sleep when finding that
> > > waitqueue_active() returns true and someone else is already on the
> > > waiting list.
> > >
> > > Thus, the sync write will get its rq after the background write
> > > getting rq.
> > >
> > > With this, only background writes will check waitqueue's status and
> > > sync writes will only be throttled by the (max) limit returned by
> > > get_limit().
> >
> > Curious how you ran into this. Is this just speculation, or is it
> > observed behavior?
>
> I got that from a simple test,
>
> xfs_io -f -c "falloc 0 4K" /mnt/btrfs/foobar
> xfs_io -f -c "falloc 0 5G" /mnt/btrfs/dummy
> xfs_io -f -c "pwrite -b 128M 0 5G" /mnt/btrfs/dummy &
>
> off=0
> while true; do
> 	date
> 	let $((off = off + 1))
> 	xfs_io -s -c "pwrite $off 1" /mnt/btrfs/foobar > /dev/null
> 	sleep 1
> done
>
> ---
>
> So a lot of writeback is happening in the background, and a write with
> O_SYNC is issued every 1 sec.
>
> My ftrace observation shows that the write with O_SYNC is waiting in
> may_queue() because of the running writeback writes.
>
> kworker/u16:2-137 [003]  3104.193996: block_bio_queue: 8,64 W 8751312 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194003: may_queue.part.12: inflight 23 limit 24 wait c97cb500
> kworker/u16:2-137 [003]  3104.194003: block_getrq: 8,64 W 8751312 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194004: blk_mq_flush_plug_list <-blk_flush_plug_list
> kworker/u16:2-137 [003]  3104.194004: block_unplug: [kworker/u16:2] 1
> kworker/u16:2-137 [003]  3104.194005: block_rq_insert: 8,64 W 1048576 () 8749264 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194006: block_rq_issue: 8,64 W 1048576 () 8749264 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194017: block_plug: [kworker/u16:2]
> kworker/u16:2-137 [003] d.h. 3104.194454: block_rq_complete: 8,64 W () 8704208 + 2048 [0]
> kworker/u16:2-137 [003]  3104.194978: block_bio_queue: 8,64 W 8753360 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194985: may_queue.part.12: inflight 23 limit 24 wait c97cb500
> kworker/u16:2-137 [003]  3104.194985: block_getrq: 8,64 W 8753360 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194986: blk_mq_flush_plug_list <-blk_flush_plug_list
> kworker/u16:2-137 [003]  3104.194987: block_unplug: [kworker/u16:2] 1
> kworker/u16:2-137 [003]  3104.194987: block_rq_insert: 8,64 W 1048576 () 8751312 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194990: block_rq_issue: 8,64 W 1048576 () 8751312 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.195001: block_plug: [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.195654: block_bio_queue: 8,64 W 8755408 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.195661: may_queue.part.12: inflight 24 limit 24 wait c97cb500
> kworker/u16:2-137 [003]  3104.195662: may_queue.part.12: inflight 24 limit 24 wait c97cb500
>                                       ^^^
> (wb write puts itself on rqw->wait when reaching the limit)
>
> kworker/u16:2-137 [003]  3104.195662: io_schedule <-wbt_wait
> kworker/u16:2-137 [003]  3104.195662: blk_mq_flush_plug_list <-blk_flush_plug_list
> kworker/u16:2-137 [003]  3104.195663: block_unplug: [kworker/u16:2] 1
> kworker/u16:2-137 [003]  3104.195663: block_rq_insert: 8,64 W 1048576 () 8753360 + 2048 [kworker/u16:2]
> kworker/3:1H-162 [003]  3104.195680: block_rq_issue: 8,64 W 1048576 () 8753360 + 2048 [kworker/3:1H]
> <idle>-0 [003] d.h. 3104.200533: block_rq_complete: 8,64 W () 8706256 + 2048 [0]
> <idle>-0 [003] d.h. 3104.205473: block_rq_complete: 8,64 W () 8708304 + 2048 [0]
> <idle>-0 [003] d.h. 3104.211451: block_rq_complete: 8,64 W () 8710352 + 2048 [0]
> <idle>-0 [003] d.h. 3104.216163: block_rq_complete: 8,64 W () 8712400 + 2048 [0]
> <idle>-0 [003] d.h. 3104.222072: block_rq_complete: 8,64 W () 8714448 + 2048 [0]
> <idle>-0 [003] d.h. 3104.226931: block_rq_complete: 8,64 W () 8716496 + 2048 [0]
> <idle>-0 [003] d.h.
Re: [PATCH 1/6] ACPI: make device_attribute const
On Monday, August 21, 2017 1:43:07 PM CEST Bhumika Goyal wrote:
> Make these const as they are only passed as an argument to the function
> device_create_file and device_remove_file and the corresponding
> arguments are of type const.
>
> Done using Coccinelle.
>
> Signed-off-by: Bhumika Goyal
> ---
>  drivers/acpi/battery.c | 2 +-
>  drivers/acpi/sbs.c     | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
> index 1cbb88d..13e7b56 100644
> --- a/drivers/acpi/battery.c
> +++ b/drivers/acpi/battery.c
> @@ -620,7 +620,7 @@ static ssize_t acpi_battery_alarm_store(struct device *dev,
>  	return count;
>  }
>  
> -static struct device_attribute alarm_attr = {
> +static const struct device_attribute alarm_attr = {
>  	.attr = {.name = "alarm", .mode = 0644},
>  	.show = acpi_battery_alarm_show,
>  	.store = acpi_battery_alarm_store,
> diff --git a/drivers/acpi/sbs.c b/drivers/acpi/sbs.c
> index a184637..a2428e9 100644
> --- a/drivers/acpi/sbs.c
> +++ b/drivers/acpi/sbs.c
> @@ -474,7 +474,7 @@ static ssize_t acpi_battery_alarm_store(struct device *dev,
>  	return count;
>  }
>  
> -static struct device_attribute alarm_attr = {
> +static const struct device_attribute alarm_attr = {
>  	.attr = {.name = "alarm", .mode = 0644},
>  	.show = acpi_battery_alarm_show,
>  	.store = acpi_battery_alarm_store,

Applied, thanks!
Re: [PATCH] block/nullb: delete unnecessary memory free
On Mon, Aug 28, 2017 at 02:55:59PM -0600, Jens Axboe wrote:
> On 08/28/2017 02:49 PM, Shaohua Li wrote:
> > Commit 2984c86 ("nullb: factor disk parameters") has a typo. The
> > nullb_device allocation/free is done outside of null_add_dev. The commit
> > accidentally frees the nullb_device in the error code path.
>
> Is the code in nullb_device_power_store() correct after this?

Yes, for runtime disk configuration, we allocate the device in
nullb_group_make_item and free the device in nullb_device_release.
nullb_device_power_store only does null_add_dev/null_del_dev, not
allocate/free dev.

> Also, looks like the second else if in there should be using
> 'dev' instead of to_nullb_device(item)->power.

Ah, yes, it should be 'dev->power'. dev == to_nullb_device(item). Do you
want me to send a separate patch to do the cleanup?

Thanks,
Shaohua
Re: [PATCH] block/nullb: delete unnecessary memory free
On 08/28/2017 02:49 PM, Shaohua Li wrote:
> Commit 2984c86 ("nullb: factor disk parameters") has a typo. The
> nullb_device allocation/free is done outside of null_add_dev. The commit
> accidentally frees the nullb_device in the error code path.

Is the code in nullb_device_power_store() correct after this?

Also, looks like the second else if in there should be using
'dev' instead of to_nullb_device(item)->power.

-- 
Jens Axboe
[PATCH] block/nullb: delete unnecessary memory free
Commit 2984c86 ("nullb: factor disk parameters") has a typo. The
nullb_device allocation/free is done outside of null_add_dev. The commit
accidentally frees the nullb_device in the error code path.

Reported-by: Dan Carpenter
Signed-off-by: Shaohua Li
---
 drivers/block/null_blk.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 4d328e3..18f2cb3 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1909,7 +1909,6 @@ static int null_add_dev(struct nullb_device *dev)
 out_free_nullb:
 	kfree(nullb);
 out:
-	null_free_dev(dev);
 	return rv;
 }
-- 
2.9.5
Re: [PATCH 07/10] nvme: track shared namespaces
On Mon, 2017-08-28 at 09:51 +0300, Sagi Grimberg wrote:
> On 23/08/17 20:58, Christoph Hellwig wrote:
> > Introduce a new struct nvme_ns_head [1] that holds information about
> > an actual namespace, unlike struct nvme_ns, which only holds the
> > per-controller namespace information. For private namespaces there
> > is a 1:1 relation of the two, but for shared namespaces this lets us
> > discover all the paths to it. For now only the identifiers are moved
> > to the new structure, but most of the information in struct nvme_ns
> > should eventually move over.
> >
> > To allow lockless path lookup the list of nvme_ns structures per
> > nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> > structure through call_srcu.
>
> I haven't read the later patches yet, but what requires sleep in the
> path selection?
>
> > [1] comments welcome if you have a better name for it, the current one
> >     is horrible. One idea would be to rename the current struct nvme_ns
> >     to struct nvme_ns_link or similar and use the nvme_ns name for the
> >     new structure. But that would involve a lot of churn.
>
> maybe nvme_ns_primary?

Since it looks like it holds all unique identifier values and should
hold other namespace characteristics later, maybe: nvme_ns_item? Or
nvme_ns_entry? Or nvme_ns_element? Or nvme_ns_unit? Or nvme_ns_entity?
Or nvme_ns_container?

> > +/*
> > + * Anchor structure for namespaces. There is one for each namespace in a
> > + * NVMe subsystem that any of our controllers can see, and the namespace
> > + * structure for each controller is chained of it. For private namespaces
> > + * there is a 1:1 relation to our namespace structures, that is ->list
> > + * only ever has a single entry for private namespaces.
> > + */
> > +struct nvme_ns_head {
> > +	struct list_head	list;
>
> Maybe siblings is a better name than list,
> and the nvme_ns list_head should be called
> sibling_entry (or just sibling)?

I think that sounds good too.

> > +	struct srcu_struct	srcu;
> > +	unsigned		ns_id;
> > +	u8			eui64[8];
> > +	u8			nguid[16];
> > +	uuid_t			uuid;
> > +	struct list_head	entry;
> > +	struct kref		ref;
> > +};
> > +
> >  struct nvme_ns {
> >  	struct list_head list;
> >
> >  	struct nvme_ctrl *ctrl;
> >  	struct request_queue *queue;
> >  	struct gendisk *disk;
> > +	struct list_head siblings;
> >  	struct nvm_dev *ndev;
> >  	struct kref kref;
> > +	struct nvme_ns_head *head;
> >  	int instance;
> >
> > -	u8 eui[8];
> > -	u8 nguid[16];
> > -	uuid_t uuid;
> > -
> > -	unsigned ns_id;
> >  	int lba_shift;
> >  	u16 ms;
> >  	u16 sgs;
Re: [GIT PULL] nvme update for Linux 4.14
> > > Meh, I don't think that'll work out - the rdma refactoring from
> > > Sagi deeply depends on the uninit_ctrl split..
> >
> > I can do a for-4.14/block-postmerge branch, which is for-4.14/block
> > with -rc7 pulled in. Should apply on top of that.
> >
> > Did that branch, doesn't apply. But should be easier for you to
> > rebase on top of that branch. Pushed it out.
>
> I've pushed out a nvme-4.14-rebase branch with the rebase, but I didn't
> have time for non-trivial testing yet, that'll have to wait for
> tomorrow.
>
> Sagi: can you double check the resolutions around blk_mq_reinit_tagset?

Thanks for doing it Christoph,

It looks good to me, there is one glitch in patch:
"nvme-rdma: reuse configure/destroy_admin_queue"

The "nvme_rdma_configure_admin_queue" block needs to be squashed into
"nvme-rdma: don't free tagset on resets".

I'll do that now.
Re: [PATCH 07/10] nvme: track shared namespaces
Looks good. Reviewed-by: Keith Busch
Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
Looks good. Reviewed-by: Keith Busch
Re: [PATCH 04/10] nvme: remove nvme_revalidate_ns
Looks good. Reviewed-by: Keith Busch
Re: [PATCH RFC] Block/blk-wbt: do not let background writes block sync writes
On Mon, Aug 28, 2017 at 11:45:31AM -0600, Liu Bo wrote:
> On Sun, Aug 27, 2017 at 11:14:20AM -0600, Jens Axboe wrote:
> > On 08/25/2017 06:14 PM, Liu Bo wrote:
> > > While using blk-wbt, sometimes sync writes are blocked by background
> > > writes, for example,
> > >
> > > a) a background write reaches the (background) limit returned by
> > > get_limit(), so it's added into the rqw->wait (non kswapd's rqw in
> > > this case) and goes to sleep.
> > >
> > > b) then a sync write gets queued and goes to sleep when finding that
> > > waitqueue_active() returns true and someone else is already on the
> > > waiting list.
> > >
> > > Thus, the sync write will get its rq after the background write
> > > getting rq.
> > >
> > > With this, only background writes will check waitqueue's status and
> > > sync writes will only be throttled by the (max) limit returned by
> > > get_limit().
> >
> > Curious how you ran into this. Is this just speculation, or is it
> > observed behavior?
>
> I got that from a simple test,
>
> xfs_io -f -c "falloc 0 4K" /mnt/btrfs/foobar
> xfs_io -f -c "falloc 0 5G" /mnt/btrfs/dummy
> xfs_io -f -c "pwrite -b 128M 0 5G" /mnt/btrfs/dummy &
>
> off=0
> while true; do
> 	date
> 	let $((off = off + 1))
> 	xfs_io -s -c "pwrite $off 1" /mnt/btrfs/foobar > /dev/null
> 	sleep 1
> done
>
> ---
>
> So a lot of writeback is happening in the background, and a write with
> O_SYNC is issued every 1 sec.
>
> My ftrace observation shows that the write with O_SYNC is waiting in
> may_queue() because of the running writeback writes.
>
> kworker/u16:2-137 [003]  3104.193996: block_bio_queue: 8,64 W 8751312 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194003: may_queue.part.12: inflight 23 limit 24 wait c97cb500
> kworker/u16:2-137 [003]  3104.194003: block_getrq: 8,64 W 8751312 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194004: blk_mq_flush_plug_list <-blk_flush_plug_list
> kworker/u16:2-137 [003]  3104.194004: block_unplug: [kworker/u16:2] 1
> kworker/u16:2-137 [003]  3104.194005: block_rq_insert: 8,64 W 1048576 () 8749264 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194006: block_rq_issue: 8,64 W 1048576 () 8749264 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194017: block_plug: [kworker/u16:2]
> kworker/u16:2-137 [003] d.h. 3104.194454: block_rq_complete: 8,64 W () 8704208 + 2048 [0]
> kworker/u16:2-137 [003]  3104.194978: block_bio_queue: 8,64 W 8753360 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194985: may_queue.part.12: inflight 23 limit 24 wait c97cb500
> kworker/u16:2-137 [003]  3104.194985: block_getrq: 8,64 W 8753360 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194986: blk_mq_flush_plug_list <-blk_flush_plug_list
> kworker/u16:2-137 [003]  3104.194987: block_unplug: [kworker/u16:2] 1
> kworker/u16:2-137 [003]  3104.194987: block_rq_insert: 8,64 W 1048576 () 8751312 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.194990: block_rq_issue: 8,64 W 1048576 () 8751312 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.195001: block_plug: [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.195654: block_bio_queue: 8,64 W 8755408 + 2048 [kworker/u16:2]
> kworker/u16:2-137 [003]  3104.195661: may_queue.part.12: inflight 24 limit 24 wait c97cb500
> kworker/u16:2-137 [003]  3104.195662: may_queue.part.12: inflight 24 limit 24 wait c97cb500
>                                       ^^^
> (wb write puts itself on rqw->wait when reaching the limit)
>
> kworker/u16:2-137 [003]  3104.195662: io_schedule <-wbt_wait
> kworker/u16:2-137 [003]  3104.195662: blk_mq_flush_plug_list <-blk_flush_plug_list
> kworker/u16:2-137 [003]  3104.195663: block_unplug: [kworker/u16:2] 1
> kworker/u16:2-137 [003]  3104.195663: block_rq_insert: 8,64 W 1048576 () 8753360 + 2048 [kworker/u16:2]
> kworker/3:1H-162 [003]  3104.195680: block_rq_issue: 8,64 W 1048576 () 8753360 + 2048 [kworker/3:1H]
> <idle>-0 [003] d.h. 3104.200533: block_rq_complete: 8,64 W () 8706256 + 2048 [0]
> <idle>-0 [003] d.h. 3104.205473: block_rq_complete: 8,64 W () 8708304 + 2048 [0]
> <idle>-0 [003] d.h. 3104.211451: block_rq_complete: 8,64 W () 8710352 + 2048 [0]
> <idle>-0 [003] d.h. 3104.216163: block_rq_complete: 8,64 W () 8712400 + 2048 [0]
> <idle>-0 [003] d.h. 3104.222072: block_rq_complete: 8,64 W () 8714448 + 2048 [0]
> <idle>-0 [003] d.h. 3104.226931: block_rq_complete:
Re: [PATCH 03/10] nvme: remove unused struct nvme_ns fields
Looks good. Reviewed-by: Keith Busch
Re: [GIT PULL] nvme update for Linux 4.14
On 08/28/2017 12:40 PM, Christoph Hellwig wrote:
> Meh, I don't think that'll work out - the rdma refactoring from
> Sagi deeply depends on the uninit_ctrl split..

I can do a for-4.14/block-postmerge branch, which is for-4.14/block
with -rc7 pulled in. Should apply on top of that.

Did that branch, doesn't apply. But should be easier for you to rebase
on top of that branch. Pushed it out.

-- 
Jens Axboe
[bug report] nullb: add interface to power on disk
Hello Shaohua Li,

The patch cedcafad8277: "nullb: add interface to power on disk" from
Aug 14, 2017, leads to the following static checker warning:

	drivers/block/null_blk.c:372 nullb_device_power_store()
	warn: 'dev' was already freed.

	drivers/block/null_blk.c:1993 null_init()
	error: double free of 'dev'

drivers/block/null_blk.c
   356
   357  static ssize_t nullb_device_power_store(struct config_item *item,
   358  				     const char *page, size_t count)
   359  {
   360  	struct nullb_device *dev = to_nullb_device(item);
   361  	bool newp = false;
   362  	ssize_t ret;
   363
   364  	ret = nullb_device_bool_attr_store(&newp, page, count);
   365  	if (ret < 0)
   366  		return ret;
   367
   368  	if (!dev->power && newp) {
   369  		if (test_and_set_bit(NULLB_DEV_FL_UP, &dev->flags))
   370  			return count;
   371  		if (null_add_dev(dev)) {
                            ^^^^^^^^^^^^^^^^
null_add_dev() frees dev on failure

   372  			clear_bit(NULLB_DEV_FL_UP, &dev->flags);
                                                           ^^^^^^^^^^^
so this is a use after free.

The other time null_add_dev() is called is buggy as well. I feel like
it shouldn't be freeing "dev" because that's a layering violation.

   373  			return -ENOMEM;
   374  		}
   375
   376  		set_bit(NULLB_DEV_FL_CONFIGURED, &dev->flags);
   377  		dev->power = newp;
   378  	} else if (to_nullb_device(item)->power && !newp) {
   379  		mutex_lock(&lock);
   380  		dev->power = newp;
   381  		null_del_dev(dev->nullb);
   382  		mutex_unlock(&lock);
   383  		clear_bit(NULLB_DEV_FL_UP, &dev->flags);
   384  	}
   385
   386  	return count;
   387  }

regards,
dan carpenter
Re: [PATCH RFC] Block/blk-wbt: do not let background writes block sync writes
On Sun, Aug 27, 2017 at 11:14:20AM -0600, Jens Axboe wrote:
> On 08/25/2017 06:14 PM, Liu Bo wrote:
> > While using blk-wbt, sometimes sync writes are blocked by background
> > writes, for example,
> >
> > a) a background write reaches the (background) limit returned by
> > get_limit(), so it's added into the rqw->wait (non kswapd's rqw in
> > this case) and goes to sleep.
> >
> > b) then a sync write gets queued and goes to sleep when finding that
> > waitqueue_active() returns true and someone else is already on the
> > waiting list.
> >
> > Thus, the sync write will get its rq after the background write
> > getting rq.
> >
> > With this, only background writes will check waitqueue's status and
> > sync writes will only be throttled by the (max) limit returned by
> > get_limit().
>
> Curious how you ran into this. Is this just speculation, or is it
> observed behavior?

I got that from a simple test,

xfs_io -f -c "falloc 0 4K" /mnt/btrfs/foobar
xfs_io -f -c "falloc 0 5G" /mnt/btrfs/dummy
xfs_io -f -c "pwrite -b 128M 0 5G" /mnt/btrfs/dummy &

off=0
while true; do
	date
	let $((off = off + 1))
	xfs_io -s -c "pwrite $off 1" /mnt/btrfs/foobar > /dev/null
	sleep 1
done

---

So a lot of writeback is happening in the background, and a write with
O_SYNC is issued every 1 sec.

My ftrace observation shows that the write with O_SYNC is waiting in
may_queue() because of the running writeback writes.

kworker/u16:2-137 [003]  3104.193996: block_bio_queue: 8,64 W 8751312 + 2048 [kworker/u16:2]
kworker/u16:2-137 [003]  3104.194003: may_queue.part.12: inflight 23 limit 24 wait c97cb500
kworker/u16:2-137 [003]  3104.194003: block_getrq: 8,64 W 8751312 + 2048 [kworker/u16:2]
kworker/u16:2-137 [003]  3104.194004: blk_mq_flush_plug_list <-blk_flush_plug_list
kworker/u16:2-137 [003]  3104.194004: block_unplug: [kworker/u16:2] 1
kworker/u16:2-137 [003]  3104.194005: block_rq_insert: 8,64 W 1048576 () 8749264 + 2048 [kworker/u16:2]
kworker/u16:2-137 [003]  3104.194006: block_rq_issue: 8,64 W 1048576 () 8749264 + 2048 [kworker/u16:2]
kworker/u16:2-137 [003]  3104.194017: block_plug: [kworker/u16:2]
kworker/u16:2-137 [003] d.h. 3104.194454: block_rq_complete: 8,64 W () 8704208 + 2048 [0]
kworker/u16:2-137 [003]  3104.194978: block_bio_queue: 8,64 W 8753360 + 2048 [kworker/u16:2]
kworker/u16:2-137 [003]  3104.194985: may_queue.part.12: inflight 23 limit 24 wait c97cb500
kworker/u16:2-137 [003]  3104.194985: block_getrq: 8,64 W 8753360 + 2048 [kworker/u16:2]
kworker/u16:2-137 [003]  3104.194986: blk_mq_flush_plug_list <-blk_flush_plug_list
kworker/u16:2-137 [003]  3104.194987: block_unplug: [kworker/u16:2] 1
kworker/u16:2-137 [003]  3104.194987: block_rq_insert: 8,64 W 1048576 () 8751312 + 2048 [kworker/u16:2]
kworker/u16:2-137 [003]  3104.194990: block_rq_issue: 8,64 W 1048576 () 8751312 + 2048 [kworker/u16:2]
kworker/u16:2-137 [003]  3104.195001: block_plug: [kworker/u16:2]
kworker/u16:2-137 [003]  3104.195654: block_bio_queue: 8,64 W 8755408 + 2048 [kworker/u16:2]
kworker/u16:2-137 [003]  3104.195661: may_queue.part.12: inflight 24 limit 24 wait c97cb500
kworker/u16:2-137 [003]  3104.195662: may_queue.part.12: inflight 24 limit 24 wait c97cb500
                                      ^^^
(wb write puts itself on rqw->wait when reaching the limit)

kworker/u16:2-137 [003]  3104.195662: io_schedule <-wbt_wait
kworker/u16:2-137 [003]  3104.195662: blk_mq_flush_plug_list <-blk_flush_plug_list
kworker/u16:2-137 [003]  3104.195663: block_unplug: [kworker/u16:2] 1
kworker/u16:2-137 [003]  3104.195663: block_rq_insert: 8,64 W 1048576 () 8753360 + 2048 [kworker/u16:2]
kworker/3:1H-162 [003]  3104.195680: block_rq_issue: 8,64 W 1048576 () 8753360 + 2048 [kworker/3:1H]
<idle>-0 [003] d.h. 3104.200533: block_rq_complete: 8,64 W () 8706256 + 2048 [0]
<idle>-0 [003] d.h. 3104.205473: block_rq_complete: 8,64 W () 8708304 + 2048 [0]
<idle>-0 [003] d.h. 3104.211451: block_rq_complete: 8,64 W () 8710352 + 2048 [0]
<idle>-0 [003] d.h. 3104.216163: block_rq_complete: 8,64 W () 8712400 + 2048 [0]
<idle>-0 [003] d.h. 3104.222072: block_rq_complete: 8,64 W () 8714448 + 2048 [0]
<idle>-0 [003] d.h. 3104.226931: block_rq_complete: 8,64 W () 8716496 + 2048 [0]
<idle>-0 [003] d.h. 3104.232915: block_rq_complete: 8,64 W () 8718544 + 2048 [0]
<idle>-0 [003] d.h. 3104.238851: block_rq_complete: 8,64 W () 8720592 + 2048 [0]
<idle>-0 [003] d.h. 3104.243593: block_rq_complete:
Re: [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context
Looks good. Reviewed-by: Keith Busch
Re: [PATCH 01/10] nvme: report more detailed status codes to the block layer
Looks good. Reviewed-by: Keith Busch
Re: [GIT PULL] nvme update for Linux 4.14
Meh, I don't think that'll work out - the rdma refactoring from
Sagi deeply depends on the uninit_ctrl split..
Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests
On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> I still disagree that we should have an indirect function call like this
> in the fast path.
>
> All that can be done by clearing or setting a flag on the first call to
> ->queue_rq or ->queuecommand instead. In NVMe we use RQF_DONTPREP for
> that, but for SCSI we probably can't use that given that it has more
> meaning for the old request path. But how about just adding a new
> RQD_DRV_INITIALIZED or similar flag?

Hello Christoph,

Sorry, but I'm not enthusiastic about the proposal to introduce a
RQD_DRV_INITIALIZED or similar flag. That approach involves an annoying
behavior difference, namely that .initialize_rq_fn() would be called
from inside blk_get_request() for pass-through requests but from inside
the prep function for filesystem requests. Another disadvantage of that
approach is that the block layer core never clears request.atomic_flags
completely but only sets and clears individual flags. The SCSI core
would have to follow that model, and hence code for clearing
RQD_DRV_INITIALIZED would have to be added to all request completion
paths in the SCSI core.

Have you noticed that Ming Lei's patch series introduces several new
atomic operations in the hot path? I'm referring here to the
BLK_MQ_S_DISPATCH_BUSY manipulations. Have you noticed that for SCSI
drivers these patches introduce an overhead between 0.1 and 1.0
microseconds per I/O request in the hot path? I have derived these
numbers from the random write SRP performance numbers as follows:
1/142460 - 1/142990 = 0.026 microseconds. That number has to be
multiplied with the number of I/O requests processed in parallel. The
number of jobs in Ming Lei's test was 64, but that's probably way higher
than the actual I/O parallelism.

Have you noticed that my patch did not add any atomic instructions to
the hot path but only a read of a function pointer that should already
be cache hot? As you know, modern CPUs are good at predicting branches.

Are you sure that my patch will have a measurable performance impact?

Bart.
Re: I/O hangs after resuming from suspend-to-ram
Hi.

On pondělí 28. srpna 2017 14:58:28 CEST Ming Lei wrote:
> Could you verify if the following patch fixes your issue?
> …SNIP…

I've applied it to v4.12.9 and rechecked — the issue is still there,
unfortunately. Stacktrace is the same as before.

Were you able to reproduce it in a VM?

Should I re-check it with v4.13-rc7?

Any other suggestions?

Thanks.
Re: [GIT PULL] nvme update for Linux 4.14
On 08/28/2017 11:57 AM, Christoph Hellwig wrote:
> On Mon, Aug 28, 2017 at 10:03:39AM -0600, Jens Axboe wrote:
>> But one rock is shittier than the other. If you based it against
>> 4.14, then at least I only have to resolve the problem once.
>>
>> The real problem is that the patch that went into master had
>> cleanups too. You should push back against that, since it will
>> inevitably cause problems by diverging 4.13 and 4.14 more than
>> absolutely necessary. Having conflicts in nvme for the block tree
>> is not isolated to this series, unfortunately.
>
> Given that the update doesn't appear to be in your for-next branch
> yet: do you want me to rebase it and resend?

Yes please.

-- 
Jens Axboe
ALPSS: reminder and submission deadline
Just one more month to go until the Alpine Linux Persistence and Storage
Summit (ALPSS), which will be held from September 27-29 at the
Lizumerhuette in Austria!

If you want to submit a 30-minute talk, please do so until Sep 1st, as
we plan to finalize our schedule. BOFs and team meetings will be
scheduled ad-hoc in the available meeting rooms or outside with a
beautiful mountain panorama.

If you only want to attend, you can do so until the last minute, as long
as space doesn't run out.

To submit a talk or request attendance, please send a mail to:

  alpss...@lists.infradead.org

More details are available on our website:

  http://www.alpss.at/

Thank you on behalf of the program committee:

Stephen Bates
Sagi Grimberg
Christoph Hellwig
Johannes Thumshirn
Richard Weinberger
Re: [PATCH] block: Fix warning when I/O elevator is changed as request_queue is being removed
On 08/27/2017 07:36 PM, Ming Lei wrote:
> On Wed, Aug 23, 2017 at 11:34 AM, Ming Lei wrote:
>> On Wed, Aug 9, 2017 at 9:44 AM, Ming Lei wrote:
>>> Hi David,
>>>
>>> On Wed, Aug 9, 2017 at 2:13 AM, David Jeffery wrote:
>>>> On 08/07/2017 07:53 PM, Ming Lei wrote:
>>>>> On Tue, Aug 8, 2017 at 3:38 AM, David Jeffery wrote:
>>>>>>
>>>>>> Signed-off-by: David Jeffery
>>>>>> ---
>>>>>>  block/blk-sysfs.c |    2 ++
>>>>>>  block/elevator.c  |    4 ++++
>>>>>>  2 files changed, 6 insertions(+)
>>>>>>
>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>> index 27aceab..b8362c0 100644
>>>>>> --- a/block/blk-sysfs.c
>>>>>> +++ b/block/blk-sysfs.c
>>>>>> @@ -931,7 +931,9 @@ void blk_unregister_queue(struct gendisk *disk)
>>>>>>         if (WARN_ON(!q))
>>>>>>                 return;
>>>>>>
>>>>>> +       mutex_lock(&q->sysfs_lock);
>>>>>>         queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
>>>>>> +       mutex_unlock(&q->sysfs_lock);
>>>>>
>>>>> Could you share why the lock of 'q->sysfs_lock' is needed here?
>>>>
>>>> As the elevator change is initiated through a sysfs attr file in the
>>>> queue directory, the task doing the elevator change already acquires
>>>> the q->sysfs_lock before it can try and change the elevator. Adding the
>>>
>>> It is e->sysfs_lock which is held in elv_attr_store(), instead of
>>> q->sysfs_lock.
>>
>> Looks I was wrong, and the store is from queue_attr_store() instead of
>> elv_attr_store().
>>
>> I can reproduce the issue and this patch can fix the issue in my test
>> on scsi_debug, so:
>>
>> Tested-by: Ming Lei
>>
>> And it is a typical race between removing the queue kobj and adding
>> children of this queue kobj. We can't acquire q->sysfs_lock in
>> blk_unregister_queue() because of sysfs's limit (it may cause deadlock),
>> so one state has to be used for this purpose, just like what
>> register/unregister hctx kobjs do, and it should be fine to use
>> QUEUE_FLAG_REGISTERED here. More changes are required if we use
>> e->registered, so this patch looks fine:
>>
>> Reviewed-by: Ming Lei
>
> Hi Jens,
>
> Could you consider this patch for v4.13 or v4.14?

Yep, added for 4.14, thanks.

-- 
Jens Axboe
Re: [PATCH] block, scheduler: convert xxx_var_store to void
On Mon, Aug 28, 2017 at 10:00:46AM -0600, Jens Axboe wrote: > On 08/28/2017 06:22 AM, weiping zhang wrote: > > On Fri, Aug 25, 2017 at 01:11:33AM +0800, weiping zhang wrote: > >> The last parameter "count" never be used in xxx_var_store, > >> convert these functions to void. > >> > > > > Would you please look this misc patch at your convenience ? > > Looks fine. But please don't send reminders so shortly after sending a > patch, especially when it's just a cleanup. This was received Thursday. > > -- > Jens Axboe > Got it, Thanks! ^_^
Re: [GIT PULL] nvme update for Linux 4.14
On 08/28/2017 09:59 AM, Christoph Hellwig wrote: > On Mon, Aug 28, 2017 at 09:55:55AM -0600, Jens Axboe wrote: >>> It was based on the nvme-3.13 tree from about two weeks ago, >>> which was based on your for-linus at about that time. >> >> Changes for 4.14 should be against the 4.14 branch. for-linus is >> for 4.13. In any case, there will be a conflict against master... > > We would have otherwise had conflicts against changes in Linus' > tree for 4.13, so we were between a rock and a hard place here. But one rock is shittier than the other. If you based it against 4.14, then at least I only have to resolve the problem once. The real problem is that the patch that went into master had cleanups too. You should push back against that, since it will inevitably cause problems by diverging 4.13 and 4.14 more than absolutely necessary. Having conflicts in nvme for the block tree is not isolated to this series, unfortunately. -- Jens Axboe
Re: [PATCH] block, scheduler: convert xxx_var_store to void
On 08/28/2017 06:22 AM, weiping zhang wrote: > On Fri, Aug 25, 2017 at 01:11:33AM +0800, weiping zhang wrote: >> The last parameter "count" never be used in xxx_var_store, >> convert these functions to void. >> > > Would you please look this misc patch at your convenience ? Looks fine. But please don't send reminders so shortly after sending a patch, especially when it's just a cleanup. This was received Thursday. -- Jens Axboe
Re: [GIT PULL] nvme update for Linux 4.14
On Mon, Aug 28, 2017 at 09:55:55AM -0600, Jens Axboe wrote: > > It was based on the nvme-3.13 tree from about two weeks ago, > > which was based on your for-linus at about that time. > > Changes for 4.14 should be against the 4.14 branch. for-linus is > for 4.13. In any case, there will be a conflict against master... We would have otherwise had conflicts against changes in Linus' tree for 4.13, so we were between a rock and a hard place here.
Re: [GIT PULL] nvme update for Linux 4.14
On 08/28/2017 09:53 AM, Christoph Hellwig wrote: > On Mon, Aug 28, 2017 at 09:51:38AM -0600, Jens Axboe wrote: >> What is this against, exactly? It doesn't merge cleanly with my 4.14 branch. >> From a quick look, looks like it's conflicting with: > > It was based on the nvme-3.13 tree from about two weeks ago, > which was based on your for-linus at about that time. Changes for 4.14 should be against the 4.14 branch. for-linus is for 4.13. In any case, there will be a conflict against master... -- Jens Axboe
Re: [GIT PULL] nvme update for Linux 4.14
On Mon, Aug 28, 2017 at 09:51:38AM -0600, Jens Axboe wrote: > What is this against, exactly? It doesn't merge cleanly with my 4.14 branch. > From a quick look, looks like it's conflicting with: It was based on the nvme-3.13 tree from about two weeks ago, which was based on your for-linus at about that time.
Re: [GIT PULL] nvme update for Linux 4.14
On 08/28/2017 08:33 AM, Christoph Hellwig wrote: > Hi Jens, > > below is the current set of NVMe updates for Linux 4.14. > > The biggest bit comes from Sagi and refactors the RDMA driver to prepare > for more code sharing in the setup and teardown path. But we have various > features and bug fixes from a lot of people as well. > > The following changes since commit 16a5a480f067f945fd27bf91ffdce3f959b0d4b6: > > nvmet-fc: correct use after free on list teardown (2017-08-16 10:06:18 > +0200) > > are available in the git repository at: > > git://git.infradead.org/nvme.git nvme-4.14 > > for you to fetch changes up to 16edd3d0272368eee5cac13356787c1f4d4eb255: > > nvme: honor RTD3 Entry Latency for shutdowns (2017-08-28 16:29:29 +0200) What is this against, exactly? It doesn't merge cleanly with my 4.14 branch. From a quick look, looks like it's conflicting with: commit d09f2b45f346f0a9e5e1b5fcea531b1b393671dc Author: Sagi Grimberg Date: Sun Jul 2 10:56:43 2017 +0300 nvme: split nvme_uninit_ctrl into stop and uninit which went into master after for-4.14/block was forked off. I can fix it up, but would be best if you did it. -- Jens Axboe
FUA, PREFLUSH, VWC redux
I'm hoping somebody can clarify the correct PREFLUSH and FUA behavior from a block stack client. I'm starting to suspect that some of my devices are failing to inform the block stack correctly of their capabilities - leading to expensive and unnecessary flushes. My understanding is this: * If I require confirmation that a single write is durable upon completion, attach REQ_FUA to the bio. * If I require that the current write plus all previously completed writes be durable as of completion of the current write, attach REQ_FUA and REQ_PREFLUSH to the bio. * Both REQ_FUA and REQ_PREFLUSH should be NOPs on a device with a non-volatile write cache. But I have a small problem - I don't know which write is last, until I receive a Commit call. That means that at Commit time, I should send a "FLUSH". I am calling a function based on blkdev_issue_flush(), but this causes a big performance penalty even though my nvme ssd has VWC==0. Does this make sense? Perhaps blkdev_issue_flush() is intended to flush caches even if they are non-volatile? What is the "correct" way to get a "REQ_PREFLUSH" with a zero length write, if not via blkdev_issue_flush()? Thanks for any tips, John Groves
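For what it's worth, the flush machinery in block/blk-flush.c decomposes REQ_PREFLUSH/REQ_FUA according to what the queue advertises, and when no volatile write cache is flagged both bits should indeed reduce to a plain write. A rough userspace sketch of that decision, modeled on blk_flush_policy() — the names and simplifications here are illustrative, not the kernel's exact code:

```python
# Illustrative model of the kernel's flush-policy decision (block/blk-flush.c).
REQ_PREFLUSH = 1 << 0
REQ_FUA      = 1 << 1

FSEQ_DATA, FSEQ_PREFLUSH, FSEQ_POSTFLUSH = 1, 2, 4

def flush_policy(queue_has_wc, queue_has_fua, op_flags, nr_sectors):
    """Decide which flush steps a request actually needs."""
    policy = 0
    if nr_sectors:
        policy |= FSEQ_DATA
    if queue_has_wc:                       # only a volatile cache needs flushing
        if op_flags & REQ_PREFLUSH:
            policy |= FSEQ_PREFLUSH
        if (op_flags & REQ_FUA) and not queue_has_fua:
            policy |= FSEQ_POSTFLUSH       # emulate FUA with a post-flush
    return policy

# A device reporting no writeback cache (VWC==0): both flags collapse away.
print(flush_policy(False, False, REQ_PREFLUSH | REQ_FUA, 8))  # -> 1 (data only)
```

So if a zero-length flush on a VWC==0 NVMe device still costs real time, the write-cache flag the driver advertised to the block layer is the first thing worth checking.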
[GIT PULL] nvme update for Linux 4.14
Hi Jens, below is the current set of NVMe updates for Linux 4.14. The biggest bit comes from Sagi and refactors the RDMA driver to prepare for more code sharing in the setup and teardown path. But we have various features and bug fixes from a lot of people as well. The following changes since commit 16a5a480f067f945fd27bf91ffdce3f959b0d4b6: nvmet-fc: correct use after free on list teardown (2017-08-16 10:06:18 +0200) are available in the git repository at: git://git.infradead.org/nvme.git nvme-4.14 for you to fetch changes up to 16edd3d0272368eee5cac13356787c1f4d4eb255: nvme: honor RTD3 Entry Latency for shutdowns (2017-08-28 16:29:29 +0200) Arnav Dawn (2): nvme: add support for FW activation without reset nvme: define NVME_NSID_ALL Christoph Hellwig (1): nvmet: use NVME_NSID_ALL Guan Junxiong (2): nvmet: fix the return error code of target if host is not allowed nvme-fabrics: log a warning if hostid is invalid James Smart (2): nvme-fc: Reattach to localports on re-registration nvmet-fc: simplify sg list handling Jan H. Schönherr (1): nvme: fix uninitialized prp2 value on small transfers Johannes Thumshirn (2): nvmet-fcloop: remove ALL_OPTS define nvme-rdma: remove NVME_RDMA_MAX_SEGMENT_SIZE Jon Derrick (1): nvme: add support for NVMe 1.3 Timestamp Feature Martin K. 
Petersen (1): nvme: honor RTD3 Entry Latency for shutdowns Martin Wilck (2): string.h: add memcpy_and_pad() nvmet: use memcpy_and_pad for identify sn/fr Max Gurtovoy (3): nvme: add symbolic constants for CC identifiers nvme: rename AMS symbolic constants to fit specification nvme-rdma: Use unlikely macro in the fast path Sagi Grimberg (13): nvme-rdma: move nvme_rdma_configure_admin_queue code location nvme: Add admin_tagset pointer to nvme_ctrl nvme-rdma: move tagset allocation to a dedicated routine nvme-rdma: disable the controller on resets nvme-rdma: don't free tagset on resets nvme-rdma: reuse configure/destroy_admin_queue nvme-rdma: introduce configure/destroy io queues nvme-rdma: stop queues instead of simply flipping their state nvme-rdma: rename nvme_rdma_init_queue to nvme_rdma_alloc_queue nvme-rdma: introduce nvme_rdma_start_queue nvme-rdma: cleanup error path in controller reset nvme-rdma: call ops->reg_read64 instead of nvmf_reg_read64 nvme: fix identify namespace logging drivers/nvme/host/core.c| 120 - drivers/nvme/host/fabrics.c | 1 + drivers/nvme/host/fc.c | 145 --- drivers/nvme/host/nvme.h| 4 + drivers/nvme/host/pci.c | 5 +- drivers/nvme/host/rdma.c| 559 drivers/nvme/target/admin-cmd.c | 16 +- drivers/nvme/target/configfs.c | 2 +- drivers/nvme/target/core.c | 15 +- drivers/nvme/target/fc.c| 48 +--- drivers/nvme/target/fcloop.c| 3 - drivers/nvme/target/loop.c | 1 + include/linux/nvme-fc-driver.h | 2 +- include/linux/nvme.h| 37 ++- include/linux/string.h | 30 +++ 15 files changed, 600 insertions(+), 388 deletions(-)
Re: [PATCH 07/10] nvme: track shared namespaces
On Mon, Aug 28, 2017 at 08:41:23PM +0800, Guan Junxiong wrote: > If a namespace can be accessed by another subsystem, the above line > will ignore such namespace. And that's intentional. > Or does the NVMe/NVMf specification constrain that any namespace > can only be accessed by a subsystem? Yes, inside the NVMe spec a Namespace is contained inside a Subsystem. That doesn't preclude other ways to access the LBAs, but they are outside the specification (e.g. also exporting them as SCSI).
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
On Mon, Aug 28, 2017 at 04:40:43PM +0300, Sagi Grimberg wrote: > I thought your multipathing code would really live in the block > layer and only require something really basic from nvme (which could > easily be applied on other drivers). But I do understand it might > create a lot of churn. The earlier versions did a lot more in common code, but I gradually moved away from that: - first I didn't have a separate queue, but just bounced I/O between sibling queues. So there was no new make_request based queue, and we had to track the relations in the block layer, with a callback to check the path status - I got rid of the non-trivial path selector So in the end very little block layer code remained. But then again very little nvme code remained either.. > btw, why are partial completions something that can't be done > without cloning the bio? is it possible to clone the bio once from the > completion flow when you see that you got a partial completion? The problem with partial completions is that blk_update_request completes bios as soon as it gets enough bytes to finish them. This should not be an unsolvable problem, but it will be a bit messy at least. But then again I hope that no new protocols will be designed with partial completions - SCSI is pretty special in that regard.
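To illustrate why cloning only at completion time doesn't work: blk_update_request() retires leading bios as soon as the completed byte count covers them, so by the time a partial completion is observed those bios are already gone unless they were cloned up front. A toy model of that behavior (illustrative only, not kernel code):

```python
def update_request(bio_sizes, nr_bytes):
    """Retire leading bios once the completed byte count covers them,
    the way blk_update_request() does; return (bios_done, leftover)."""
    done = 0
    for size in bio_sizes:
        if nr_bytes < size:
            break
        nr_bytes -= size
        done += 1        # this bio is completed (and freed) immediately
    return done, nr_bytes

# 1024 bytes complete on a request built from three bios:
print(update_request([512, 1024, 512], 1024))  # -> (1, 512): first bio already gone
```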
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
the make_request That basically just does a lookup (in a data structure we need for tracking the siblings inside nvme anyway) and then submits the I/O again. The generic code all is in generic_make_request_fast. There are two more things which could move into common code, one reasonable, the other not: (1) the whole requeue_list list logic. I thought about moving this to the block layer as I'd also have some other use cases for it, but decided to wait until those materialize to see if I'd really need it. (2) turning the logic inside out, e.g. providing a generic make_request and supplying a find_path callback to it. This would require (1) but even then we'd save basically no code, add an indirect call in the fast path and make things harder to read. This actually is how I started out and didn't like it. and failover logic The big thing about the failover logic is looking at the detailed error codes and changing the controller state, all of which is fundamentally driver specific. The only thing that could reasonably be common is the requeue_list handling as mentioned above. Note that this will become even more integrated with the nvme driver once ANA support lands. Not arguing with you at all, you obviously gave it a lot of thought. I thought your multipathing code would really live in the block layer and only require something really basic from nvme (which could easily be applied on other drivers). But I do understand it might create a lot of churn. btw, why are partial completions something that can't be done without cloning the bio? is it possible to clone the bio once from the completion flow when you see that you got a partial completion?
Re: I/O hangs after resuming from suspend-to-ram
On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote: > Ming Lei - 28.08.17, 20:58: > > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote: > > > Hi. > > > > > > Here is disk setup for QEMU VM: > > > > > > === > > > [root@archmq ~]# smartctl -i /dev/sda > > > … > > > Device Model: QEMU HARDDISK > > > Serial Number:QM1 > > > Firmware Version: 2.5+ > > > User Capacity:4,294,967,296 bytes [4.29 GB] > > > Sector Size: 512 bytes logical/physical > > > Device is:Not in smartctl database [for details use: -P showall] > > > ATA Version is: ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000 > > > Local Time is:Sun Aug 27 09:31:54 2017 CEST > > > SMART support is: Available - device has SMART capability. > > > SMART support is: Enabled > > > > > > [root@archmq ~]# lsblk > > > NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT > > > sda 8:004G 0 disk > > > `-sda18:104G 0 part > > > > > > `-md0 9:004G 0 raid10 > > > > > > `-system253:004G 0 crypt > > > > > > |-system-boot 253:10 512M 0 lvm/boot > > > |-system-swap 253:20 512M 0 lvm[SWAP] > > > > > > `-system-root 253:303G 0 lvm/ > > > > > > sdb 8:16 04G 0 disk > > > `-sdb18:17 04G 0 part > > > > > > `-md0 9:004G 0 raid10 > > > > > > `-system253:004G 0 crypt > > > > > > |-system-boot 253:10 512M 0 lvm/boot > > > |-system-swap 253:20 512M 0 lvm[SWAP] > > > > > > `-system-root 253:303G 0 lvm/ > > > > > > sr0 11:01 1024M 0 rom > > > > > > [root@archmq ~]# mdadm --misc --detail /dev/md0 > > > > > > /dev/md0: > > > Version : 1.2 > > > > > > Creation Time : Sat Jul 29 16:37:05 2017 > > > > > > Raid Level : raid10 > > > Array Size : 4191232 (4.00 GiB 4.29 GB) > > > > > > Used Dev Size : 4191232 (4.00 GiB 4.29 GB) > > > > > >Raid Devices : 2 > > > > > > Total Devices : 2 > > > > > > Persistence : Superblock is persistent > > > > > > Update Time : Sun Aug 27 09:30:33 2017 > > > > > > State : clean > > > > > > Active Devices : 2 > > > > > > Working Devices : 2 > > > > > > Failed Devices : 0 > > > > > > Spare 
Devices : 0 > > > > > > Layout : far=2 > > > > > > Chunk Size : 512K > > > > > >Name : archiso:0 > > >UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e > > > > > > Events : 485 > > > > > > Number Major Minor RaidDevice State > > > > > >0 810 active sync /dev/sda1 > > >1 8 171 active sync /dev/sdb1 > > > > > > === > > > > > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it, > > > then > > > LVM, then ext4 for boot, swap and btrfs for /. > > > > > > I couldn't reproduce the issue with single disk without RAID. > > > > Could you verify if the following patch fixes your issue? > > Could this also apply to non MD RAID systems? I am using BTRFS RAID > 1 with two SSDs. So far with CFQ it runs stable. It is for fixing Oleksandr's issue wrt. blk-mq, and looks not for you. -- Ming
Re: I/O hangs after resuming from suspend-to-ram
Ming Lei - 28.08.17, 20:58: > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote: > > Hi. > > > > Here is disk setup for QEMU VM: > > > > === > > [root@archmq ~]# smartctl -i /dev/sda > > … > > Device Model: QEMU HARDDISK > > Serial Number:QM1 > > Firmware Version: 2.5+ > > User Capacity:4,294,967,296 bytes [4.29 GB] > > Sector Size: 512 bytes logical/physical > > Device is:Not in smartctl database [for details use: -P showall] > > ATA Version is: ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000 > > Local Time is:Sun Aug 27 09:31:54 2017 CEST > > SMART support is: Available - device has SMART capability. > > SMART support is: Enabled > > > > [root@archmq ~]# lsblk > > NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT > > sda 8:004G 0 disk > > `-sda18:104G 0 part > > > > `-md0 9:004G 0 raid10 > > > > `-system253:004G 0 crypt > > > > |-system-boot 253:10 512M 0 lvm/boot > > |-system-swap 253:20 512M 0 lvm[SWAP] > > > > `-system-root 253:303G 0 lvm/ > > > > sdb 8:16 04G 0 disk > > `-sdb18:17 04G 0 part > > > > `-md0 9:004G 0 raid10 > > > > `-system253:004G 0 crypt > > > > |-system-boot 253:10 512M 0 lvm/boot > > |-system-swap 253:20 512M 0 lvm[SWAP] > > > > `-system-root 253:303G 0 lvm/ > > > > sr0 11:01 1024M 0 rom > > > > [root@archmq ~]# mdadm --misc --detail /dev/md0 > > > > /dev/md0: > > Version : 1.2 > > > > Creation Time : Sat Jul 29 16:37:05 2017 > > > > Raid Level : raid10 > > Array Size : 4191232 (4.00 GiB 4.29 GB) > > > > Used Dev Size : 4191232 (4.00 GiB 4.29 GB) > > > >Raid Devices : 2 > > > > Total Devices : 2 > > > > Persistence : Superblock is persistent > > > > Update Time : Sun Aug 27 09:30:33 2017 > > > > State : clean > > > > Active Devices : 2 > > > > Working Devices : 2 > > > > Failed Devices : 0 > > > > Spare Devices : 0 > > > > Layout : far=2 > > > > Chunk Size : 512K > > > >Name : archiso:0 > >UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e > > > > Events : 485 > > > > Number Major Minor RaidDevice State > > > >0 810 active sync 
/dev/sda1 > >1 8 171 active sync /dev/sdb1 > > > > === > > > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it, > > then > > LVM, then ext4 for boot, swap and btrfs for /. > > > > I couldn't reproduce the issue with single disk without RAID. > > Could you verify if the following patch fixes your issue? Could this also apply to non MD RAID systems? I am using BTRFS RAID 1 with two SSDs. So far with CFQ it runs stable. Thanks, Martin > From 9fa53d708ebc1d5b87e62e542dc54272529da244 Mon Sep 17 00:00:00 2001 > From: Ming Lei> Date: Mon, 28 Aug 2017 19:59:08 +0800 > Subject: [PATCH] blk-mq: align to legacy path for implementing > blk_execute_rq > > In legacy path, when one request is run via blk_execute_rq(), > it is added to q->queue_head directly, and I/O scheduler's > queue is bypassed because either merging or sorting isn't > needed. > > When SCSI device is put into quiece state, such as during > system suspend, we need to add the request of RQF_PM into > head of the queue. > > This patch fixes I/O hang after system resume. 
> > Reported-by: Oleksandr Natalenko > Signed-off-by: Ming Lei > --- > block/blk-core.c | 2 +- > block/blk-exec.c | 2 +- > block/blk-flush.c| 2 +- > block/blk-mq-sched.c | 58 > block/blk-mq-sched.h | > 2 ++ > 5 files changed, 63 insertions(+), 3 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index dbecbf4a64e0..fb75bc646ebc 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2330,7 +2330,7 @@ blk_status_t blk_insert_cloned_request(struct > request_queue *q, struct request * if (q->mq_ops) { > if (blk_queue_io_stat(q)) > blk_account_io_start(rq, true); > - blk_mq_sched_insert_request(rq, false, true, false, false); > + blk_mq_sched_insert_request_bypass(rq, false, true, false, > false); > return BLK_STS_OK; > } > > diff --git a/block/blk-exec.c b/block/blk-exec.c > index 5c0f3dc446dc..4565aa6bb624 100644 > --- a/block/blk-exec.c > +++ b/block/blk-exec.c > @@ -61,7 +61,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct > gendisk *bd_disk, * be reused after dying flag is set >*/ >
Re: I/O hangs after resuming from suspend-to-ram
On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote: > Hi. > > Here is disk setup for QEMU VM: > > === > [root@archmq ~]# smartctl -i /dev/sda > … > Device Model: QEMU HARDDISK > Serial Number:QM1 > Firmware Version: 2.5+ > User Capacity:4,294,967,296 bytes [4.29 GB] > Sector Size: 512 bytes logical/physical > Device is:Not in smartctl database [for details use: -P showall] > ATA Version is: ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000 > Local Time is:Sun Aug 27 09:31:54 2017 CEST > SMART support is: Available - device has SMART capability. > SMART support is: Enabled > > [root@archmq ~]# lsblk > NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT > sda 8:004G 0 disk > `-sda18:104G 0 part > `-md0 9:004G 0 raid10 > `-system253:004G 0 crypt > |-system-boot 253:10 512M 0 lvm/boot > |-system-swap 253:20 512M 0 lvm[SWAP] > `-system-root 253:303G 0 lvm/ > sdb 8:16 04G 0 disk > `-sdb18:17 04G 0 part > `-md0 9:004G 0 raid10 > `-system253:004G 0 crypt > |-system-boot 253:10 512M 0 lvm/boot > |-system-swap 253:20 512M 0 lvm[SWAP] > `-system-root 253:303G 0 lvm/ > sr0 11:01 1024M 0 rom > > [root@archmq ~]# mdadm --misc --detail /dev/md0 > /dev/md0: > Version : 1.2 > Creation Time : Sat Jul 29 16:37:05 2017 > Raid Level : raid10 > Array Size : 4191232 (4.00 GiB 4.29 GB) > Used Dev Size : 4191232 (4.00 GiB 4.29 GB) >Raid Devices : 2 > Total Devices : 2 > Persistence : Superblock is persistent > > Update Time : Sun Aug 27 09:30:33 2017 > State : clean > Active Devices : 2 > Working Devices : 2 > Failed Devices : 0 > Spare Devices : 0 > > Layout : far=2 > Chunk Size : 512K > >Name : archiso:0 >UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e > Events : 485 > > Number Major Minor RaidDevice State >0 810 active sync /dev/sda1 >1 8 171 active sync /dev/sdb1 > === > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it, then > LVM, then ext4 for boot, swap and btrfs for /. > > I couldn't reproduce the issue with single disk without RAID. 
Could you verify if the following patch fixes your issue? >From 9fa53d708ebc1d5b87e62e542dc54272529da244 Mon Sep 17 00:00:00 2001 From: Ming LeiDate: Mon, 28 Aug 2017 19:59:08 +0800 Subject: [PATCH] blk-mq: align to legacy path for implementing blk_execute_rq In legacy path, when one request is run via blk_execute_rq(), it is added to q->queue_head directly, and I/O scheduler's queue is bypassed because either merging or sorting isn't needed. When SCSI device is put into quiece state, such as during system suspend, we need to add the request of RQF_PM into head of the queue. This patch fixes I/O hang after system resume. Reported-by: Oleksandr Natalenko Signed-off-by: Ming Lei --- block/blk-core.c | 2 +- block/blk-exec.c | 2 +- block/blk-flush.c| 2 +- block/blk-mq-sched.c | 58 block/blk-mq-sched.h | 2 ++ 5 files changed, 63 insertions(+), 3 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index dbecbf4a64e0..fb75bc646ebc 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2330,7 +2330,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * if (q->mq_ops) { if (blk_queue_io_stat(q)) blk_account_io_start(rq, true); - blk_mq_sched_insert_request(rq, false, true, false, false); + blk_mq_sched_insert_request_bypass(rq, false, true, false, false); return BLK_STS_OK; } diff --git a/block/blk-exec.c b/block/blk-exec.c index 5c0f3dc446dc..4565aa6bb624 100644 --- a/block/blk-exec.c +++ b/block/blk-exec.c @@ -61,7 +61,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk, * be reused after dying flag is set */ if (q->mq_ops) { - blk_mq_sched_insert_request(rq, at_head, true, false, false); + blk_mq_sched_insert_request_bypass(rq, at_head, true, false, false); return; } diff --git a/block/blk-flush.c b/block/blk-flush.c index ed5fe322abba..51e89e5c525a 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq) if ((policy & REQ_FSEQ_DATA) && 
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) { if
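The effect of the patch can be pictured with a small model: while a SCSI device is quiesced for suspend/resume, only RQF_PM requests are allowed to run, so the resume command must land at the head of the dispatch queue directly rather than sit in the I/O scheduler behind normal I/O. A hypothetical sketch — the names mimic the kernel's, the dispatch logic is heavily simplified:

```python
RQF_PM = 1 << 0

def insert_request(queue_head, rq, bypass_to_head):
    """Model of the bypass insert: power-management requests go to the
    head of q->queue_head instead of through the I/O scheduler."""
    if bypass_to_head:
        queue_head.insert(0, rq)
    else:
        queue_head.append(rq)

def next_runnable(queue_head, quiesced):
    """While quiesced, only RQF_PM requests may be dispatched."""
    for rq in queue_head:
        if not quiesced or (rq["flags"] & RQF_PM):
            return rq["name"]
    return None   # nothing dispatchable: the observed I/O hang

q = []
insert_request(q, {"name": "fs-write", "flags": 0}, bypass_to_head=False)
insert_request(q, {"name": "resume-pm", "flags": RQF_PM}, bypass_to_head=True)
print(next_runnable(q, quiesced=True))  # -> resume-pm
```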
Re: [PATCH 07/10] nvme: track shared namespaces
On 2017/8/24 1:58, Christoph Hellwig wrote: > +static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid, > + struct nvme_id_ns *id) > +{ > + struct nvme_ctrl *ctrl = ns->ctrl; > + bool is_shared = id->nmic & (1 << 0); > + struct nvme_ns_head *head = NULL; > + int ret = 0; > + > + mutex_lock(&ctrl->subsys->lock); > + if (is_shared) > + head = __nvme_find_ns_head(ctrl->subsys, nsid); If a namespace can be accessed by another subsystem, the above line will ignore such namespace. Or does the NVMe/NVMf specification constrain that any namespace can only be accessed by a subsystem? More comments after testing will be sent later. Thanks Guan Junxiong
Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
That being said ->make_request basically doesn't care about actual limits at all, it mostly cares about support features (e.g. discard, fua, etc). So I think a lot of these limits could probably be lifted, but I'd need to add the check to generic_make_request_checks back. Different virt_boundary capabilities will trigger bio splits which can make make_request blocking (due to lack of tags). All the bio splitting is done in blk_queue_split, and other things related to limits are done even later in blk_mq_make_request when building the request. For normal make_request based stacking drivers nothing of this matters, although a few drivers do call blk_queue_split manually from their make_request method. Maybe I misunderstood the change log comment, but didn't it say that the caller is not allowed to submit a bio that might split?
Re: [PATCH] block, scheduler: convert xxx_var_store to void
On Fri, Aug 25, 2017 at 01:11:33AM +0800, weiping zhang wrote: > The last parameter "count" never be used in xxx_var_store, > convert these functions to void. > > Signed-off-by: weiping zhang> --- > block/bfq-iosched.c | 33 + > block/cfq-iosched.c | 13 ++--- > block/deadline-iosched.c | 9 - > block/mq-deadline.c | 9 - > 4 files changed, 31 insertions(+), 33 deletions(-) > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c > index 436b6ca..7a4085d 100644 > --- a/block/bfq-iosched.c > +++ b/block/bfq-iosched.c > @@ -4787,16 +4787,13 @@ static ssize_t bfq_var_show(unsigned int var, char > *page) > return sprintf(page, "%u\n", var); > } > > -static ssize_t bfq_var_store(unsigned long *var, const char *page, > - size_t count) > +static void bfq_var_store(unsigned long *var, const char *page) > { > unsigned long new_val; > int ret = kstrtoul(page, 10, &new_val); > > if (ret == 0) > *var = new_val; > - > - return count; > } > > #define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \ > @@ -4838,7 +4835,7 @@ __FUNC(struct elevator_queue *e, const char *page, > size_t count)\ > {\ > struct bfq_data *bfqd = e->elevator_data; \ > unsigned long uninitialized_var(__data);\ > - int ret = bfq_var_store(&__data, (page), count);\ > + bfq_var_store(&__data, (page)); \ > if (__data < (MIN)) \ > __data = (MIN); \ > else if (__data > (MAX))\ > @@ -4849,7 +4846,7 @@ __FUNC(struct elevator_queue *e, const char *page, > size_t count)\ > *(__PTR) = (u64)__data * NSEC_PER_MSEC; \ > else\ > *(__PTR) = __data; \ > - return ret; \ > + return count; \ > } Hi Jens, Would you please look this misc patch at your convenience ? Thanks
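The conversion boils down to this contract: the helper's return value was always the caller's untouched count, so parsing may fail silently, the stored value is simply left alone, and the STORE macro returns count itself. The same shape in a quick Python analogue (hypothetical names, just to show the contract):

```python
def var_store(store, key, page):
    """Void-style store: update the value only when the input parses;
    on failure leave it untouched, like the kstrtoul path in the patch."""
    try:
        store[key] = int(page.strip(), 10)
    except ValueError:
        pass   # the STORE_FUNCTION wrapper still returns `count` regardless

settings = {"slice_idle": 8}
var_store(settings, "slice_idle", "16\n")
var_store(settings, "slice_idle", "garbage")   # ignored, value stays 16
print(settings["slice_idle"])  # -> 16
```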
Re: [PATCH 07/10] nvme: track shared namespaces
> On 23 Aug 2017, at 19.58, Christoph Hellwig wrote: > > Introduce a new struct nvme_ns_head [1] that holds information about > an actual namespace, unlike struct nvme_ns, which only holds the > per-controller namespace information. For private namespaces there > is a 1:1 relation of the two, but for shared namespaces this lets us > discover all the paths to it. For now only the identifiers are moved > to the new structure, but most of the information in struct nvme_ns > should eventually move over. > > To allow lockless path lookup the list of nvme_ns structures per > nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns > structure through call_srcu. > > [1] comments welcome if you have a better name for it, the current one is >horrible. One idea would be to rename the current struct nvme_ns >to struct nvme_ns_link or similar and use the nvme_ns name for the >new structure. But that would involve a lot of churn. > > Signed-off-by: Christoph Hellwig > --- > drivers/nvme/host/core.c | 218 +++ > drivers/nvme/host/lightnvm.c | 14 +-- Nothing big here. Looks good. Reviewed-by: Javier González
Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
On Mon, Aug 28, 2017 at 02:01:16PM +0300, Sagi Grimberg wrote: >> That being said ->make_request basically doesn't care about actual >> limits at all, it mostly cares about support features (e.g. discard, fua, >> etc). So I think a lot of these limits could probably be lifted, >> but I'd need to add the check to generic_make_request_checks back. > > Different virt_boundary capabilities will trigger bio splits which > can make make_request blocking (due to lack of tags). All the bio splitting is done in blk_queue_split, and other things related to limits are done even later in blk_mq_make_request when building the request. For normal make_request based stacking drivers nothing of this matters, although a few drivers do call blk_queue_split manually from their make_request method.
Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
This helper allows reinserting a bio into a new queue without much overhead, but requires all queue limits to be the same for the upper and lower queues, and it does not provide any recursion preventions, which requires the caller to not split the bio. Isn't the same limits constraint too restrictive? Say I have two paths to the same namespace via two different HBAs, each with its own virt_boundary capability for example? That would require us to split failover bio wouldn't it? Uh oh - different transports for the same subsystem will be interesting. For one it's not specified anywhere so I'd like to kick off a discussion on the working group mailing list about it. Indeed that would be interesting, but I wasn't referring to different transports, even the same transport can have different capabilities (for example CX4 and CX4 devices on the same host). That being said ->make_request basically doesn't care about actual limits at all, it mostly cares about support features (e.g. discard, fua, etc). So I think a lot of these limits could probably be lifted, but I'd need to add the check to generic_make_request_checks back. Different virt_boundary capabilities will trigger bio splits which can make make_request blocking (due to lack of tags).
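For context on why differing virt_boundary masks matter: the block layer forces a split wherever two adjacent segments straddle the device's virtual boundary, so a path with a stricter mask may have to split a bio the original path accepted whole. A simplified model of that check, loosely inspired by bvec_gap_to_prev() — the exact predicate and names here are assumptions for illustration:

```python
def gap_to_prev(boundary_mask, prev_end, next_offset):
    """A gap exists when the previous segment does not end on the virt
    boundary or the next one does not start on it."""
    if not boundary_mask:
        return False
    return bool((prev_end | next_offset) & boundary_mask)

def count_splits(boundary_mask, segments):
    """segments: list of (offset, length) pairs; count boundary-forced splits."""
    splits = 0
    for i in range(1, len(segments)):
        prev_off, prev_len = segments[i - 1]
        if gap_to_prev(boundary_mask, prev_off + prev_len, segments[i][0]):
            splits += 1
    return splits

BOUNDARY_4K = 0xFFF
print(count_splits(BOUNDARY_4K, [(0, 4096), (0, 4096)]))  # -> 0, fits one request
print(count_splits(BOUNDARY_4K, [(0, 1024), (0, 4096)]))  # -> 1, must be split
```

Under this model a failover path advertising a larger boundary mask can reject the un-split bio, which is exactly the concern raised above.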
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
On Mon, Aug 28, 2017 at 10:23:33AM +0300, Sagi Grimberg wrote: > This is really taking a lot into the nvme driver. This patch adds about 100 lines of code, out of which 1/4th is the parsing of status codes. I don't think it's all that much.. > I'm not sure if > this approach will be used in other block driver, but would it > make sense to place the block_device node creation, I really want to reduce the amount of boilerplate code for creating a block queue and I have some ideas for that. But that's not in any way related to this multi-path code. > the make_request That basically just does a lookup (in a data structure we need for tracking the siblings inside nvme anyway) and then submits the I/O again. The generic code all is in generic_make_request_fast. There are two more things which could move into common code, one reasonable, the other not: (1) the whole requeue_list list logic. I thought about moving this to the block layer as I'd also have some other use cases for it, but decided to wait until those materialize to see if I'd really need it. (2) turning the logic inside out, e.g. providing a generic make_request and supplying a find_path callback to it. This would require (1) but even then we'd save basically no code, add an indirect call in the fast path and make things harder to read. This actually is how I started out and didn't like it. > and failover logic The big thing about the failover logic is looking at the detailed error codes and changing the controller state, all of which is fundamentally driver specific. The only thing that could reasonably be common is the requeue_list handling as mentioned above. Note that this will become even more integrated with the nvme driver once ANA support lands. > and maybe the path selection in the block layer > leaving just the construction of the path mappings in nvme? The point is that path selection fundamentally depends on your actual protocol. E.g. 
for NVMe we now look at the controller state as that's what our keep alive work on, and what reflects that status in PCIe. We could try to communicate this up, but it would just lead to more data structure that get out of sync. And this will become much worse with ANA where we have even more fine grained state. In the end path selection right now is less than 20 lines of code. Path selection is complicated when you want non-trivial path selector algorithms like round robin or service time, but I'd really avoid having those on nvme - we already have multiple queues to go beyong the limits of a single queue and use blk-mq for that, and once we're beyond the limits of a single transport path for performance reasons I'd much rather rely on ANA, numa nodes or static partitioning of namepsaces than trying to do dynamic decicions in the fast path. But if I don't get what I want there and we'll need more complicated algorithms they absolutely should be in common code. In fact one of my earlier protoypes moved the dm-mpath path selector algorithms to core block code and tried to use them - it just turned out enabling them was pretty much pointless.
Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests
I still disagree that we should have an indirect function call like
this in the fast path.

All that can be done by clearing or setting a flag on the first call
to ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP
for that, but for SCSI we probably can't use that given that it has
more meaning for the old request path.  But how about just adding a
new RQD_DRV_INITIALIZED or similar flag?
Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
On Mon, Aug 28, 2017 at 10:00:35AM +0300, Sagi Grimberg wrote:
>> This helper allows reinserting a bio into a new queue without much
>> overhead, but requires all queue limits to be the same for the upper
>> and lower queues, and it does not provide any recursion preventions,
>> which requires the caller to not split the bio.
>
> Isn't the same limits constraint too restrictive?
>
> Say I have two paths to the same namespace via two different HBAs, each
> with its own virt_boundary capability for example? That would require us
> to split the failover bio wouldn't it?

Uh oh - different transports for the same subsystem will be
interesting.  For one it's not specified anywhere, so I'd like to kick
off a discussion on the working group mailing list about it.

That being said ->make_request basically doesn't care about actual
limits at all, it mostly cares about supported features (e.g. discard,
fua, etc).  So I think a lot of these limits could probably be lifted,
but I'd need to add the checks in generic_make_request_checks back.

>> +/*
>> + * Fast-path version of generic_make_request.
>
> generic_make_request is also called in the fast-path, maybe reword it
> to: "Fast version of generic_make_request"

Yeah.  Maybe generic_make_request_direct or direct_make_request is a
better name as it describes the recursion avoidance bypassing a little
better.
Re: [PATCH 07/10] nvme: track shared namespaces
On Mon, Aug 28, 2017 at 09:51:14AM +0300, Sagi Grimberg wrote:
>> To allow lockless path lookup the list of nvme_ns structures per
>> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
>> structure through call_srcu.
>
> I haven't read the later patches yet, but what requires sleep in the
> path selection?

->make_request is allowed to sleep, and often will.

>> +	} else {
>> +		u8 eui64[8] = { 0 }, nguid[16] = { 0 };
>> +		uuid_t uuid = uuid_null;
>> +
>> +		nvme_report_ns_ids(ctrl, nsid, id, eui64, nguid, &uuid);
>> +		if (!uuid_equal(&head->uuid, &uuid) ||
>> +		    memcmp(&head->nguid, &nguid, sizeof(head->nguid)) ||
>> +		    memcmp(&head->eui64, &eui64, sizeof(head->eui64))) {
>
> Suggestion, given that this matching pattern returns in several places
> would it be better to move it to nvme_ns_match_id()?

I'll look into it.  Maybe we'll need a nvme_ns_ids structure to avoid
having tons of parameters, though.

>> +/*
>> + * Anchor structure for namespaces.  There is one for each namespace in a
>> + * NVMe subsystem that any of our controllers can see, and the namespace
>> + * structure for each controller is chained off it.  For private namespaces
>> + * there is a 1:1 relation to our namespace structures, that is ->list
>> + * only ever has a single entry for private namespaces.
>> + */
>> +struct nvme_ns_head {
>> +	struct list_head	list;
>
> Maybe siblings is a better name than list,
> and the nvme_ns list_head should be called
> sibling_entry (or just sibling)?

Yeah.
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
> This patch adds initial multipath support to the nvme driver.  For
> each namespace we create a new block device node, which can be used
> to access that namespace through any of the controllers that refer
> to it.
>
> Currently we will always send I/O to the first available path, this
> will be changed once the NVMe Asynchronous Namespace Access (ANA) TP
> is ratified and implemented, at which point we will look at the ANA
> state for each namespace.  Another possibility that was prototyped is
> to use the path that is closest to the submitting NUMA node, which
> will be mostly interesting for PCI, but might also be useful for RDMA
> or FC transports in the future.  There is no plan to implement round
> robin or I/O service time path selectors, as those are not scalable
> with the performance rates provided by NVMe.
>
> The multipath device will go away once all paths to it disappear, any
> delay to keep it alive needs to be implemented at the controller
> level.
>
> TODO: implement sysfs interfaces for the new subsystem and
> subsystem-namespace object.  Unless we can come up with something
> better than sysfs here..
>
> Signed-off-by: Christoph Hellwig

Christoph,

This is really taking a lot into the nvme driver.  I'm not sure if
this approach will be used in other block drivers, but would it make
sense to place the block_device node creation, the make_request and
failover logic, and maybe the path selection in the block layer,
leaving just the construction of the path mappings in nvme?
Re: [PATCH 09/10] blk-mq: add a blk_steal_bios helper
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
> This helper allows reinserting a bio into a new queue without much
> overhead, but requires all queue limits to be the same for the upper
> and lower queues, and it does not provide any recursion preventions,
> which requires the caller to not split the bio.

Isn't the same limits constraint too restrictive?

Say I have two paths to the same namespace via two different HBAs, each
with its own virt_boundary capability for example? That would require us
to split the failover bio wouldn't it?

> +/*
> + * Fast-path version of generic_make_request.

generic_make_request is also called in the fast-path, maybe reword it
to: "Fast version of generic_make_request"
Re: [PATCH 07/10] nvme: track shared namespaces
On 23/08/17 20:58, Christoph Hellwig wrote:
> Introduce a new struct nvme_ns_head [1] that holds information about
> an actual namespace, unlike struct nvme_ns, which only holds the
> per-controller namespace information.  For private namespaces there
> is a 1:1 relation of the two, but for shared namespaces this lets us
> discover all the paths to it.  For now only the identifiers are moved
> to the new structure, but most of the information in struct nvme_ns
> should eventually move over.
>
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.

I haven't read the later patches yet, but what requires sleep in the
path selection?

> [1] comments welcome if you have a better name for it, the current
> one is horrible.  One idea would be to rename the current struct
> nvme_ns to struct nvme_ns_link or similar and use the nvme_ns name
> for the new structure.  But that would involve a lot of churn.

maybe nvme_ns_primary?
> Signed-off-by: Christoph Hellwig
> ---
>  drivers/nvme/host/core.c     | 218 +++
>  drivers/nvme/host/lightnvm.c |  14 +--
>  drivers/nvme/host/nvme.h     |  26 +-
>  3 files changed, 208 insertions(+), 50 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 8884000dfbdd..abc5911a8a66 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -249,10 +249,28 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>  }
>  EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
>
> +static void nvme_destroy_ns_head(struct kref *ref)
> +{
> +	struct nvme_ns_head *head =
> +		container_of(ref, struct nvme_ns_head, ref);
> +
> +	list_del_init(&head->entry);
> +	cleanup_srcu_struct(&head->srcu);
> +	kfree(head);
> +}
> +
> +static void nvme_put_ns_head(struct nvme_ns_head *head)
> +{
> +	kref_put(&head->ref, nvme_destroy_ns_head);
> +}
> +
>  static void nvme_free_ns(struct kref *kref)
>  {
>  	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>
> +	if (ns->head)
> +		nvme_put_ns_head(ns->head);
> +
>  	if (ns->ndev)
>  		nvme_nvm_unregister(ns);
>
> @@ -422,7 +440,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
>  {
>  	memset(cmnd, 0, sizeof(*cmnd));
>  	cmnd->common.opcode = nvme_cmd_flush;
> -	cmnd->common.nsid = cpu_to_le32(ns->ns_id);
> +	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
>  }
>
>  static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> @@ -453,7 +471,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
>  	memset(cmnd, 0, sizeof(*cmnd));
>  	cmnd->dsm.opcode = nvme_cmd_dsm;
> -	cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
> +	cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id);
>  	cmnd->dsm.nr = cpu_to_le32(segments - 1);
>  	cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
>
> @@ -492,7 +510,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>  	memset(cmnd, 0, sizeof(*cmnd));
>  	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
> -	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
> +	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
>  	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
>  	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>
> @@ -977,7 +995,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
>  	memset(&c, 0, sizeof(c));
>  	c.rw.opcode = io.opcode;
>  	c.rw.flags = io.flags;
> -	c.rw.nsid = cpu_to_le32(ns->ns_id);
> +	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
>  	c.rw.slba = cpu_to_le64(io.slba);
>  	c.rw.length = cpu_to_le16(io.nblocks);
>  	c.rw.control = cpu_to_le16(io.control);
>
> @@ -1041,7 +1059,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
>  	switch (cmd) {
>  	case NVME_IOCTL_ID:
>  		force_successful_syscall_return();
> -		return ns->ns_id;
> +		return ns->head->ns_id;
>  	case NVME_IOCTL_ADMIN_CMD:
>  		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
>  	case NVME_IOCTL_IO_CMD:
>
> @@ -1248,7 +1266,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>  		return -ENODEV;
>  	}
>
> -	id = nvme_identify_ns(ctrl, ns->ns_id);
> +	id = nvme_identify_ns(ctrl, ns->head->ns_id);
>  	if (!id)
>  		return -ENODEV;
>
> @@ -1257,12 +1275,12 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>  		goto out;
>  	}
>
> -	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
> -	if (!uuid_equal(&ns->uuid, &uuid) ||
> -	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
> -	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
> +	nvme_report_ns_ids(ctrl, ns->head->ns_id, id, eui64, nguid, &uuid);
> +	if (!uuid_equal(&ns->head->uuid, &uuid) ||
> +
Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
>> Shouldn't uuid,nguid,eui64 record the previous values prior to calling
>> nvme_report_ns_ids?
>
> The previous values are still in the namespace structure, and we use
> the on-stack variables for the current ones.

Of course, misread that one, thanks.
Re: [PATCH 06/10] nvme: track subsystems
This looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
On Mon, Aug 28, 2017 at 09:17:44AM +0300, Sagi Grimberg wrote:
>> +	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
>> +	uuid_t uuid = uuid_null;
>>  	int ret = 0;
>>
>>  	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
>> @@ -1252,7 +1254,15 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>>  		goto out;
>>  	}
>>
>> -	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
>> +	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
>> +	if (!uuid_equal(&ns->uuid, &uuid) ||
>> +	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
>> +	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
>
> Shouldn't uuid,nguid,eui64 record the previous values prior to calling
> nvme_report_ns_ids?

The previous values are still in the namespace structure, and we use
the on-stack variables for the current ones.
Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
> Instead validate that these identifiers do not change, as that is
> prohibited by the specification.
>
> Signed-off-by: Christoph Hellwig
> ---
>  drivers/nvme/host/core.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 157dbb7b328d..179ade01745b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1236,6 +1236,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>  	struct nvme_ns *ns = disk->private_data;
>  	struct nvme_ctrl *ctrl = ns->ctrl;
>  	struct nvme_id_ns *id;
> +	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
> +	uuid_t uuid = uuid_null;
>  	int ret = 0;
>
>  	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
> @@ -1252,7 +1254,15 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>  		goto out;
>  	}
>
> -	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
> +	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
> +	if (!uuid_equal(&ns->uuid, &uuid) ||
> +	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
> +	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {

Shouldn't uuid,nguid,eui64 record the previous values prior to calling
nvme_report_ns_ids?
Re: [PATCH 04/10] nvme: remove nvme_revalidate_ns
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 03/10] nvme: remove unused struct nvme_ns fields
Reviewed-by: Sagi Grimberg
Re: [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context
Reviewed-by: Sagi Grimberg
Re: [PATCH 01/10] nvme: report more detailed status codes to the block layer
Looks good, Reviewed-by: Sagi Grimberg