Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues
On 10/20/2016 07:52 AM, Keith Busch wrote:
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index ccd9cc5..078530c 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -201,7 +201,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
>  void nvme_requeue_req(struct request *req)
>  {
> -	blk_mq_requeue_request(req, true);
> +	blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q));
>  }
>  EXPORT_SYMBOL_GPL(nvme_requeue_req);

Hello Keith,

What I had missed while preparing my patch series is that the NVMe driver, unlike the dm driver, can call blk_mq_requeue_request() on a stopped queue. So the above patch is needed to preserve the current semantics of the NVMe code. I will merge this patch into my patch series.

Thanks,

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues
On Wed, Oct 19, 2016 at 04:51:18PM -0700, Bart Van Assche wrote:
> I assume that line 498 in blk-mq.c corresponds to BUG_ON(blk_queued_rq(rq))?
> Anyway, it seems to me like this is a bug in the NVMe code and also that
> this bug is completely unrelated to my patch series. In nvme_complete_rq() I
> see that blk_mq_requeue_request() is called. I don't think this is allowed
> from the context of nvme_cancel_request() because blk_mq_requeue_request()
> assumes that a request has already been removed from the request list.
> However, neither blk_mq_tagset_busy_iter() nor nvme_cancel_request() remove
> a request from the request list before nvme_complete_rq() is called. I think
> this is what triggers the BUG_ON() statement in blk_mq_requeue_request().
> Have you noticed that e.g. the scsi-mq code only calls
> blk_mq_requeue_request() after __blk_mq_end_request() has finished? Have you
> considered to follow the same approach in nvme_cancel_request()?

Both nvme and scsi requeue through their mq_ops 'complete' callback, so nvme similarly waits for __blk_mq_end_request() before requesting a requeue. The problem, I think, is that nvme's IO cancelling path is observing active requests that it's requeuing from the queue_rq path.

Patch [11/11] kicks the requeue list unconditionally. This restarts queues the driver had just quiesced a moment before, restarting those requests, but the driver isn't ready to handle them. When the driver ultimately unbinds from the device, it requeues those requests a second time.

Either the requeuing must not kick the requeue work while the queue is quiesced, or the shutdown path needs to quiesce even when it hasn't restarted the queues. Either patch below appears to fix the issue.
---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ccd9cc5..078530c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -201,7 +201,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
 void nvme_requeue_req(struct request *req)
 {
-	blk_mq_requeue_request(req, true);
+	blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q));
 }
 EXPORT_SYMBOL_GPL(nvme_requeue_req);
--

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4b30fa2..a05da98 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1681,10 +1681,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	del_timer_sync(&dev->watchdog_timer);
 
 	mutex_lock(&dev->shutdown_lock);
-	if (pci_is_enabled(to_pci_dev(dev->dev))) {
-		nvme_stop_queues(&dev->ctrl);
+	nvme_stop_queues(&dev->ctrl);
+	if (pci_is_enabled(to_pci_dev(dev->dev)))
 		csts = readl(dev->bar + NVME_REG_CSTS);
-	}
 
 	queues = dev->online_queues - 1;
 	for (i = dev->queue_count - 1; i > 0; i--)
--
Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues
On 10/19/2016 03:14 PM, Keith Busch wrote:
> I'm running linux 4.9-rc1 + linux-block/for-linus, and alternating tests
> with and without this series. Without this, I'm not seeing any problems in
> a link-down test while running fio after ~30 runs. With this series, I only
> see the test pass infrequently. Most of the time I observe one of several
> failures. In all cases, it looks like the rq->queuelist is in an unexpected
> state. I think I've almost got this tracked down, but I have to leave for
> the day soon. Rather than having a more useful suggestion, I've put the two
> failures below.
>
> First failure:
>
> [  214.782098] kernel BUG at block/blk-mq.c:498!

Hello Keith,

Thank you for having taken the time to test this patch series. Since I think that the second and third failures are consequences of the first, I will focus on the first failure triggered by your tests. I assume that line 498 in blk-mq.c corresponds to BUG_ON(blk_queued_rq(rq))? Anyway, it seems to me that this is a bug in the NVMe code and also that this bug is completely unrelated to my patch series. In nvme_complete_rq() I see that blk_mq_requeue_request() is called. I don't think this is allowed from the context of nvme_cancel_request(), because blk_mq_requeue_request() assumes that a request has already been removed from the request list. However, neither blk_mq_tagset_busy_iter() nor nvme_cancel_request() removes a request from the request list before nvme_complete_rq() is called. I think this is what triggers the BUG_ON() statement in blk_mq_requeue_request(). Have you noticed that e.g. the scsi-mq code only calls blk_mq_requeue_request() after __blk_mq_end_request() has finished? Have you considered following the same approach in nvme_cancel_request()?

Thanks,

Bart.
Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues
Hi Bart,

I'm running linux 4.9-rc1 + linux-block/for-linus, and alternating tests with and without this series. Without this, I'm not seeing any problems in a link-down test while running fio after ~30 runs. With this series, I only see the test pass infrequently. Most of the time I observe one of several failures. In all cases, it looks like the rq->queuelist is in an unexpected state. I think I've almost got this tracked down, but I have to leave for the day soon. Rather than having a more useful suggestion, I've put the two failures below.

First failure:

[ 214.782075] [ cut here ]
[ 214.782098] kernel BUG at block/blk-mq.c:498!
[ 214.782117] invalid opcode: [#1] SMP
[ 214.782133] Modules linked in: nvme nvme_core nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security ebtable_filter ebtables ip6table_filter ip6_tables vfat fat
[ 214.782356] CPU: 6 PID: 160 Comm: kworker/u16:6 Not tainted 4.9.0-rc1+ #28
[ 214.782383] Hardware name: Gigabyte Technology Co., Ltd. Z97X-UD5H/Z97X-UD5H, BIOS F8 06/17/2014
[ 214.782419] Workqueue: nvme nvme_reset_work [nvme]
[ 214.782440] task: 8c0815403b00 task.stack: b6ad01384000
[ 214.782463] RIP: 0010:[] [] blk_mq_requeue_request+0x35/0x40
[ 214.782502] RSP: 0018:b6ad01387b88 EFLAGS: 00010287
[ 214.782524] RAX: 8c0814b98400 RBX: 8c0814b98200 RCX: 7530
[ 214.782551] RDX: 0007 RSI: 0001 RDI: 8c0814b98200
[ 214.782578] RBP: b6ad01387b98 R08: R09: 9f408680
[ 214.783394] R10: 0394 R11: 0388 R12: 0001
[ 214.784212] R13: 8c081593a000 R14: 0001 R15: 8c080cdea740
[ 214.785033] FS: () GS:8c081fb8() knlGS:
[ 214.785869] CS: 0010 DS: ES: CR0: 80050033
[ 214.786710] CR2: 7ffae4497f34 CR3: 0001dfe06000 CR4: 001406e0
[ 214.787559] Stack:
[ 214.788406] 8c0814b98200 b6ad01387ba8 c03451b3
[ 214.789287] b6ad01387bd0 c0357a4a 8c0814b98200 d6acffc81a00
[ 214.790174] 0006 b6ad01387bf8 9f3b8e22 8c0814b98200
[ 214.791066] Call Trace:
[ 214.791935] [] nvme_requeue_req+0x13/0x20 [nvme_core]
[ 214.792810] [] nvme_complete_rq+0x16a/0x1d0 [nvme]
[ 214.793680] [] __blk_mq_complete_request+0x72/0xe0
[ 214.794551] [] blk_mq_complete_request+0x1c/0x20
[ 214.795422] [] nvme_cancel_request+0x50/0x90 [nvme_core]
[ 214.796299] [] bt_tags_iter+0x2e/0x40
[ 214.797157] [] blk_mq_tagset_busy_iter+0x173/0x1e0
[ 214.798005] [] ? nvme_shutdown_ctrl+0x100/0x100 [nvme_core]
[ 214.798852] [] ? nvme_shutdown_ctrl+0x100/0x100 [nvme_core]
[ 214.799682] [] nvme_dev_disable+0x11d/0x380 [nvme]
[ 214.800511] [] ? acpi_unregister_gsi_ioapic+0x3a/0x40
[ 214.801344] [] ? dev_warn+0x6c/0x90
[ 214.802157] [] nvme_reset_work+0xa4/0xdc0 [nvme]
[ 214.802961] [] ? __switch_to+0x2b6/0x5f0
[ 214.803773] [] process_one_work+0x15f/0x430
[ 214.804593] [] worker_thread+0x4e/0x490
[ 214.805419] [] ? process_one_work+0x430/0x430
[ 214.806255] [] kthread+0xd9/0xf0
[ 214.807096] [] ? kthread_park+0x60/0x60
[ 214.807946] [] ret_from_fork+0x25/0x30
[ 214.808801] Code: 54 53 48 89 fb 41 89 f4 e8 a9 fa ff ff 48 8b 03 48 39 c3 75 16 41 0f b6 d4 48 89 df be 01 00 00 00 e8 10 ff ff ff 5b 41 5c 5d c3 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 be 40 00 00
[ 214.810714] RIP [] blk_mq_requeue_request+0x35/0x40
[ 214.811628] RSP
[ 214.812545] ---[ end trace 6ef3a3b6f8cea418 ]---
[ 214.813437] [ cut here ]

Second failure, warning followed by NMI watchdog:

[ 410.736619] [ cut here ]
[ 410.736624] WARNING: CPU: 2 PID: 577 at lib/list_debug.c:29 __list_add+0x62/0xb0
[ 410.736883] list_add corruption. next->prev should be prev (acf481847d78), but was 931f8fb78000. (next=931f8fb78000).
[ 410.736884] Modules linked in: nvme nvme_core nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_mangle iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle ebtable_filter ebtables ip6table_filter ip6_tables vfat fat
[ 410.736902] CPU: 2 PID: 577 Comm: kworker/2:1H Not tainted 4.9.0-rc1+ #28
[ 410.736903] Hardware name: Gigabyte Technology Co., Ltd. Z97X-UD5H/Z97X-UD5H,
Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues
On 10/18/2016 02:48 PM, Bart Van Assche wrote:
> - blk_mq_quiesce_queue() has been reworked (thanks to Ming Lin and Sagi
>   for their feedback).

(replying to my own e-mail)

A correction: Ming Lei provided feedback on v2 of this patch series instead of Ming Lin.

Bart.
[PATCH v3 0/11] Fix race conditions related to stopping block layer queues
Hello Jens,

Multiple block drivers need the functionality to stop a request queue and to wait until all ongoing request_fn() / queue_rq() calls have finished, without waiting until all outstanding requests have finished. Hence this patch series, which introduces the blk_mq_quiesce_queue() and blk_mq_resume_queue() functions. The dm-mq, SRP and NVMe patches in this patch series are three examples of where these functions are useful. These patches have been tested on top of kernel v4.9-rc1.

The following tests have been run to verify this patch series:
- My own srp-test suite that stress-tests SRP on top of dm-multipath.
- Mike's mptest suite that stress-tests dm-multipath.
- fio on top of the NVMeOF host driver that was connected to the NVMeOF target driver on the same host.
- Laurence verified the previous version (v2) of this patch series by running it through the Red Hat SRP test suite. Laurence also ran some NVMe tests (thanks Laurence!).

The changes compared to the second version of this patch series are:
- Changed the order of the patches in this patch series.
- Added several new patches: a patch that avoids that .queue_rq() gets invoked from the direct submission path if a queue has been stopped, and a patch that introduces the helper function blk_mq_hctx_stopped().
- blk_mq_quiesce_queue() has been reworked (thanks to Ming Lin and Sagi for their feedback).
- A bool 'kick' argument has been added to blk_mq_requeue_request().
- As proposed by Christoph, the code that waits for queuecommand() has been moved from the SRP transport driver to the SCSI core.

Changes between v2 and v1:
- Dropped the non-blk-mq changes from this patch series.
- Added support for hardware queues with BLK_MQ_F_BLOCKING set.
- Added a call stack to the description of the dm race fix patch.
- Dropped the non-scsi-mq changes from the SRP patch.
- Added a patch that introduces blk_mq_queue_stopped() in the dm driver.
The individual patches in this series are:

0001-blk-mq-Do-not-invoke-.queue_rq-for-a-stopped-queue.patch
0002-blk-mq-Introduce-blk_mq_hctx_stopped.patch
0003-blk-mq-Introduce-blk_mq_queue_stopped.patch
0004-blk-mq-Introduce-blk_mq_quiesce_queue.patch
0005-blk-mq-Add-a-kick_requeue_list-argument-to-blk_mq_re.patch
0006-dm-Use-BLK_MQ_S_STOPPED-instead-of-QUEUE_FLAG_STOPPE.patch
0007-dm-Fix-a-race-condition-related-to-stopping-and-star.patch
0008-SRP-transport-Move-queuecommand-wait-code-to-SCSI-co.patch
0009-SRP-transport-scsi-mq-Wait-for-.queue_rq-if-necessar.patch
0010-nvme-Use-BLK_MQ_S_STOPPED-instead-of-QUEUE_FLAG_STOP.patch
0011-nvme-Fix-a-race-condition.patch

Thanks,

Bart.