Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues

2016-10-20 Thread Bart Van Assche

On 10/20/2016 07:52 AM, Keith Busch wrote:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ccd9cc5..078530c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -201,7 +201,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)

 void nvme_requeue_req(struct request *req)
 {
-   blk_mq_requeue_request(req, true);
+   blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q));
 }
 EXPORT_SYMBOL_GPL(nvme_requeue_req);


Hello Keith,

What I had missed while preparing my patch series is that the NVMe
driver, unlike the dm driver, can call blk_mq_requeue_request() on a
stopped queue. So the above patch is needed to keep the current
semantics of the NVMe code. I will merge this patch into my patch series.
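
For clarity, the semantics relied on here can be sketched as follows (a
minimal illustration only, assuming the two-argument
blk_mq_requeue_request() from patch 05/11 and the blk_mq_queue_stopped()
helper from patch 03/11; the _sketch name is a placeholder and this is
not the literal block layer code):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/*
 * Sketch: put the request back on the requeue list, but only schedule
 * the requeue work when the queue is not stopped, so that requests stay
 * parked until the driver restarts the queue and kicks the list itself
 * (e.g. via blk_mq_start_stopped_hw_queues() and
 * blk_mq_kick_requeue_list()).
 */
static void nvme_requeue_req_sketch(struct request *req)
{
	blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q));
}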


Thanks,

Bart.


Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues

2016-10-20 Thread Keith Busch
On Wed, Oct 19, 2016 at 04:51:18PM -0700, Bart Van Assche wrote:
> 
> I assume that line 498 in blk-mq.c corresponds to BUG_ON(blk_queued_rq(rq))?
> Anyway, it seems to me like this is a bug in the NVMe code and also that
> this bug is completely unrelated to my patch series. In nvme_complete_rq() I
> see that blk_mq_requeue_request() is called. I don't think this is allowed
> from the context of nvme_cancel_request() because blk_mq_requeue_request()
> assumes that a request has already been removed from the request list.
> However, neither blk_mq_tagset_busy_iter() nor nvme_cancel_request() remove
> a request from the request list before nvme_complete_rq() is called. I think
> this is what triggers the BUG_ON() statement in blk_mq_requeue_request().
> Have you noticed that e.g. the scsi-mq code only calls
> blk_mq_requeue_request() after __blk_mq_end_request() has finished? Have you
> considered following the same approach in nvme_cancel_request()?

Both nvme and scsi requeue through their mq_ops 'complete' callback, so
nvme is similarly waiting for __blk_mq_end_request() before requesting
the requeue. The problem, I think, is that nvme's IO cancelling path is
requeuing requests that are still active in the queue_rq path.
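
To make that concrete, here is a rough sketch of the cancellation flow,
matching the call chain in the stack trace from my earlier report
(simplified and partly hypothetical; the error value and the helper
names ending in _sketch are placeholders, not the 4.9-rc1 sources):

#include <linux/blk-mq.h>

/*
 * Sketch of the flow nvme_dev_disable() triggers:
 * blk_mq_tagset_busy_iter() -> nvme_cancel_request() ->
 * blk_mq_complete_request() -> nvme_complete_rq() -> nvme_requeue_req().
 */
static void cancel_request_sketch(struct request *req, void *data, bool reserved)
{
	/* Complete the still-active request; the driver's .complete
	 * callback then asks the block layer to requeue it. */
	blk_mq_complete_request(req, -EINTR);
}

static void disable_sketch(struct blk_mq_tag_set *tagset)
{
	/* Walk every busy tag, i.e. every request .queue_rq() has seen. */
	blk_mq_tagset_busy_iter(tagset, cancel_request_sketch, NULL);
}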

Patch [11/11] kicks the requeue list unconditionally. This restarts queues
the driver had quiesced only a moment before, resubmitting requests the
driver isn't ready to handle. When the driver ultimately unbinds from the
device, it requeues those requests a second time.

Either the requeuing must not kick the requeue work while the queue is
quiesced, or the shutdown needs to quiesce even when it hasn't restarted
the queues. Either of the patches below appears to fix the issue.

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ccd9cc5..078530c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -201,7 +201,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
 
 void nvme_requeue_req(struct request *req)
 {
-   blk_mq_requeue_request(req, true);
+   blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q));
 }
 EXPORT_SYMBOL_GPL(nvme_requeue_req);
--

--- 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4b30fa2..a05da98 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1681,10 +1681,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	del_timer_sync(&dev->watchdog_timer);
 
 	mutex_lock(&dev->shutdown_lock);
-	if (pci_is_enabled(to_pci_dev(dev->dev))) {
-		nvme_stop_queues(&dev->ctrl);
+	nvme_stop_queues(&dev->ctrl);
+	if (pci_is_enabled(to_pci_dev(dev->dev)))
 		csts = readl(dev->bar + NVME_REG_CSTS);
-	}
 
 	queues = dev->online_queues - 1;
 	for (i = dev->queue_count - 1; i > 0; i--)
--


Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues

2016-10-19 Thread Bart Van Assche

On 10/19/2016 03:14 PM, Keith Busch wrote:

I'm running linux 4.9-rc1 + linux-block/for-linus, and alternating tests
with and without this series.

Without this, I'm not seeing any problems in a link-down test while
running fio after ~30 runs.

With this series, I only see the test pass infrequently. Most of the
time I observe one of several failures. In all cases, it looks like the
rq->queuelist is in an unexpected state.

I think I've almost got this tracked down, but I have to leave for the
day soon. Rather than having a more useful suggestion, I've put the two
failures below.


> First failure:
>

[  214.782098] kernel BUG at block/blk-mq.c:498!


Hello Keith,

Thank you for having taken the time to test this patch series. Since I 
think that the second and third failures are consequences of the first, 
I will focus on the first failure triggered by your tests.


I assume that line 498 in blk-mq.c corresponds to 
BUG_ON(blk_queued_rq(rq))? Anyway, it seems to me like this is a bug in 
the NVMe code and also that this bug is completely unrelated to my patch 
series. In nvme_complete_rq() I see that blk_mq_requeue_request() is 
called. I don't think this is allowed from the context of 
nvme_cancel_request() because blk_mq_requeue_request() assumes that a 
request has already been removed from the request list. However, neither 
blk_mq_tagset_busy_iter() nor nvme_cancel_request() remove a request 
from the request list before nvme_complete_rq() is called. I think this 
is what triggers the BUG_ON() statement in blk_mq_requeue_request(). 
Have you noticed that e.g. the scsi-mq code only calls 
blk_mq_requeue_request() after __blk_mq_end_request() has finished? Have 
you considered following the same approach in nvme_cancel_request()?
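
For reference, in the v4.9-rc1 time frame that helper looks roughly like
this (an approximate reconstruction, not a verbatim quote, and the exact
line number of the BUG_ON() may differ in your tree):

/* block/blk-mq.c, approximate 4.9-era version */
void blk_mq_requeue_request(struct request *rq)
{
	__blk_mq_requeue_request(rq);

	/* the request must no longer be linked into any queue list */
	BUG_ON(blk_queued_rq(rq));
	blk_mq_add_to_requeue_list(rq, true);
}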


Thanks,

Bart.


Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues

2016-10-19 Thread Keith Busch
Hi Bart,

I'm running linux 4.9-rc1 + linux-block/for-linus, and alternating tests
with and without this series.

Without this, I'm not seeing any problems in a link-down test while
running fio after ~30 runs.

With this series, I only see the test pass infrequently. Most of the
time I observe one of several failures. In all cases, it looks like the
rq->queuelist is in an unexpected state.

I think I've almost got this tracked down, but I have to leave for the
day soon. Rather than having a more useful suggestion, I've put the two
failures below.


First failure:

[  214.782075] [ cut here ]
[  214.782098] kernel BUG at block/blk-mq.c:498!
[  214.782117] invalid opcode:  [#1] SMP
[  214.782133] Modules linked in: nvme nvme_core nf_conntrack_netbios_ns 
nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack 
ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_raw 
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle 
ip6table_security iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security ebtable_filter 
ebtables ip6table_filter ip6_tables vfat fat
[  214.782356] CPU: 6 PID: 160 Comm: kworker/u16:6 Not tainted 4.9.0-rc1+ #28
[  214.782383] Hardware name: Gigabyte Technology Co., Ltd. 
Z97X-UD5H/Z97X-UD5H, BIOS F8 06/17/2014
[  214.782419] Workqueue: nvme nvme_reset_work [nvme]
[  214.782440] task: 8c0815403b00 task.stack: b6ad01384000
[  214.782463] RIP: 0010:[]  [] 
blk_mq_requeue_request+0x35/0x40
[  214.782502] RSP: 0018:b6ad01387b88  EFLAGS: 00010287
[  214.782524] RAX: 8c0814b98400 RBX: 8c0814b98200 RCX: 7530
[  214.782551] RDX: 0007 RSI: 0001 RDI: 8c0814b98200
[  214.782578] RBP: b6ad01387b98 R08:  R09: 9f408680
[  214.783394] R10: 0394 R11: 0388 R12: 0001
[  214.784212] R13: 8c081593a000 R14: 0001 R15: 8c080cdea740
[  214.785033] FS:  () GS:8c081fb8() 
knlGS:
[  214.785869] CS:  0010 DS:  ES:  CR0: 80050033
[  214.786710] CR2: 7ffae4497f34 CR3: 0001dfe06000 CR4: 001406e0
[  214.787559] Stack:
[  214.788406]  8c0814b98200  b6ad01387ba8 
c03451b3
[  214.789287]  b6ad01387bd0 c0357a4a 8c0814b98200 
d6acffc81a00
[  214.790174]  0006 b6ad01387bf8 9f3b8e22 
8c0814b98200
[  214.791066] Call Trace:
[  214.791935]  [] nvme_requeue_req+0x13/0x20 [nvme_core]
[  214.792810]  [] nvme_complete_rq+0x16a/0x1d0 [nvme]
[  214.793680]  [] __blk_mq_complete_request+0x72/0xe0
[  214.794551]  [] blk_mq_complete_request+0x1c/0x20
[  214.795422]  [] nvme_cancel_request+0x50/0x90 [nvme_core]
[  214.796299]  [] bt_tags_iter+0x2e/0x40
[  214.797157]  [] blk_mq_tagset_busy_iter+0x173/0x1e0
[  214.798005]  [] ? nvme_shutdown_ctrl+0x100/0x100 
[nvme_core]
[  214.798852]  [] ? nvme_shutdown_ctrl+0x100/0x100 
[nvme_core]
[  214.799682]  [] nvme_dev_disable+0x11d/0x380 [nvme]
[  214.800511]  [] ? acpi_unregister_gsi_ioapic+0x3a/0x40
[  214.801344]  [] ? dev_warn+0x6c/0x90
[  214.802157]  [] nvme_reset_work+0xa4/0xdc0 [nvme]
[  214.802961]  [] ? __switch_to+0x2b6/0x5f0
[  214.803773]  [] process_one_work+0x15f/0x430
[  214.804593]  [] worker_thread+0x4e/0x490
[  214.805419]  [] ? process_one_work+0x430/0x430
[  214.806255]  [] kthread+0xd9/0xf0
[  214.807096]  [] ? kthread_park+0x60/0x60
[  214.807946]  [] ret_from_fork+0x25/0x30
[  214.808801] Code: 54 53 48 89 fb 41 89 f4 e8 a9 fa ff ff 48 8b 03 48 39 c3 
75 16 41 0f b6 d4 48 89 df be 01 00 00 00 e8 10 ff ff ff 5b 41 5c 5d c3 <0f> 0b 
66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 be 40 00 00
[  214.810714] RIP  [] blk_mq_requeue_request+0x35/0x40
[  214.811628]  RSP 
[  214.812545] ---[ end trace 6ef3a3b6f8cea418 ]---
[  214.813437] [ cut here ]


Second failure, warning followed by NMI watchdog:

[  410.736619] [ cut here ]
[  410.736624] WARNING: CPU: 2 PID: 577 at lib/list_debug.c:29 
__list_add+0x62/0xb0
[  410.736883] list_add corruption. next->prev should be prev 
(acf481847d78), but was 931f8fb78000. (next=931f8fb78000).
[  410.736884] Modules linked in: nvme nvme_core nf_conntrack_netbios_ns 
nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_conntrack 
ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_security 
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw 
ip6table_mangle iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle ebtable_filter 
ebtables ip6table_filter ip6_tables vfat fat
[  410.736902] CPU: 2 PID: 577 Comm: kworker/2:1H Not tainted 4.9.0-rc1+ #28
[  410.736903] Hardware name: Gigabyte Technology Co., Ltd. 
Z97X-UD5H/Z97X-UD5H,

Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues

2016-10-18 Thread Bart Van Assche

On 10/18/2016 02:48 PM, Bart Van Assche wrote:

- blk_mq_quiesce_queue() has been reworked (thanks to Ming Lin and Sagi
   for their feedback).


(replying to my own e-mail)

A correction: Ming Lei provided feedback on v2 of this patch series 
instead of Ming Lin.


Bart.


[PATCH v3 0/11] Fix race conditions related to stopping block layer queues

2016-10-18 Thread Bart Van Assche

Hello Jens,

Multiple block drivers need the functionality to stop a request queue 
and to wait until all ongoing request_fn() / queue_rq() calls have 
finished without waiting until all outstanding requests have finished. 
Hence this patch series, which introduces the blk_mq_quiesce_queue() and 
blk_mq_resume_queue() functions. The dm-mq, SRP and NVMe patches in this 
patch series are three examples of where these functions are useful. 
These patches have been tested on top of kernel v4.9-rc1. The following 
tests have been run to verify this patch series:

- My own srp-test suite that stress-tests SRP on top of dm-multipath.
- Mike's mptest suite that stress-tests dm-multipath.
- fio on top of the NVMeOF host driver that was connected to the NVMeOF
  target driver on the same host.
- Laurence verified the previous version (v2) of this patch series by
  running it through the Red Hat SRP test suite. Laurence also ran some
  NVMe tests (thanks Laurence!).

The changes compared to the second version of this patch series are:
- Changed the order of the patches in this patch series.
- Added several new patches: a patch that prevents .queue_rq() from being
  invoked from the direct submission path if a queue has been stopped,
  and a patch that introduces the helper function
  blk_mq_hctx_stopped().
- blk_mq_quiesce_queue() has been reworked (thanks to Ming Lin and Sagi
  for their feedback).
- A bool 'kick' argument has been added to blk_mq_requeue_request().
- As proposed by Christoph, the code that waits for queuecommand() has
  been moved from the SRP transport driver to the SCSI core.

Changes between v2 and v1:
- Dropped the non-blk-mq changes from this patch series.
- Added support for hardware queues with BLK_MQ_F_BLOCKING set.
- Added a call stack to the description of the dm race fix patch.
- Dropped the non-scsi-mq changes from the SRP patch.
- Added a patch that introduces blk_mq_queue_stopped() in the dm driver.

The individual patches in this series are:

0001-blk-mq-Do-not-invoke-.queue_rq-for-a-stopped-queue.patch
0002-blk-mq-Introduce-blk_mq_hctx_stopped.patch
0003-blk-mq-Introduce-blk_mq_queue_stopped.patch
0004-blk-mq-Introduce-blk_mq_quiesce_queue.patch
0005-blk-mq-Add-a-kick_requeue_list-argument-to-blk_mq_re.patch
0006-dm-Use-BLK_MQ_S_STOPPED-instead-of-QUEUE_FLAG_STOPPE.patch
0007-dm-Fix-a-race-condition-related-to-stopping-and-star.patch
0008-SRP-transport-Move-queuecommand-wait-code-to-SCSI-co.patch
0009-SRP-transport-scsi-mq-Wait-for-.queue_rq-if-necessar.patch
0010-nvme-Use-BLK_MQ_S_STOPPED-instead-of-QUEUE_FLAG_STOP.patch
0011-nvme-Fix-a-race-condition.patch
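
As a usage illustration, here is a minimal sketch of how a blk-mq driver
is expected to use the new interface around a teardown path (the function
names follow this cover letter; the exact signatures are defined by the
patches above, so treat this as a sketch rather than the final API):

#include <linux/blk-mq.h>

static void example_teardown(struct request_queue *q)
{
	/* Wait until all ongoing .queue_rq() calls have finished. */
	blk_mq_quiesce_queue(q);

	/* ... driver-specific cancellation / hardware shutdown ... */

	/* Allow dispatching requests again. */
	blk_mq_resume_queue(q);
}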

Thanks,

Bart.