Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-11 Thread Jaesoo Lee
Please drop this patch. However, I would be happy if this bug could be fixed as soon as possible. Nitzan, would you mind sending your patch for review?
On Tue, Dec 11, 2018 at 3:39 PM Sagi Grimberg wrote:
>
> > I cannot reproduce the bug with the patch; in my failure scenarios, it
> > seems that

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-11 Thread Sagi Grimberg
I cannot reproduce the bug with the patch; in my failure scenarios, it seems that completing the request on errors in nvme_rdma_send_done unblocks __nvme_submit_sync_cmd. Also, I think this is safe from double completions. However, it seems that nvme_rdma_timeout code is still

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-11 Thread Jaesoo Lee
I cannot reproduce the bug with the patch; in my failure scenarios, it seems that completing the request on errors in nvme_rdma_send_done unblocks __nvme_submit_sync_cmd. Also, I think this is safe from double completions. However, it seems that nvme_rdma_timeout code is still

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-11 Thread Nitzan Carmi
I was just in the middle of sending this upstream when I saw your mail, and I also thought that it addresses the same bug, although I see a slightly different call trace than yours. I would be happy if you could verify that this patch works for you too, so that we can push it upstream. On

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-10 Thread Jaesoo Lee
It seems that your patch is addressing the same bug. I will see if that works for our failure scenarios. Why don't you submit it upstream?
On Sun, Dec 9, 2018 at 6:22 AM Nitzan Carmi wrote:
>
> Hi,
> We encountered a similar issue.
> I think that the problem is that error_recovery might not even be

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-09 Thread Nitzan Carmi
Hi, We encountered a similar issue. I think that the problem is that error_recovery might not even be queued if we are in the DELETING state (or the CONNECTING state, for that matter), because we cannot move from those states to RESETTING. We prepared some patches which handle completions in case
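
The state restriction described in this message can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the driver's code: `model_state`, `model_ctrl`, `model_change_state()`, `model_error_recovery()` and the `err_work_queued` flag are hypothetical names invented for illustration, and only the one transition rule mentioned in the message (no move to RESETTING from DELETING or CONNECTING) is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified controller states; a stand-in, not the kernel's enum. */
enum model_state {
	MODEL_LIVE,
	MODEL_RESETTING,
	MODEL_CONNECTING,
	MODEL_DELETING,
};

struct model_ctrl {
	enum model_state state;
	bool err_work_queued;	/* hypothetical: recovery work was queued */
};

/*
 * Only the rule from the message is modelled: entering RESETTING is
 * refused from DELETING and CONNECTING. Every other transition is
 * allowed here purely to keep the sketch short.
 */
static bool model_change_state(struct model_ctrl *ctrl, enum model_state next)
{
	assert(ctrl != NULL);

	if (next == MODEL_RESETTING &&
	    (ctrl->state == MODEL_DELETING || ctrl->state == MODEL_CONNECTING))
		return false;

	ctrl->state = next;
	return true;
}

/*
 * Recovery work is queued only when the state change succeeds. When it
 * fails, nothing in this model is left to complete outstanding requests,
 * which is the gap the message points at.
 */
static bool model_error_recovery(struct model_ctrl *ctrl)
{
	if (!model_change_state(ctrl, MODEL_RESETTING))
		return false;

	ctrl->err_work_queued = true;
	return true;
}
```

In this model, calling `model_error_recovery()` on a controller in `MODEL_DELETING` or `MODEL_CONNECTING` returns false and leaves `err_work_queued` unset, so a request that timed out in those states would have to be completed by some other path.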

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-07 Thread Jaesoo Lee
Now I see that my patch is not safe and can cause double completions. However, I am having a hard time finding a good way to serialize the racing completions. Could you suggest where the fix should go and what it should look like? We can provide more details on reproducing this issue if

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-07 Thread Keith Busch
On Fri, Dec 07, 2018 at 12:05:37PM -0800, Sagi Grimberg wrote:
>
> > Could you please take a look at this bug and review the code?
> >
> > We are seeing more instances of this bug and found that reconnect_work
> > could hang as well, as can be seen from the stack trace below.
> >
> > Workqueue:

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-07 Thread Sagi Grimberg
Could you please take a look at this bug and review the code? We are seeing more instances of this bug and found that reconnect_work could hang as well, as can be seen from the stack trace below.
Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Call Trace:
 __schedule+0x2ab/0x880

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-06 Thread Jaesoo Lee
Could you please take a look at this bug and review the code? We are seeing more instances of this bug and found that reconnect_work could hang as well, as can be seen from the stack trace below.
Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Call Trace:
 __schedule+0x2ab/0x880

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-11-29 Thread Jaesoo Lee
Not the queue, but the RDMA connections. Let me describe the scenario.
1. Connect an nvme-rdma target with 500 namespaces: this makes nvme_remove_namespaces() take a long time to complete and opens the window that is vulnerable to this bug.
2. The host will take the code path below for

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-11-29 Thread Sagi Grimberg
This does not hold, at least for the NVMe RDMA host driver. An example scenario is when the RDMA connection is gone while the controller is being deleted. In this case, the nvmf_reg_write32() call that delete_work uses to send the shutdown admin command could hang forever if the command is not completed

[PATCH] nvme-rdma: complete requests from ->timeout

2018-11-29 Thread Jaesoo Lee
After f6e7d48 (block: remove BLK_EH_HANDLED), the low-level device driver is responsible for completing the timed-out request, and a series of changes was subsequently submitted for various LLDDs to complete requests from ->timeout. However, adding the completion code in the NVMe driver was skipped with
