Please drop this patch. However, I would be happy if this bug could be
fixed as soon as possible.
Nitzan, would you mind sending your patch for review?
On Tue, Dec 11, 2018 at 3:39 PM Sagi Grimberg wrote:
>
> > I cannot reproduce the bug with the patch; in my failure scenarios, it
> > seems that completing the request on errors in nvme_rdma_send_done
> > unblocks __nvme_submit_sync_cmd. Also, I think this is safe from
> > double completions.
> > However, it seems that the nvme_rdma_timeout code is still
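A minimal sketch of the approach described above, assuming the nvme-rdma
structures of that era (the status code and the elided normal path are
illustrative, not the actual patch):

static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct nvme_rdma_qe *qe =
		container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe);
	struct nvme_rdma_request *req =
		container_of(qe, struct nvme_rdma_request, sqe);
	struct request *rq = blk_mq_rq_from_pdu(req);

	if (unlikely(wc->status != IB_WC_SUCCESS)) {
		/* Fail the request on a SEND error so a caller blocked
		 * in __nvme_submit_sync_cmd() is woken up instead of
		 * waiting forever for a response that will never arrive.
		 */
		nvme_req(rq)->status = NVME_SC_ABORT_REQ; /* illustrative */
		blk_mq_complete_request(rq);
		return;
	}

	/* normal completion path elided */
}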
I was just in the middle of sending this upstream when I saw your
mail, and I too thought that it addresses the same bug, although I see a
slightly different call trace than yours.
I would be happy if you could verify that this patch works for you too,
so that we can push it upstream.
It seems that your patch addresses the same bug. I will check whether
it works for our failure scenarios.
Why don't you send it upstream?
On Sun, Dec 9, 2018 at 6:22 AM Nitzan Carmi wrote:
>
> Hi,
> We encountered a similar issue.
> I think the problem is that error_recovery might not even be
> queued when we're in DELETING state (or CONNECTING state, for that
> matter), because we cannot move from those states to RESETTING.
> We prepared some patches which handle completions in case
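For reference, the gate being described here is the state check at the
top of nvme_rdma_error_recovery(), which at the time looked roughly like
this (simplified):

static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
{
	/* If the controller is in DELETING or CONNECTING, this state
	 * transition fails and err_work is never queued, so nothing
	 * ever completes the in-flight requests.
	 */
	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
		return;

	queue_work(nvme_wq, &ctrl->err_work);
}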
Now I see that my patch is not safe and can cause double completions.
However, I am having a hard time finding a good way to serialize the
racing completions.
Could you suggest where the fix should go and what it should look
like? We can provide more details on reproducing this issue if
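One hedged idea for such a barrier, as a sketch only (the flags field
and bit name are hypothetical additions, not existing driver members):
let the racing paths race on an atomic bit so that only the first one
actually completes the request.

/* Assumes an `unsigned long flags` member is added to
 * struct nvme_rdma_request; both names are hypothetical.
 */
enum {
	NVME_RDMA_REQ_COMPLETED = 0,	/* bit index in req->flags */
};

static void nvme_rdma_complete_rq_once(struct nvme_rdma_request *req)
{
	/* Whichever path (timeout handler or CQE handler) gets here
	 * first completes the request; the loser returns quietly.
	 */
	if (test_and_set_bit(NVME_RDMA_REQ_COMPLETED, &req->flags))
		return;

	blk_mq_complete_request(blk_mq_rq_from_pdu(req));
}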
On Fri, Dec 07, 2018 at 12:05:37PM -0800, Sagi Grimberg wrote:
>
> > Could you please take a look at this bug and review the code?
> >
> > We are seeing more instances of this bug and found that reconnect_work
> > could hang as well, as can be seen from the stack trace below.
> >
> > Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
> > Call Trace:
> >  __schedule+0x2ab/0x880
Not the queue, but the RDMA connections.
Let me describe the scenario.
1. Connect to an nvme-rdma target with 500 namespaces:
   this makes nvme_remove_namespaces() take a long time to
   complete and opens the window vulnerable to this bug.
2. The host will take the below code path for
This does not hold, at least for the NVMe RDMA host driver. An example
scenario is when the RDMA connection is gone while the controller is being
deleted. In this case, the nvmf_reg_write32() that sends the shutdown admin
command from delete_work could hang forever if the command is not completed
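To spell out the blocking path, a simplified call chain based on the
core/fabrics code of that era (exact frames may differ by kernel
version):

nvme_delete_ctrl work
  -> nvme_rdma_shutdown_ctrl()
    -> nvme_shutdown_ctrl()
      -> ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ...)  /* nvmf_reg_write32() */
        -> __nvme_submit_sync_cmd()
          -> blk_execute_rq()  /* waits on a completion that never fires */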
After f6e7d48 (block: remove BLK_EH_HANDLED), the low-level device driver
is responsible for completing the timed-out request, and a series of changes
was submitted for various LLDDs to complete such requests from ->timeout.
However, adding the completion code to the NVMe driver was skipped with
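For context, a hedged sketch of what that responsibility would look like
in the nvme-rdma ->timeout handler after that change (the state check and
status code here are illustrative, not the skipped patch itself):

static enum blk_eh_timer_return
nvme_rdma_timeout(struct request *rq, bool reserved)
{
	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
	struct nvme_rdma_ctrl *ctrl = req->queue->ctrl;

	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
		/* Teardown path: the block layer no longer completes the
		 * request for us, so do it here and tell the layer we
		 * are done with it.
		 */
		nvme_req(rq)->status = NVME_SC_ABORT_REQ; /* illustrative */
		blk_mq_complete_request(rq);
		return BLK_EH_DONE;
	}

	/* Otherwise kick error recovery and give the command more time. */
	nvme_rdma_error_recovery(ctrl);
	return BLK_EH_RESET_TIMER;
}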