RE: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
-Original Message- From: Wei Fang [mailto:fangw...@huawei.com] Sent: June-01-16 10:37 PM >The concurrently decrements of host_failed only lead to abnormal of >host_failed, host_busy will be zero after error handler, and >the result may >be host_failed > host_busy forever. But in your case, host_busy > host_failed, >so I think it's not the same case. I'm >afraid that this patch can't fix your >scsi hang issue. I tested the patch and you (and others also suggested similar) are correct that it does not fix the error handling issue I am having with port multipliers and CD-ROM+HDD. Kevin -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
On 2016/6/2 10:37, Wei Fang wrote: > Hi, Kevin, > > On 2016/6/1 22:36, Kevin Groeneveld wrote: >>> Subject: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of >>> ->host_failed >> >> I wonder if this could be related to >> http://www.spinics.net/lists/linux-scsi/msg86808.html? >> >> I never did get to the bottom of that. If I have time I hope to retest my >> scsi hang issue with this patch. >> > > The concurrently decrements of host_failed only lead to abnormal > of host_failed, host_busy will be zero after error handler, and > the result may be host_failed > host_busy forever. But in your > case, host_busy > host_failed, so I think it's not the same > case. I'm afraid that this patch can't fix your scsi hang issue. Something wrong in my words. host_busy may not be zero after error handler, but the result is true that missing decrement of host_failed may lead to host_failed > host_busy. Thanks, Wei -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
Hi, Kevin, On 2016/6/1 22:36, Kevin Groeneveld wrote: >> Subject: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of >> ->host_failed > > I wonder if this could be related to > http://www.spinics.net/lists/linux-scsi/msg86808.html? > > I never did get to the bottom of that. If I have time I hope to retest my > scsi hang issue with this patch. > The concurrently decrements of host_failed only lead to abnormal of host_failed, host_busy will be zero after error handler, and the result may be host_failed > host_busy forever. But in your case, host_busy > host_failed, so I think it's not the same case. I'm afraid that this patch can't fix your scsi hang issue. Thanks, Wei -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
Hi, James, On 2016/6/1 22:06, James Bottomley wrote: > On Tue, 2016-05-31 at 16:38 +0800, Wei Fang wrote: >> sas_ata_strategy_handler() adds the works of the ata error handler >> to system_unbound_wq. This workqueue asynchronously runs work items, >> so the ata error handler will be performed concurrently on different >> CPUs. In this case, ->host_failed will be decreased simultaneously in >> scsi_eh_finish_cmd() on different CPUs, and become abnormal. >> >> It will lead to permanently inequal between ->host_failed and >> ->host_busy, and scsi error handler thread won't become running. >> IO errors after that won't be handled forever. >> >> Use atomic type for ->host_failed to fix this race. > > As I said previously, you don't need atomics to do this, could you just > remove the decrement in scsi_eh_finish_command() and zero the counter > after the strategy handler completes. > OK, I'll send v3 later. Thanks, Wei -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
On Wed, 2016-06-01 at 08:29 -0700, Bart Van Assche wrote: > On 06/01/2016 07:36 AM, Kevin Groeneveld wrote: > > > Subject: [PATCH v2 1/2] scsi: fix race between simultaneous > > > decrements of ->host_failed > > > > I wonder if this could be related to > > http://www.spinics.net/lists/linux-scsi/msg86808.html? > > > > I never did get to the bottom of that. If I have time I hope to > > retest my scsi hang issue with this patch. > > I've never seen that hang with the ib_srp driver (which is a SCSI > LLD). That probably means that the SATA code uses the SCSI error > handler in another way than other SCSI LLDs. It's the ata_sas error handler, which is split between libsas and libata. It does one thread of execution per ata port when the strategy handler is invoked. This behaviour is pretty much unique at the moment. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
On 06/01/2016 07:36 AM, Kevin Groeneveld wrote: Subject: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed I wonder if this could be related to http://www.spinics.net/lists/linux-scsi/msg86808.html? I never did get to the bottom of that. If I have time I hope to retest my scsi hang issue with this patch. I've never seen that hang with the ib_srp driver (which is a SCSI LLD). That probably means that the SATA code uses the SCSI error handler in another way than other SCSI LLDs. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
> Subject: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of > ->host_failed I wonder if this could be related to http://www.spinics.net/lists/linux-scsi/msg86808.html? I never did get to the bottom of that. If I have time I hope to retest my scsi hang issue with this patch. Kevin -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
On Tue, 2016-05-31 at 16:38 +0800, Wei Fang wrote: > sas_ata_strategy_handler() adds the works of the ata error handler > to system_unbound_wq. This workqueue asynchronously runs work items, > so the ata error handler will be performed concurrently on different > CPUs. In this case, ->host_failed will be decreased simultaneously in > scsi_eh_finish_cmd() on different CPUs, and become abnormal. > > It will lead to permanently inequal between ->host_failed and > ->host_busy, and scsi error handler thread won't become running. > IO errors after that won't be handled forever. > > Use atomic type for ->host_failed to fix this race. As I said previously, you don't need atomics to do this, could you just remove the decrement in scsi_eh_finish_command() and zero the counter after the strategy handler completes. Thanks, James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
On Tue, May 31, 2016 at 8:21 PM, Dan Williams wrote: > On Tue, May 31, 2016 at 1:38 AM, Wei Fang wrote: >> sas_ata_strategy_handler() adds the works of the ata error handler >> to system_unbound_wq. This workqueue asynchronously runs work items, >> so the ata error handler will be performed concurrently on different >> CPUs. In this case, ->host_failed will be decreased simultaneously in >> scsi_eh_finish_cmd() on different CPUs, and become abnormal. >> >> It will lead to permanently inequal between ->host_failed and >> ->host_busy, and scsi error handler thread won't become running. >> IO errors after that won't be handled forever. >> >> Use atomic type for ->host_failed to fix this race. >> >> This fixes the problem introduced in >> commit 50824d6c5657 ("[SCSI] libsas: async ata-eh"). >> >> Reviewed-by: Christoph Hellwig >> Signed-off-by: Wei Fang > > Acked-by: Dan Williams Should also tag this for -stable. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
On Tue, May 31, 2016 at 1:38 AM, Wei Fang wrote: > sas_ata_strategy_handler() adds the works of the ata error handler > to system_unbound_wq. This workqueue asynchronously runs work items, > so the ata error handler will be performed concurrently on different > CPUs. In this case, ->host_failed will be decreased simultaneously in > scsi_eh_finish_cmd() on different CPUs, and become abnormal. > > It will lead to permanently inequal between ->host_failed and > ->host_busy, and scsi error handler thread won't become running. > IO errors after that won't be handled forever. > > Use atomic type for ->host_failed to fix this race. > > This fixes the problem introduced in > commit 50824d6c5657 ("[SCSI] libsas: async ata-eh"). > > Reviewed-by: Christoph Hellwig > Signed-off-by: Wei Fang Acked-by: Dan Williams -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
Hi, Tejun, On 2016/5/31 22:33, Tejun Heo wrote: > On Tue, May 31, 2016 at 04:38:17PM +0800, Wei Fang wrote: >> sas_ata_strategy_handler() adds the works of the ata error handler >> to system_unbound_wq. This workqueue asynchronously runs work items, > > Are there more than one error handling work items per host? The ata error handler here means async_sas_ata_eh(), every port will execute it's own async_sas_ata_eh() in sas_ata_strategy_handler(). Thanks, Wei > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
On Tue, May 31, 2016 at 04:38:17PM +0800, Wei Fang wrote: > sas_ata_strategy_handler() adds the works of the ata error handler > to system_unbound_wq. This workqueue asynchronously runs work items, Are there more than one error handling work items per host? Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
sas_ata_strategy_handler() adds the works of the ata error handler to system_unbound_wq. This workqueue asynchronously runs work items, so the ata error handler will be performed concurrently on different CPUs. In this case, ->host_failed will be decreased simultaneously in scsi_eh_finish_cmd() on different CPUs, and become abnormal. It will lead to permanently inequal between ->host_failed and ->host_busy, and scsi error handler thread won't become running. IO errors after that won't be handled forever. Use atomic type for ->host_failed to fix this race. This fixes the problem introduced in commit 50824d6c5657 ("[SCSI] libsas: async ata-eh"). Reviewed-by: Christoph Hellwig Signed-off-by: Wei Fang --- drivers/ata/libata-eh.c |2 +- drivers/scsi/libsas/sas_scsi_host.c |5 +++-- drivers/scsi/scsi.c |2 +- drivers/scsi/scsi_error.c | 15 +-- drivers/scsi/scsi_lib.c |3 ++- include/scsi/scsi_host.h|2 +- 6 files changed, 17 insertions(+), 12 deletions(-) diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c index 961acc7..a0e7612 100644 --- a/drivers/ata/libata-eh.c +++ b/drivers/ata/libata-eh.c @@ -606,7 +606,7 @@ void ata_scsi_error(struct Scsi_Host *host) ata_scsi_port_error_handler(host, ap); /* finish or retry handled scmd's and clean up */ - WARN_ON(host->host_failed || !list_empty(&eh_work_q)); + WARN_ON(atomic_read(&host->host_failed) || !list_empty(&eh_work_q)); DPRINTK("EXIT\n"); } diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c index 519dac4..8d74003 100644 --- a/drivers/scsi/libsas/sas_scsi_host.c +++ b/drivers/scsi/libsas/sas_scsi_host.c @@ -757,7 +757,8 @@ retry: spin_unlock_irq(shost->host_lock); SAS_DPRINTK("Enter %s busy: %d failed: %d\n", - __func__, atomic_read(&shost->host_busy), shost->host_failed); + __func__, atomic_read(&shost->host_busy), + atomic_read(&shost->host_failed)); /* * Deal with commands that still have SAS tasks (i.e. they didn't * complete via the normal sas_task completion mechanism), @@ -800,7 +801,7 @@ out: SAS_DPRINTK("--- Exit %s: busy: %d failed: %d tries: %d\n", __func__, atomic_read(&shost->host_busy), - shost->host_failed, tries); + atomic_read(&shost->host_failed), tries); } enum blk_eh_timer_return sas_scsi_timed_out(struct scsi_cmnd *cmd) diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index 1deb6ad..7840915 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -527,7 +527,7 @@ void scsi_log_completion(struct scsi_cmnd *cmd, int disposition) scmd_printk(KERN_INFO, cmd, "scsi host busy %d failed %d\n", atomic_read(&cmd->device->host->host_busy), - cmd->device->host->host_failed); + atomic_read(&cmd->device->host->host_failed)); } } } diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 984ddcb..f37666f 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -62,7 +62,8 @@ static int scsi_try_to_abort_cmd(struct scsi_host_template *, /* called with shost->host_lock held */ void scsi_eh_wakeup(struct Scsi_Host *shost) { - if (atomic_read(&shost->host_busy) == shost->host_failed) { + if (atomic_read(&shost->host_busy) == + atomic_read(&shost->host_failed)) { trace_scsi_eh_wakeup(shost); wake_up_process(shost->ehandler); SCSI_LOG_ERROR_RECOVERY(5, shost_printk(KERN_INFO, shost, @@ -250,7 +251,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag) eh_flag &= ~SCSI_EH_CANCEL_CMD; scmd->eh_eflags |= eh_flag; list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q); - shost->host_failed++; + atomic_inc(&shost->host_failed); scsi_eh_wakeup(shost); out_unlock: spin_unlock_irqrestore(shost->host_lock, flags); @@ -1127,7 +1128,7 @@ static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn) */ void scsi_eh_finish_cmd(struct scsi_cmnd *scmd, struct list_head *done_q) { - scmd->device->host->host_failed--; + atomic_dec(&scmd->device->host->host_failed); scmd->eh_eflags = 0; list_move_tail(&scmd->eh_entry, done_q); } @@ -2190,8 +2191,10 @@ int scsi_error_handler(void *data) if (kthread_should_stop()) break; - if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) || - shost->host_failed != atomic_read(&shost->host_busy)) { + if ((atomic_read(&shost->host_failed) == 0 && +