Re: [PATCH] pm80xx: remove the SCSI host before detaching from SAS transport

2015-11-09 Thread Martin K. Petersen
> "Benjamin" == Benjamin Rood  writes:

Benjamin> Previously, when this module was unloaded via 'rmmod' with at
Benjamin> least one drive attached, the SCSI error handler thread would
Benjamin> become stuck in an infinite recovery loop and lockup the
Benjamin> system, necessitating a reboot.

Applied.

-- 
Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pm80xx: remove the SCSI host before detaching from SAS transport

2015-11-03 Thread Christoph Hellwig
On Mon, Nov 02, 2015 at 09:29:13AM +0100, Jack Wang wrote:
> Thanks for digging it down. Looks good to me.
> 
> Seems other libsas based driver all affected, maybe we should also fix them?
> Christoph (cc-ed), could you share your opinion about this fix, as the
> commit cff549e4860f is from you?

scsi_remove_host should be done first in ->remove and the commit
was written under that assumption.  So it looks like the other drivers
will need fixes as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pm80xx: remove the SCSI host before detaching from SAS transport

2015-11-02 Thread Jack Wang
2015-10-30 21:01 GMT+01:00 Benjamin Rood :
> Previously, when this module was unloaded via 'rmmod' with at least one
> drive attached, the SCSI error handler thread would become stuck in an
> infinite recovery loop and lockup the system, necessitating a reboot.
>
> Once the SAS layer is detached, the driver will fail any subsequent
> commands since the target devices are removed.  However, removing the
> SCSI host generates a SYNCHRONIZE CACHE (10) command, which was failed
> and left the error handler no method of recovery.
>
> This patch simply removes the SCSI host first so that no more commands
> can come down, prior to cleaning up the SAS layer.  Note that the stack
> is built up with the SCSI host first, and then the SAS layer.  Perhaps
> it should be reversed for symmetry, so that commands cannot be sent to
> the pm80xx driver prior to attaching the SAS layer?
>
> What was really strange about this bug was that it was introduced at
> commit cff549e4860f ("[SCSI]: proper state checking and module refcount
> handling in scsi_device_get").  This commit appears to tinker with how
> the reference counting is performed for SCSI device objects.  My theory
> is that prior to this commit, the refcount for a device object was
> blindly incremented at some point during the teardown process which
> coincidentially made the device stick around during the procedure, which
> also coincidentially made any commands sent to the driver not fail
> (since the device was technically still "there").  After this commit was
> applied, my theory is the refcount for the device object is not being
> incremented at a specific point anymore, which makes the device go away,
> and thus made the pm80xx driver fail any subsequent commands.
>
> You may also want to see the following for more details:
>
> [1] http://www.spinics.net/lists/linux-scsi/msg37208.html
> [2] http://marc.info/?l=linux-scsi&m=144416476406993&w=2
>
> Signed-off-by: Benjamin Rood 
> ---
>  drivers/scsi/pm8001/pm8001_init.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/pm8001/pm8001_init.c 
> b/drivers/scsi/pm8001/pm8001_init.c
> index bdc624f..b1b2dd7 100644
> --- a/drivers/scsi/pm8001/pm8001_init.c
> +++ b/drivers/scsi/pm8001/pm8001_init.c
> @@ -1096,10 +1096,10 @@ static void pm8001_pci_remove(struct pci_dev *pdev)
> struct pm8001_hba_info *pm8001_ha;
> int i, j;
> pm8001_ha = sha->lldd_ha;
> +   scsi_remove_host(pm8001_ha->shost);
> sas_unregister_ha(sha);
> sas_remove_host(pm8001_ha->shost);
> list_del(&pm8001_ha->list);
> -   scsi_remove_host(pm8001_ha->shost);
> PM8001_CHIP_DISP->interrupt_disable(pm8001_ha, 0xFF);
> PM8001_CHIP_DISP->chip_soft_rst(pm8001_ha);
>
> --
> 2.4.3
>
Thanks for digging it down. Looks good to me.

Seems other libsas based driver all affected, maybe we should also fix them?
Christoph (cc-ed), could you share your opinion about this fix, as the
commit cff549e4860f is from you?

Regards,
Jack
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] pm80xx: remove the SCSI host before detaching from SAS transport

2015-10-30 Thread Benjamin Rood
Previously, when this module was unloaded via 'rmmod' with at least one
drive attached, the SCSI error handler thread would become stuck in an
infinite recovery loop and lockup the system, necessitating a reboot.

Once the SAS layer is detached, the driver will fail any subsequent
commands since the target devices are removed.  However, removing the
SCSI host generates a SYNCHRONIZE CACHE (10) command, which was failed
and left the error handler no method of recovery.

This patch simply removes the SCSI host first so that no more commands
can come down, prior to cleaning up the SAS layer.  Note that the stack
is built up with the SCSI host first, and then the SAS layer.  Perhaps
it should be reversed for symmetry, so that commands cannot be sent to
the pm80xx driver prior to attaching the SAS layer?

What was really strange about this bug was that it was introduced at
commit cff549e4860f ("[SCSI]: proper state checking and module refcount
handling in scsi_device_get").  This commit appears to tinker with how
the reference counting is performed for SCSI device objects.  My theory
is that prior to this commit, the refcount for a device object was
blindly incremented at some point during the teardown process which
coincidentially made the device stick around during the procedure, which
also coincidentially made any commands sent to the driver not fail
(since the device was technically still "there").  After this commit was
applied, my theory is the refcount for the device object is not being
incremented at a specific point anymore, which makes the device go away,
and thus made the pm80xx driver fail any subsequent commands.

You may also want to see the following for more details:

[1] http://www.spinics.net/lists/linux-scsi/msg37208.html
[2] http://marc.info/?l=linux-scsi&m=144416476406993&w=2

Signed-off-by: Benjamin Rood 
---
 drivers/scsi/pm8001/pm8001_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/pm8001/pm8001_init.c 
b/drivers/scsi/pm8001/pm8001_init.c
index bdc624f..b1b2dd7 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -1096,10 +1096,10 @@ static void pm8001_pci_remove(struct pci_dev *pdev)
struct pm8001_hba_info *pm8001_ha;
int i, j;
pm8001_ha = sha->lldd_ha;
+   scsi_remove_host(pm8001_ha->shost);
sas_unregister_ha(sha);
sas_remove_host(pm8001_ha->shost);
list_del(&pm8001_ha->list);
-   scsi_remove_host(pm8001_ha->shost);
PM8001_CHIP_DISP->interrupt_disable(pm8001_ha, 0xFF);
PM8001_CHIP_DISP->chip_soft_rst(pm8001_ha);
 
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html