Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 02:47:54PM +0200, Johannes Thumshirn wrote:
> > John, mind giving that one a shot in your test setup as well?
>
> Well, don't mind. It doesn't work in my test setup.
>
> I'm back to the drawing board...

Please ignore my last mail, apparently it's a wise idea to do a git stash pop
after a git stash m)

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 02:36:11PM +0200, Jinpu Wang wrote:
> On Wed, Mar 29, 2017 at 2:26 PM, Johannes Thumshirn wrote:
> > On Wed, Mar 29, 2017 at 12:53:28PM +0100, John Garry wrote:
> >> On 29/03/2017 12:29, Johannes Thumshirn wrote:
> >> >On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:
> >> >>On 29/03/2017 10:41, Johannes Thumshirn wrote:
> >> >>>In the event of a SAS device unregister we have to wait for all
> >> >>>destruct works to be done so we don't accidentally delay deletion of
> >> >>>a SAS rphy or its children to the point when we're removing the SCSI
> >> >>>or SAS hosts.
> >> >>>
> >> >>>Signed-off-by: Johannes Thumshirn
> >> >>>---
> >> >>> drivers/scsi/libsas/sas_discover.c | 4 ++++
> >> >>> 1 file changed, 4 insertions(+)
> >> >>>
> >> >>>diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
> >> >>>index 60de662..75b18f1 100644
> >> >>>--- a/drivers/scsi/libsas/sas_discover.c
> >> >>>+++ b/drivers/scsi/libsas/sas_discover.c
> >> >>>@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
> >> >>> 	}
> >> >>>
> >> >>> 	if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
> >> >>>+		struct sas_discovery *disc = &dev->port->disc;
> >> >>>+		struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
> >> >>>+
> >> >>> 		sas_rphy_unlink(dev->rphy);
> >> >>> 		list_move_tail(&dev->disco_list_node, &port->destroy_list);
> >> >>> 		sas_discover_event(dev->port, DISCE_DESTRUCT);
> >> >>>+		flush_work(&sw->work);
> >> >>
> >> >>I quickly tested plugging out the expander and we never get past this
> >> >>call to flush - a hang results:
> >> >
> >> >Can you activate lockdep so we can see which lock it is that we're
> >> >blocking on?
> >> >
> >>
> >> I have it on:
> >> CONFIG_LOCKDEP_SUPPORT=y
> >> CONFIG_LOCKD=y
> >> CONFIG_LOCKD_V4=y
> >>
> >> >It's most likely in sas_unregister_common_dev() but this function takes
> >> >two spin locks, port->dev_list_lock and ha->lock.
> >> >
> >>
> >> We can see from the callstack I provided that we're working in workqueue
> >> scsi_wq_0 and trying to flush that same queue.
> >
> > Aaahh, now I get what's happening (with some kicks^Whelp from Hannes I
> > admit).
> >
> > The sas_unregister_dev() comes from the work queued by notify_phy_event().
> > So this patch must be replaced by (untested):
> >
> > diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
> > index cdbb293..e1e6492 100644
> > --- a/drivers/scsi/scsi_transport_sas.c
> > +++ b/drivers/scsi/scsi_transport_sas.c
> > @@ -375,6 +375,7 @@ void sas_remove_children(struct device *dev)
> >   */
> >  void sas_remove_host(struct Scsi_Host *shost)
> >  {
> > +	scsi_flush_work(shost);
> >  	sas_remove_children(&shost->shost_gendev);
> >  }
> >  EXPORT_SYMBOL(sas_remove_host);
> >
> > John, mind giving that one a shot in your test setup as well?

Well, don't mind. It doesn't work in my test setup.

I'm back to the drawing board...

Anyway, thanks,
	Johannes
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 2:26 PM, Johannes Thumshirn wrote:
> On Wed, Mar 29, 2017 at 12:53:28PM +0100, John Garry wrote:
>> On 29/03/2017 12:29, Johannes Thumshirn wrote:
>> >On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:
>> >>On 29/03/2017 10:41, Johannes Thumshirn wrote:
>> >>>In the event of a SAS device unregister we have to wait for all destruct
>> >>>works to be done so we don't accidentally delay deletion of a SAS rphy
>> >>>or its children to the point when we're removing the SCSI or SAS hosts.
>> >>>
>> >>>Signed-off-by: Johannes Thumshirn
>> >>>---
>> >>> drivers/scsi/libsas/sas_discover.c | 4 ++++
>> >>> 1 file changed, 4 insertions(+)
>> >>>
>> >>>diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
>> >>>index 60de662..75b18f1 100644
>> >>>--- a/drivers/scsi/libsas/sas_discover.c
>> >>>+++ b/drivers/scsi/libsas/sas_discover.c
>> >>>@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
>> >>> 	}
>> >>>
>> >>> 	if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
>> >>>+		struct sas_discovery *disc = &dev->port->disc;
>> >>>+		struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
>> >>>+
>> >>> 		sas_rphy_unlink(dev->rphy);
>> >>> 		list_move_tail(&dev->disco_list_node, &port->destroy_list);
>> >>> 		sas_discover_event(dev->port, DISCE_DESTRUCT);
>> >>>+		flush_work(&sw->work);
>> >>
>> >>I quickly tested plugging out the expander and we never get past this call
>> >>to flush - a hang results:
>> >
>> >Can you activate lockdep so we can see which lock it is that we're blocking
>> >on?
>> >
>>
>> I have it on:
>> CONFIG_LOCKDEP_SUPPORT=y
>> CONFIG_LOCKD=y
>> CONFIG_LOCKD_V4=y
>>
>> >It's most likely in sas_unregister_common_dev() but this function takes two
>> >spin locks, port->dev_list_lock and ha->lock.
>> >
>>
>> We can see from the callstack I provided that we're working in workqueue
>> scsi_wq_0 and trying to flush that same queue.
>
> Aaahh, now I get what's happening (with some kicks^Whelp from Hannes I
> admit).
>
> The sas_unregister_dev() comes from the work queued by notify_phy_event().
> So this patch must be replaced by (untested):
>
> diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
> index cdbb293..e1e6492 100644
> --- a/drivers/scsi/scsi_transport_sas.c
> +++ b/drivers/scsi/scsi_transport_sas.c
> @@ -375,6 +375,7 @@ void sas_remove_children(struct device *dev)
>   */
>  void sas_remove_host(struct Scsi_Host *shost)
>  {
> +	scsi_flush_work(shost);
>  	sas_remove_children(&shost->shost_gendev);
>  }
>  EXPORT_SYMBOL(sas_remove_host);
>
> John, mind giving that one a shot in your test setup as well?
>
> Thanks,
> 	Johannes

Haha, I had the same idea :)
I have no test env, so it would be great if John could test it.

-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.w...@profitbricks.com
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 12:53:28PM +0100, John Garry wrote:
> On 29/03/2017 12:29, Johannes Thumshirn wrote:
> >On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:
> >>On 29/03/2017 10:41, Johannes Thumshirn wrote:
> >>>In the event of a SAS device unregister we have to wait for all destruct
> >>>works to be done so we don't accidentally delay deletion of a SAS rphy
> >>>or its children to the point when we're removing the SCSI or SAS hosts.
> >>>
> >>>Signed-off-by: Johannes Thumshirn
> >>>---
> >>> drivers/scsi/libsas/sas_discover.c | 4 ++++
> >>> 1 file changed, 4 insertions(+)
> >>>
> >>>diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
> >>>index 60de662..75b18f1 100644
> >>>--- a/drivers/scsi/libsas/sas_discover.c
> >>>+++ b/drivers/scsi/libsas/sas_discover.c
> >>>@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
> >>> 	}
> >>>
> >>> 	if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
> >>>+		struct sas_discovery *disc = &dev->port->disc;
> >>>+		struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
> >>>+
> >>> 		sas_rphy_unlink(dev->rphy);
> >>> 		list_move_tail(&dev->disco_list_node, &port->destroy_list);
> >>> 		sas_discover_event(dev->port, DISCE_DESTRUCT);
> >>>+		flush_work(&sw->work);
> >>
> >>I quickly tested plugging out the expander and we never get past this call
> >>to flush - a hang results:
> >
> >Can you activate lockdep so we can see which lock it is that we're blocking
> >on?
> >
>
> I have it on:
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_LOCKD=y
> CONFIG_LOCKD_V4=y
>
> >It's most likely in sas_unregister_common_dev() but this function takes two
> >spin locks, port->dev_list_lock and ha->lock.
> >
>
> We can see from the callstack I provided that we're working in workqueue
> scsi_wq_0 and trying to flush that same queue.

Aaahh, now I get what's happening (with some kicks^Whelp from Hannes I
admit).

The sas_unregister_dev() comes from the work queued by notify_phy_event(). So
this patch must be replaced by (untested):

diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index cdbb293..e1e6492 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -375,6 +375,7 @@ void sas_remove_children(struct device *dev)
  */
 void sas_remove_host(struct Scsi_Host *shost)
 {
+	scsi_flush_work(shost);
 	sas_remove_children(&shost->shost_gendev);
 }
 EXPORT_SYMBOL(sas_remove_host);

John, mind giving that one a shot in your test setup as well?

Thanks,
	Johannes
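[Editorial note: the ordering that the proposed scsi_flush_work() call is meant to enforce, namely draining everything already queued before tearing down the children, can be sketched in userspace. Python's ThreadPoolExecutor stands in for the per-host SCSI workqueue here; every name below is illustrative, not kernel API.]

```python
import concurrent.futures

# A one-worker pool stands in for the per-host SCSI workqueue.
wq = concurrent.futures.ThreadPoolExecutor(max_workers=1)
log = []

# Pending "destruct" work, analogous to what notify_phy_event() queues.
pending = [wq.submit(log.append, "destroyed dev%d" % i) for i in range(3)]

def remove_host():
    # Analogue of scsi_flush_work(shost): wait from ordinary caller
    # context (we are NOT one of the pool's workers) for everything
    # already queued to finish...
    concurrent.futures.wait(pending)
    # ...and only then remove the children.
    log.append("children removed")

remove_host()
print(log)  # all destruct work completes before "children removed"
```

The crucial difference from the flush_work() attempt in the original patch is who does the waiting: here the flush happens on the caller's own thread, so the pool's worker stays free to run the queued items.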
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On 29/03/2017 12:29, Johannes Thumshirn wrote:
> On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:
>> On 29/03/2017 10:41, Johannes Thumshirn wrote:
>>> In the event of a SAS device unregister we have to wait for all destruct
>>> works to be done so we don't accidentally delay deletion of a SAS rphy
>>> or its children to the point when we're removing the SCSI or SAS hosts.
>>>
>>> Signed-off-by: Johannes Thumshirn
>>> ---
>>>  drivers/scsi/libsas/sas_discover.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
>>> index 60de662..75b18f1 100644
>>> --- a/drivers/scsi/libsas/sas_discover.c
>>> +++ b/drivers/scsi/libsas/sas_discover.c
>>> @@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
>>>  	}
>>>
>>>  	if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
>>> +		struct sas_discovery *disc = &dev->port->disc;
>>> +		struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
>>> +
>>>  		sas_rphy_unlink(dev->rphy);
>>>  		list_move_tail(&dev->disco_list_node, &port->destroy_list);
>>>  		sas_discover_event(dev->port, DISCE_DESTRUCT);
>>> +		flush_work(&sw->work);
>>
>> I quickly tested plugging out the expander and we never get past this call
>> to flush - a hang results:
>
> Can you activate lockdep so we can see which lock it is that we're blocking
> on?

I have it on:
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y

> It's most likely in sas_unregister_common_dev() but this function takes two
> spin locks, port->dev_list_lock and ha->lock.

We can see from the callstack I provided that we're working in workqueue
scsi_wq_0 and trying to flush that same queue.

Much appreciated,
John

> Thanks a lot,
> 	Johannes
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:
> On 29/03/2017 10:41, Johannes Thumshirn wrote:
> >In the event of a SAS device unregister we have to wait for all destruct
> >works to be done so we don't accidentally delay deletion of a SAS rphy
> >or its children to the point when we're removing the SCSI or SAS hosts.
> >
> >Signed-off-by: Johannes Thumshirn
> >---
> > drivers/scsi/libsas/sas_discover.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> >diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
> >index 60de662..75b18f1 100644
> >--- a/drivers/scsi/libsas/sas_discover.c
> >+++ b/drivers/scsi/libsas/sas_discover.c
> >@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
> > 	}
> >
> > 	if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
> >+		struct sas_discovery *disc = &dev->port->disc;
> >+		struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
> >+
> > 		sas_rphy_unlink(dev->rphy);
> > 		list_move_tail(&dev->disco_list_node, &port->destroy_list);
> > 		sas_discover_event(dev->port, DISCE_DESTRUCT);
> >+		flush_work(&sw->work);
>
> I quickly tested plugging out the expander and we never get past this call
> to flush - a hang results:

Can you activate lockdep so we can see which lock it is that we're blocking
on?

It's most likely in sas_unregister_common_dev() but this function takes two
spin locks, port->dev_list_lock and ha->lock.

Thanks a lot,
	Johannes
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On 29/03/2017 10:41, Johannes Thumshirn wrote:
> In the event of a SAS device unregister we have to wait for all destruct
> works to be done so we don't accidentally delay deletion of a SAS rphy or
> its children to the point when we're removing the SCSI or SAS hosts.
>
> Signed-off-by: Johannes Thumshirn
> ---
>  drivers/scsi/libsas/sas_discover.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
> index 60de662..75b18f1 100644
> --- a/drivers/scsi/libsas/sas_discover.c
> +++ b/drivers/scsi/libsas/sas_discover.c
> @@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
>  	}
>
>  	if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
> +		struct sas_discovery *disc = &dev->port->disc;
> +		struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
> +
>  		sas_rphy_unlink(dev->rphy);
>  		list_move_tail(&dev->disco_list_node, &port->destroy_list);
>  		sas_discover_event(dev->port, DISCE_DESTRUCT);
> +		flush_work(&sw->work);
>  	}
>  }

I quickly tested plugging out the expander and we never get past this call
to flush - a hang results:

root@(none)$
[  243.357088] INFO: task kworker/u32:1:106 blocked for more than 120 seconds.
[  243.364030]       Not tainted 4.11.0-rc1-13687-g2562e6a-dirty #1388
[  243.370282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.378086] kworker/u32:1   D    0   106      2 0x
[  243.383566] Workqueue: scsi_wq_0 sas_phye_loss_of_signal
[  243.388863] Call trace:
[  243.391314] [] __switch_to+0xa4/0xb0
[  243.396442] [] __schedule+0x1b4/0x5d0
[  243.401654] [] schedule+0x38/0x9c
[  243.406520] [] schedule_timeout+0x194/0x294
[  243.412249] [] wait_for_common+0xb0/0x144
[  243.417805] [] wait_for_completion+0x14/0x1c
[  243.423623] [] flush_work+0xe0/0x1a8
[  243.428747] [] sas_unregister_dev+0xf8/0x110
[  243.434563] [] sas_unregister_domain_devices+0x4c/0xc8
[  243.441242] [] sas_deform_port+0x14c/0x15c
[  243.446886] [] sas_phye_loss_of_signal+0x48/0x54
[  243.453048] [] process_one_work+0x138/0x2d8
[  243.458776] [] worker_thread+0x58/0x424
[  243.464161] [] kthread+0xf4/0x120
[  243.469024] [] ret_from_fork+0x10/0x50
[  364.189094] INFO: task kworker/u32:1:106 blocked for more than 120 seconds.
[  364.196035]       Not tainted 4.11.0-rc1-13687-g2562e6a-dirty #1388
[  364.202281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  364.210085] kworker/u32:1   D    0   106      2 0x
[  364.215558] Workqueue: scsi_wq_0 sas_phye_loss_of_signal
[  364.220855] Call trace:
[  364.223303] [] __switch_to+0xa4/0xb0
[  364.228428] [] __schedule+0x1b4/0x5d0
[  364.233640] [] schedule+0x38/0x9c
[  364.238506] [] schedule_timeout+0x194/0x294
[  364.244237] [] wait_for_common+0xb0/0x144
[  364.249793] [] wait_for_completion+0x14/0x1c
[  364.255610] [] flush_work+0xe0/0x1a8
[  364.260736] [] sas_unregister_dev+0xf8/0x110
[  364.266551] [] sas_unregister_domain_devices+0x4c/0xc8
[  364.273230] [] sas_deform_port+0x14c/0x15c
[  364.278872] [] sas_phye_loss_of_signal+0x48/0x54
[  364.285034] [] process_one_work+0x138/0x2d8
[  364.290763] [] worker_thread+0x58/0x424
[  364.296147] [] kthread+0xf4/0x120
[  364.301013] [] ret_from_fork+0x10/0x50

Is the issue that we are trying to flush the queue when we are working in
the same queue context?

Thanks,
John
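[Editorial note: the self-deadlock John asks about, a work item flushing the very queue it runs on, can be sketched in userspace with Python's concurrent.futures. A single-worker pool stands in for the workqueue kthread and a timeout plays the role of the hung-task watchdog; the names below are illustrative, not kernel API.]

```python
import concurrent.futures

# A one-worker pool models a single workqueue kthread (e.g. scsi_wq_0).
wq = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def destruct():
    # Stand-in for the DISCE_DESTRUCT work item.
    return "destructed"

def unregister_dev():
    # Runs *on* the pool's only worker.  Queue the destruct work on the
    # same pool, then wait for it: the worker is busy running us, so the
    # wait can never be satisfied.  The timeout stands in for the
    # hung-task watchdog; a real flush_work() here would block forever.
    fut = wq.submit(destruct)
    try:
        return fut.result(timeout=0.5)
    except concurrent.futures.TimeoutError:
        return "hung: flushing our own queue"

print(wq.submit(unregister_dev).result())  # -> hung: flushing our own queue
```

Flushing the same pool from a thread outside it completes normally, which is why moving the flush out of the worker context (the sas_remove_host() proposal earlier in the thread) was the direction explored, even though it did not pan out in Johannes's test setup.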