Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 02:47:54PM +0200, Johannes Thumshirn wrote: > > > John, mind giving that one a shot in your test setup as well? > > Well, don't mind. It doesn't work in my test setup. > > I'm back to the drawing board... Please ignore my last mail, apparently it's a wise idea to do a git stash pop after a git stash m) -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 02:36:11PM +0200, Jinpu Wang wrote: > On Wed, Mar 29, 2017 at 2:26 PM, Johannes Thumshirn > wrote: > > On Wed, Mar 29, 2017 at 12:53:28PM +0100, John Garry wrote: > >> On 29/03/2017 12:29, Johannes Thumshirn wrote: > >> >On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote: > >> >>On 29/03/2017 10:41, Johannes Thumshirn wrote: > >> >>>In the advent of an SAS device unregister we have to wait for all > >> >>>destruct > >> >>>works to be done to not accidently delay deletion of a SAS rphy or it's > >> >>>children to the point when we're removing the SCSI or SAS hosts. > >> >>> > >> >>>Signed-off-by: Johannes Thumshirn > >> >>>--- > >> >>>drivers/scsi/libsas/sas_discover.c | 4 > >> >>>1 file changed, 4 insertions(+) > >> >>> > >> >>>diff --git a/drivers/scsi/libsas/sas_discover.c > >> >>>b/drivers/scsi/libsas/sas_discover.c > >> >>>index 60de662..75b18f1 100644 > >> >>>--- a/drivers/scsi/libsas/sas_discover.c > >> >>>+++ b/drivers/scsi/libsas/sas_discover.c > >> >>>@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, > >> >>>struct domain_device *dev) > >> >>> } > >> >>> > >> >>> if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) { > >> >>>+ struct sas_discovery *disc = &dev->port->disc; > >> >>>+ struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work; > >> >>>+ > >> >>> sas_rphy_unlink(dev->rphy); > >> >>> list_move_tail(&dev->disco_list_node, &port->destroy_list); > >> >>> sas_discover_event(dev->port, DISCE_DESTRUCT); > >> >>>+ flush_work(&sw->work); > >> >> > >> >>I quickly tested plugging out the expander and we never get past this > >> >>call > >> >>to flush - a hang results: > >> > > >> >Can you activat lockdep so we can see which lock it is that we're > >> >blocking on? > >> > > >> > >> I have it on: > >> CONFIG_LOCKDEP_SUPPORT=y > >> CONFIG_LOCKD=y > >> CONFIG_LOCKD_V4=y > >> > >> >It's most likely in sas_unregister_common_dev() but this function takes > >> >two spin > >> >locks, port->dev_list_lock and ha->lock. > >> > > >> > >> We can see from the callstack I provided that we're working in workqueue > >> scsi_wq_0 and trying to flush that same queue. > > > > Aaahh, now I get what's happening (with some kicks^Whelp from Hannes I > > admit). > > > > The sas_unregister_dev() comes from the work queued by notify_phy_event(). > > So this patch must be > > replaced by (untested): > > > > diff --git a/drivers/scsi/scsi_transport_sas.c > > b/drivers/scsi/scsi_transport_sas.c > > index cdbb293..e1e6492 100644 > > --- a/drivers/scsi/scsi_transport_sas.c > > +++ b/drivers/scsi/scsi_transport_sas.c > > @@ -375,6 +375,7 @@ void sas_remove_children(struct device *dev) > > */ > > void sas_remove_host(struct Scsi_Host *shost) > > { > > + scsi_flush_work(shost); > > sas_remove_children(&shost->shost_gendev); > > } > > EXPORT_SYMBOL(sas_remove_host); > > > > John, mind giving that one a shot in your test setup as well? Well, don't mind. It doesn't work in my test setup. I'm back to the drawing board... Anyways thanks, Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 2:26 PM, Johannes Thumshirn wrote: > On Wed, Mar 29, 2017 at 12:53:28PM +0100, John Garry wrote: >> On 29/03/2017 12:29, Johannes Thumshirn wrote: >> >On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote: >> >>On 29/03/2017 10:41, Johannes Thumshirn wrote: >> >>>In the advent of an SAS device unregister we have to wait for all destruct >> >>>works to be done to not accidently delay deletion of a SAS rphy or it's >> >>>children to the point when we're removing the SCSI or SAS hosts. >> >>> >> >>>Signed-off-by: Johannes Thumshirn >> >>>--- >> >>>drivers/scsi/libsas/sas_discover.c | 4 >> >>>1 file changed, 4 insertions(+) >> >>> >> >>>diff --git a/drivers/scsi/libsas/sas_discover.c >> >>>b/drivers/scsi/libsas/sas_discover.c >> >>>index 60de662..75b18f1 100644 >> >>>--- a/drivers/scsi/libsas/sas_discover.c >> >>>+++ b/drivers/scsi/libsas/sas_discover.c >> >>>@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, >> >>>struct domain_device *dev) >> >>> } >> >>> >> >>> if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) { >> >>>+ struct sas_discovery *disc = &dev->port->disc; >> >>>+ struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work; >> >>>+ >> >>> sas_rphy_unlink(dev->rphy); >> >>> list_move_tail(&dev->disco_list_node, &port->destroy_list); >> >>> sas_discover_event(dev->port, DISCE_DESTRUCT); >> >>>+ flush_work(&sw->work); >> >> >> >>I quickly tested plugging out the expander and we never get past this call >> >>to flush - a hang results: >> > >> >Can you activat lockdep so we can see which lock it is that we're blocking >> >on? >> > >> >> I have it on: >> CONFIG_LOCKDEP_SUPPORT=y >> CONFIG_LOCKD=y >> CONFIG_LOCKD_V4=y >> >> >It's most likely in sas_unregister_common_dev() but this function takes two >> >spin >> >locks, port->dev_list_lock and ha->lock. >> > >> >> We can see from the callstack I provided that we're working in workqueue >> scsi_wq_0 and trying to flush that same queue. > > Aaahh, now I get what's happening (with some kicks^Whelp from Hannes I admit). > > The sas_unregister_dev() comes from the work queued by notify_phy_event(). So > this patch must be > replaced by (untested): > > diff --git a/drivers/scsi/scsi_transport_sas.c > b/drivers/scsi/scsi_transport_sas.c > index cdbb293..e1e6492 100644 > --- a/drivers/scsi/scsi_transport_sas.c > +++ b/drivers/scsi/scsi_transport_sas.c > @@ -375,6 +375,7 @@ void sas_remove_children(struct device *dev) > */ > void sas_remove_host(struct Scsi_Host *shost) > { > + scsi_flush_work(shost); > sas_remove_children(&shost->shost_gendev); > } > EXPORT_SYMBOL(sas_remove_host); > > John, mind giving that one a shot in your test setup as well? > > Thanks, > Johannes > > -- > Johannes Thumshirn Storage > jthumsh...@suse.de+49 911 74053 689 > SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg > GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) > Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 Haha, I have same idea :) Have no test env, so if John could test it, it will be great. -- Jack Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email:jinpu.w...@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 12:53:28PM +0100, John Garry wrote: > On 29/03/2017 12:29, Johannes Thumshirn wrote: > >On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote: > >>On 29/03/2017 10:41, Johannes Thumshirn wrote: > >>>In the advent of an SAS device unregister we have to wait for all destruct > >>>works to be done to not accidently delay deletion of a SAS rphy or it's > >>>children to the point when we're removing the SCSI or SAS hosts. > >>> > >>>Signed-off-by: Johannes Thumshirn > >>>--- > >>>drivers/scsi/libsas/sas_discover.c | 4 > >>>1 file changed, 4 insertions(+) > >>> > >>>diff --git a/drivers/scsi/libsas/sas_discover.c > >>>b/drivers/scsi/libsas/sas_discover.c > >>>index 60de662..75b18f1 100644 > >>>--- a/drivers/scsi/libsas/sas_discover.c > >>>+++ b/drivers/scsi/libsas/sas_discover.c > >>>@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, > >>>struct domain_device *dev) > >>> } > >>> > >>> if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) { > >>>+ struct sas_discovery *disc = &dev->port->disc; > >>>+ struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work; > >>>+ > >>> sas_rphy_unlink(dev->rphy); > >>> list_move_tail(&dev->disco_list_node, &port->destroy_list); > >>> sas_discover_event(dev->port, DISCE_DESTRUCT); > >>>+ flush_work(&sw->work); > >> > >>I quickly tested plugging out the expander and we never get past this call > >>to flush - a hang results: > > > >Can you activat lockdep so we can see which lock it is that we're blocking > >on? > > > > I have it on: > CONFIG_LOCKDEP_SUPPORT=y > CONFIG_LOCKD=y > CONFIG_LOCKD_V4=y > > >It's most likely in sas_unregister_common_dev() but this function takes two > >spin > >locks, port->dev_list_lock and ha->lock. > > > > We can see from the callstack I provided that we're working in workqueue > scsi_wq_0 and trying to flush that same queue. Aaahh, now I get what's happening (with some kicks^Whelp from Hannes I admit). The sas_unregister_dev() comes from the work queued by notify_phy_event(). So this patch must be replaced by (untested): diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c index cdbb293..e1e6492 100644 --- a/drivers/scsi/scsi_transport_sas.c +++ b/drivers/scsi/scsi_transport_sas.c @@ -375,6 +375,7 @@ void sas_remove_children(struct device *dev) */ void sas_remove_host(struct Scsi_Host *shost) { + scsi_flush_work(shost); sas_remove_children(&shost->shost_gendev); } EXPORT_SYMBOL(sas_remove_host); John, mind giving that one a shot in your test setup as well? Thanks, Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On 29/03/2017 12:29, Johannes Thumshirn wrote: On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote: On 29/03/2017 10:41, Johannes Thumshirn wrote: In the advent of an SAS device unregister we have to wait for all destruct works to be done to not accidently delay deletion of a SAS rphy or it's children to the point when we're removing the SCSI or SAS hosts. Signed-off-by: Johannes Thumshirn --- drivers/scsi/libsas/sas_discover.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c index 60de662..75b18f1 100644 --- a/drivers/scsi/libsas/sas_discover.c +++ b/drivers/scsi/libsas/sas_discover.c @@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev) } if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) { + struct sas_discovery *disc = &dev->port->disc; + struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work; + sas_rphy_unlink(dev->rphy); list_move_tail(&dev->disco_list_node, &port->destroy_list); sas_discover_event(dev->port, DISCE_DESTRUCT); + flush_work(&sw->work); I quickly tested plugging out the expander and we never get past this call to flush - a hang results: Can you activat lockdep so we can see which lock it is that we're blocking on? I have it on: CONFIG_LOCKDEP_SUPPORT=y CONFIG_LOCKD=y CONFIG_LOCKD_V4=y It's most likely in sas_unregister_common_dev() but this function takes two spin locks, port->dev_list_lock and ha->lock. We can see from the callstack I provided that we're working in workqueue scsi_wq_0 and trying to flush that same queue. Much appreciated, John Thanks a lot, Johannes
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote: > On 29/03/2017 10:41, Johannes Thumshirn wrote: > >In the advent of an SAS device unregister we have to wait for all destruct > >works to be done to not accidently delay deletion of a SAS rphy or it's > >children to the point when we're removing the SCSI or SAS hosts. > > > >Signed-off-by: Johannes Thumshirn > >--- > > drivers/scsi/libsas/sas_discover.c | 4 > > 1 file changed, 4 insertions(+) > > > >diff --git a/drivers/scsi/libsas/sas_discover.c > >b/drivers/scsi/libsas/sas_discover.c > >index 60de662..75b18f1 100644 > >--- a/drivers/scsi/libsas/sas_discover.c > >+++ b/drivers/scsi/libsas/sas_discover.c > >@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, > >struct domain_device *dev) > > } > > > > if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) { > >+struct sas_discovery *disc = &dev->port->disc; > >+struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work; > >+ > > sas_rphy_unlink(dev->rphy); > > list_move_tail(&dev->disco_list_node, &port->destroy_list); > > sas_discover_event(dev->port, DISCE_DESTRUCT); > >+flush_work(&sw->work); > > I quickly tested plugging out the expander and we never get past this call > to flush - a hang results: Can you activat lockdep so we can see which lock it is that we're blocking on? It's most likely in sas_unregister_common_dev() but this function takes two spin locks, port->dev_list_lock and ha->lock. Thanks a lot, Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
On 29/03/2017 10:41, Johannes Thumshirn wrote: In the advent of an SAS device unregister we have to wait for all destruct works to be done to not accidently delay deletion of a SAS rphy or it's children to the point when we're removing the SCSI or SAS hosts. Signed-off-by: Johannes Thumshirn --- drivers/scsi/libsas/sas_discover.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c index 60de662..75b18f1 100644 --- a/drivers/scsi/libsas/sas_discover.c +++ b/drivers/scsi/libsas/sas_discover.c @@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev) } if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) { + struct sas_discovery *disc = &dev->port->disc; + struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work; + sas_rphy_unlink(dev->rphy); list_move_tail(&dev->disco_list_node, &port->destroy_list); sas_discover_event(dev->port, DISCE_DESTRUCT); + flush_work(&sw->work); I quickly tested plugging out the expander and we never get past this call to flush - a hang results: root@(none)$ [ 243.357088] INFO: task kworker/u32:1:106 blocked for more than 120 seconds. [ 243.364030] Not tainted 4.11.0-rc1-13687-g2562e6a-dirty #1388 [ 243.370282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 243.378086] kworker/u32:1 D0 106 2 0x [ 243.383566] Workqueue: scsi_wq_0 sas_phye_loss_of_signal [ 243.388863] Call trace: [ 243.391314] [] __switch_to+0xa4/0xb0 [ 243.396442] [] __schedule+0x1b4/0x5d0 [ 243.401654] [] schedule+0x38/0x9c [ 243.406520] [] schedule_timeout+0x194/0x294 [ 243.412249] [] wait_for_common+0xb0/0x144 [ 243.417805] [] wait_for_completion+0x14/0x1c [ 243.423623] [] flush_work+0xe0/0x1a8 [ 243.428747] [] sas_unregister_dev+0xf8/0x110 [ 243.434563] [] sas_unregister_domain_devices+0x4c/0xc8 [ 243.441242] [] sas_deform_port+0x14c/0x15c [ 243.446886] [] sas_phye_loss_of_signal+0x48/0x54 [ 243.453048] [] process_one_work+0x138/0x2d8 [ 243.458776] [] worker_thread+0x58/0x424 [ 243.464161] [] kthread+0xf4/0x120 [ 243.469024] [] ret_from_fork+0x10/0x50 [ 364.189094] INFO: task kworker/u32:1:106 blocked for more than 120 seconds. [ 364.196035] Not tainted 4.11.0-rc1-13687-g2562e6a-dirty #1388 [ 364.202281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 364.210085] kworker/u32:1 D0 106 2 0x [ 364.215558] Workqueue: scsi_wq_0 sas_phye_loss_of_signal [ 364.220855] Call trace: [ 364.223303] [] __switch_to+0xa4/0xb0 [ 364.228428] [] __schedule+0x1b4/0x5d0 [ 364.233640] [] schedule+0x38/0x9c [ 364.238506] [] schedule_timeout+0x194/0x294 [ 364.244237] [] wait_for_common+0xb0/0x144 [ 364.249793] [] wait_for_completion+0x14/0x1c [ 364.255610] [] flush_work+0xe0/0x1a8 [ 364.260736] [] sas_unregister_dev+0xf8/0x110 [ 364.266551] [] sas_unregister_domain_devices+0x4c/0xc8 [ 364.273230] [] sas_deform_port+0x14c/0x15c [ 364.278872] [] sas_phye_loss_of_signal+0x48/0x54 [ 364.285034] [] process_one_work+0x138/0x2d8 [ 364.290763] [] worker_thread+0x58/0x424 [ 364.296147] [] kthread+0xf4/0x120 [ 364.301013] [] ret_from_fork+0x10/0x50 Is the issue that we are trying to flush the queue when we are working in the same queue context? Thanks, John } }
[PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister
In the advent of an SAS device unregister we have to wait for all destruct works to be done to not accidently delay deletion of a SAS rphy or it's children to the point when we're removing the SCSI or SAS hosts. Signed-off-by: Johannes Thumshirn --- drivers/scsi/libsas/sas_discover.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c index 60de662..75b18f1 100644 --- a/drivers/scsi/libsas/sas_discover.c +++ b/drivers/scsi/libsas/sas_discover.c @@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev) } if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) { + struct sas_discovery *disc = &dev->port->disc; + struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work; + sas_rphy_unlink(dev->rphy); list_move_tail(&dev->disco_list_node, &port->destroy_list); sas_discover_event(dev->port, DISCE_DESTRUCT); + flush_work(&sw->work); } } -- 1.8.5.6