Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister

2017-03-29 Thread Johannes Thumshirn
On Wed, Mar 29, 2017 at 02:47:54PM +0200, Johannes Thumshirn wrote:
> > > John, mind giving that one a shot in your test setup as well?
> 
> Well, don't mind. It doesn't work in my test setup.
> 
> I'm back to the drawing board...

Please ignore my last mail, apparently it's a wise idea to do a git stash pop
after a git stash m)

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister

2017-03-29 Thread Johannes Thumshirn
On Wed, Mar 29, 2017 at 02:36:11PM +0200, Jinpu Wang wrote:
> On Wed, Mar 29, 2017 at 2:26 PM, Johannes Thumshirn  
> wrote:
> > On Wed, Mar 29, 2017 at 12:53:28PM +0100, John Garry wrote:
> >> On 29/03/2017 12:29, Johannes Thumshirn wrote:
> >> >On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:
> >> >>On 29/03/2017 10:41, Johannes Thumshirn wrote:
> >> >>>In the advent of an SAS device unregister we have to wait for all 
> >> >>>destruct
> >> >>>works to be done to not accidently delay deletion of a SAS rphy or it's
> >> >>>children to the point when we're removing the SCSI or SAS hosts.
> >> >>>
> >> >>>Signed-off-by: Johannes Thumshirn 
> >> >>>---
> >> >>>drivers/scsi/libsas/sas_discover.c | 4 
> >> >>>1 file changed, 4 insertions(+)
> >> >>>
> >> >>>diff --git a/drivers/scsi/libsas/sas_discover.c 
> >> >>>b/drivers/scsi/libsas/sas_discover.c
> >> >>>index 60de662..75b18f1 100644
> >> >>>--- a/drivers/scsi/libsas/sas_discover.c
> >> >>>+++ b/drivers/scsi/libsas/sas_discover.c
> >> >>>@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, 
> >> >>>struct domain_device *dev)
> >> >>>   }
> >> >>>
> >> >>>   if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
> >> >>>+  struct sas_discovery *disc = &dev->port->disc;
> >> >>>+  struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
> >> >>>+
> >> >>>   sas_rphy_unlink(dev->rphy);
> >> >>>   list_move_tail(&dev->disco_list_node, &port->destroy_list);
> >> >>>   sas_discover_event(dev->port, DISCE_DESTRUCT);
> >> >>>+  flush_work(&sw->work);
> >> >>
> >> >>I quickly tested plugging out the expander and we never get past this 
> >> >>call
> >> >>to flush - a hang results:
> >> >
> >> >Can you activat lockdep so we can see which lock it is that we're 
> >> >blocking on?
> >> >
> >>
> >> I have it on:
> >> CONFIG_LOCKDEP_SUPPORT=y
> >> CONFIG_LOCKD=y
> >> CONFIG_LOCKD_V4=y
> >>
> >> >It's most likely in sas_unregister_common_dev() but this function takes 
> >> >two spin
> >> >locks, port->dev_list_lock and ha->lock.
> >> >
> >>
> >> We can see from the callstack I provided that we're working in workqueue
> >> scsi_wq_0 and trying to flush that same queue.
> >
> > Aaahh, now I get what's happening (with some kicks^Whelp from Hannes I 
> > admit).
> >
> > The sas_unregister_dev() comes from the work queued by notify_phy_event(). 
> > So this patch must be
> > replaced by (untested):
> >
> > diff --git a/drivers/scsi/scsi_transport_sas.c 
> > b/drivers/scsi/scsi_transport_sas.c
> > index cdbb293..e1e6492 100644
> > --- a/drivers/scsi/scsi_transport_sas.c
> > +++ b/drivers/scsi/scsi_transport_sas.c
> > @@ -375,6 +375,7 @@ void sas_remove_children(struct device *dev)
> >   */
> >  void sas_remove_host(struct Scsi_Host *shost)
> >  {
> > +   scsi_flush_work(shost);
> > sas_remove_children(&shost->shost_gendev);
> >  }
> >  EXPORT_SYMBOL(sas_remove_host);
> >
> > John, mind giving that one a shot in your test setup as well?

Well, don't mind. It doesn't work in my test setup.

I'm back to the drawing board...

Anyways thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister

2017-03-29 Thread Jinpu Wang
On Wed, Mar 29, 2017 at 2:26 PM, Johannes Thumshirn  wrote:
> On Wed, Mar 29, 2017 at 12:53:28PM +0100, John Garry wrote:
>> On 29/03/2017 12:29, Johannes Thumshirn wrote:
>> >On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:
>> >>On 29/03/2017 10:41, Johannes Thumshirn wrote:
>> >>>In the advent of an SAS device unregister we have to wait for all destruct
>> >>>works to be done to not accidently delay deletion of a SAS rphy or it's
>> >>>children to the point when we're removing the SCSI or SAS hosts.
>> >>>
>> >>>Signed-off-by: Johannes Thumshirn 
>> >>>---
>> >>>drivers/scsi/libsas/sas_discover.c | 4 
>> >>>1 file changed, 4 insertions(+)
>> >>>
>> >>>diff --git a/drivers/scsi/libsas/sas_discover.c 
>> >>>b/drivers/scsi/libsas/sas_discover.c
>> >>>index 60de662..75b18f1 100644
>> >>>--- a/drivers/scsi/libsas/sas_discover.c
>> >>>+++ b/drivers/scsi/libsas/sas_discover.c
>> >>>@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, 
>> >>>struct domain_device *dev)
>> >>>   }
>> >>>
>> >>>   if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
>> >>>+  struct sas_discovery *disc = &dev->port->disc;
>> >>>+  struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
>> >>>+
>> >>>   sas_rphy_unlink(dev->rphy);
>> >>>   list_move_tail(&dev->disco_list_node, &port->destroy_list);
>> >>>   sas_discover_event(dev->port, DISCE_DESTRUCT);
>> >>>+  flush_work(&sw->work);
>> >>
>> >>I quickly tested plugging out the expander and we never get past this call
>> >>to flush - a hang results:
>> >
>> >Can you activat lockdep so we can see which lock it is that we're blocking 
>> >on?
>> >
>>
>> I have it on:
>> CONFIG_LOCKDEP_SUPPORT=y
>> CONFIG_LOCKD=y
>> CONFIG_LOCKD_V4=y
>>
>> >It's most likely in sas_unregister_common_dev() but this function takes two 
>> >spin
>> >locks, port->dev_list_lock and ha->lock.
>> >
>>
>> We can see from the callstack I provided that we're working in workqueue
>> scsi_wq_0 and trying to flush that same queue.
>
> Aaahh, now I get what's happening (with some kicks^Whelp from Hannes I admit).
>
> The sas_unregister_dev() comes from the work queued by notify_phy_event(). So 
> this patch must be
> replaced by (untested):
>
> diff --git a/drivers/scsi/scsi_transport_sas.c 
> b/drivers/scsi/scsi_transport_sas.c
> index cdbb293..e1e6492 100644
> --- a/drivers/scsi/scsi_transport_sas.c
> +++ b/drivers/scsi/scsi_transport_sas.c
> @@ -375,6 +375,7 @@ void sas_remove_children(struct device *dev)
>   */
>  void sas_remove_host(struct Scsi_Host *shost)
>  {
> +   scsi_flush_work(shost);
> sas_remove_children(&shost->shost_gendev);
>  }
>  EXPORT_SYMBOL(sas_remove_host);
>
> John, mind giving that one a shot in your test setup as well?
>
> Thanks,
> Johannes
>
> --
> Johannes Thumshirn  Storage
> jthumsh...@suse.de+49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Haha, I have same idea :)
Have no test env, so if John could test it, it will be great.
-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008  042
Fax:  +49 30 577 008 299
Email:jinpu.w...@profitbricks.com
URL:  https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss


Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister

2017-03-29 Thread Johannes Thumshirn
On Wed, Mar 29, 2017 at 12:53:28PM +0100, John Garry wrote:
> On 29/03/2017 12:29, Johannes Thumshirn wrote:
> >On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:
> >>On 29/03/2017 10:41, Johannes Thumshirn wrote:
> >>>In the advent of an SAS device unregister we have to wait for all destruct
> >>>works to be done to not accidently delay deletion of a SAS rphy or it's
> >>>children to the point when we're removing the SCSI or SAS hosts.
> >>>
> >>>Signed-off-by: Johannes Thumshirn 
> >>>---
> >>>drivers/scsi/libsas/sas_discover.c | 4 
> >>>1 file changed, 4 insertions(+)
> >>>
> >>>diff --git a/drivers/scsi/libsas/sas_discover.c 
> >>>b/drivers/scsi/libsas/sas_discover.c
> >>>index 60de662..75b18f1 100644
> >>>--- a/drivers/scsi/libsas/sas_discover.c
> >>>+++ b/drivers/scsi/libsas/sas_discover.c
> >>>@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, 
> >>>struct domain_device *dev)
> >>>   }
> >>>
> >>>   if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
> >>>+  struct sas_discovery *disc = &dev->port->disc;
> >>>+  struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
> >>>+
> >>>   sas_rphy_unlink(dev->rphy);
> >>>   list_move_tail(&dev->disco_list_node, &port->destroy_list);
> >>>   sas_discover_event(dev->port, DISCE_DESTRUCT);
> >>>+  flush_work(&sw->work);
> >>
> >>I quickly tested plugging out the expander and we never get past this call
> >>to flush - a hang results:
> >
> >Can you activat lockdep so we can see which lock it is that we're blocking 
> >on?
> >
> 
> I have it on:
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_LOCKD=y
> CONFIG_LOCKD_V4=y
> 
> >It's most likely in sas_unregister_common_dev() but this function takes two 
> >spin
> >locks, port->dev_list_lock and ha->lock.
> >
> 
> We can see from the callstack I provided that we're working in workqueue
> scsi_wq_0 and trying to flush that same queue.

Aaahh, now I get what's happening (with some kicks^Whelp from Hannes I admit).

The sas_unregister_dev() comes from the work queued by notify_phy_event(). So 
this patch must be
replaced by (untested):

diff --git a/drivers/scsi/scsi_transport_sas.c 
b/drivers/scsi/scsi_transport_sas.c
index cdbb293..e1e6492 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -375,6 +375,7 @@ void sas_remove_children(struct device *dev)
  */
 void sas_remove_host(struct Scsi_Host *shost)
 {
+   scsi_flush_work(shost);
sas_remove_children(&shost->shost_gendev);
 }
 EXPORT_SYMBOL(sas_remove_host);

John, mind giving that one a shot in your test setup as well?

Thanks,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister

2017-03-29 Thread John Garry

On 29/03/2017 12:29, Johannes Thumshirn wrote:

On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:

On 29/03/2017 10:41, Johannes Thumshirn wrote:

In the advent of an SAS device unregister we have to wait for all destruct
works to be done to not accidently delay deletion of a SAS rphy or it's
children to the point when we're removing the SCSI or SAS hosts.

Signed-off-by: Johannes Thumshirn 
---
drivers/scsi/libsas/sas_discover.c | 4 
1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/libsas/sas_discover.c 
b/drivers/scsi/libsas/sas_discover.c
index 60de662..75b18f1 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct 
domain_device *dev)
}

if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
+   struct sas_discovery *disc = &dev->port->disc;
+   struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
+
sas_rphy_unlink(dev->rphy);
list_move_tail(&dev->disco_list_node, &port->destroy_list);
sas_discover_event(dev->port, DISCE_DESTRUCT);
+   flush_work(&sw->work);


I quickly tested plugging out the expander and we never get past this call
to flush - a hang results:


Can you activat lockdep so we can see which lock it is that we're blocking on?



I have it on:
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y


It's most likely in sas_unregister_common_dev() but this function takes two spin
locks, port->dev_list_lock and ha->lock.



We can see from the callstack I provided that we're working in workqueue 
scsi_wq_0 and trying to flush that same queue.


Much appreciated,
John


Thanks a lot,
   Johannes






Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister

2017-03-29 Thread Johannes Thumshirn
On Wed, Mar 29, 2017 at 12:15:44PM +0100, John Garry wrote:
> On 29/03/2017 10:41, Johannes Thumshirn wrote:
> >In the advent of an SAS device unregister we have to wait for all destruct
> >works to be done to not accidently delay deletion of a SAS rphy or it's
> >children to the point when we're removing the SCSI or SAS hosts.
> >
> >Signed-off-by: Johannes Thumshirn 
> >---
> > drivers/scsi/libsas/sas_discover.c | 4 
> > 1 file changed, 4 insertions(+)
> >
> >diff --git a/drivers/scsi/libsas/sas_discover.c 
> >b/drivers/scsi/libsas/sas_discover.c
> >index 60de662..75b18f1 100644
> >--- a/drivers/scsi/libsas/sas_discover.c
> >+++ b/drivers/scsi/libsas/sas_discover.c
> >@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, 
> >struct domain_device *dev)
> > }
> >
> > if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
> >+struct sas_discovery *disc = &dev->port->disc;
> >+struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
> >+
> > sas_rphy_unlink(dev->rphy);
> > list_move_tail(&dev->disco_list_node, &port->destroy_list);
> > sas_discover_event(dev->port, DISCE_DESTRUCT);
> >+flush_work(&sw->work);
> 
> I quickly tested plugging out the expander and we never get past this call
> to flush - a hang results:

Can you activat lockdep so we can see which lock it is that we're blocking on?

It's most likely in sas_unregister_common_dev() but this function takes two spin
locks, port->dev_list_lock and ha->lock.

Thanks a lot,
   Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister

2017-03-29 Thread John Garry

On 29/03/2017 10:41, Johannes Thumshirn wrote:

In the advent of an SAS device unregister we have to wait for all destruct
works to be done to not accidently delay deletion of a SAS rphy or it's
children to the point when we're removing the SCSI or SAS hosts.

Signed-off-by: Johannes Thumshirn 
---
 drivers/scsi/libsas/sas_discover.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/libsas/sas_discover.c 
b/drivers/scsi/libsas/sas_discover.c
index 60de662..75b18f1 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct 
domain_device *dev)
}

if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
+   struct sas_discovery *disc = &dev->port->disc;
+   struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
+
sas_rphy_unlink(dev->rphy);
list_move_tail(&dev->disco_list_node, &port->destroy_list);
sas_discover_event(dev->port, DISCE_DESTRUCT);
+   flush_work(&sw->work);


I quickly tested plugging out the expander and we never get past this 
call to flush - a hang results:


root@(none)$ [  243.357088] INFO: task kworker/u32:1:106 blocked for 
more than 120 seconds.

[  243.364030]   Not tainted 4.11.0-rc1-13687-g2562e6a-dirty #1388
[  243.370282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  243.378086] kworker/u32:1   D0   106  2 0x
[  243.383566] Workqueue: scsi_wq_0 sas_phye_loss_of_signal
[  243.388863] Call trace:
[  243.391314] [] __switch_to+0xa4/0xb0
[  243.396442] [] __schedule+0x1b4/0x5d0
[  243.401654] [] schedule+0x38/0x9c
[  243.406520] [] schedule_timeout+0x194/0x294
[  243.412249] [] wait_for_common+0xb0/0x144
[  243.417805] [] wait_for_completion+0x14/0x1c
[  243.423623] [] flush_work+0xe0/0x1a8
[  243.428747] [] sas_unregister_dev+0xf8/0x110
[  243.434563] [] sas_unregister_domain_devices+0x4c/0xc8
[  243.441242] [] sas_deform_port+0x14c/0x15c
[  243.446886] [] sas_phye_loss_of_signal+0x48/0x54
[  243.453048] [] process_one_work+0x138/0x2d8
[  243.458776] [] worker_thread+0x58/0x424
[  243.464161] [] kthread+0xf4/0x120
[  243.469024] [] ret_from_fork+0x10/0x50
[  364.189094] INFO: task kworker/u32:1:106 blocked for more than 120 
seconds.

[  364.196035]   Not tainted 4.11.0-rc1-13687-g2562e6a-dirty #1388
[  364.202281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  364.210085] kworker/u32:1   D0   106  2 0x
[  364.215558] Workqueue: scsi_wq_0 sas_phye_loss_of_signal
[  364.220855] Call trace:
[  364.223303] [] __switch_to+0xa4/0xb0
[  364.228428] [] __schedule+0x1b4/0x5d0
[  364.233640] [] schedule+0x38/0x9c
[  364.238506] [] schedule_timeout+0x194/0x294
[  364.244237] [] wait_for_common+0xb0/0x144
[  364.249793] [] wait_for_completion+0x14/0x1c
[  364.255610] [] flush_work+0xe0/0x1a8
[  364.260736] [] sas_unregister_dev+0xf8/0x110
[  364.266551] [] sas_unregister_domain_devices+0x4c/0xc8
[  364.273230] [] sas_deform_port+0x14c/0x15c
[  364.278872] [] sas_phye_loss_of_signal+0x48/0x54
[  364.285034] [] process_one_work+0x138/0x2d8
[  364.290763] [] worker_thread+0x58/0x424
[  364.296147] [] kthread+0xf4/0x120
[  364.301013] [] ret_from_fork+0x10/0x50

Is the issue that we are trying to flush the queue when we are working 
in the same queue context?


Thanks,
John


}
 }







[PATCH 1/2] scsi: sas: flush destruct workqueue on device unregister

2017-03-29 Thread Johannes Thumshirn
In the advent of an SAS device unregister we have to wait for all destruct
works to be done to not accidently delay deletion of a SAS rphy or it's
children to the point when we're removing the SCSI or SAS hosts.

Signed-off-by: Johannes Thumshirn 
---
 drivers/scsi/libsas/sas_discover.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/libsas/sas_discover.c 
b/drivers/scsi/libsas/sas_discover.c
index 60de662..75b18f1 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -382,9 +382,13 @@ void sas_unregister_dev(struct asd_sas_port *port, struct 
domain_device *dev)
}
 
if (!test_and_set_bit(SAS_DEV_DESTROY, &dev->state)) {
+   struct sas_discovery *disc = &dev->port->disc;
+   struct sas_work *sw = &disc->disc_work[DISCE_DESTRUCT].work;
+
sas_rphy_unlink(dev->rphy);
list_move_tail(&dev->disco_list_node, &port->destroy_list);
sas_discover_event(dev->port, DISCE_DESTRUCT);
+   flush_work(&sw->work);
}
 }
 
-- 
1.8.5.6