在 2017/7/14 0:10, John Garry 写道:
> On 10/07/2017 08:06, Yijing Wang wrote:
>> Disco mutex was introudced to prevent domain rediscovery competing
>> with ata error handling(87c8331). If we have already hold the lock
>> in sas_revalidate_domain and sync executing probe, deadlock caused,
>> because, sas_probe_sata() also need hold disco_mutex. Since disco mutex
>> use to prevent revalidata domain happen during ata error handler,
>> it should be safe to release disco mutex when sync probe, because
>> no new revalidate domain event would be process until the sync return,
>> and the current sas revalidate domain finish.
>>
> 
> So with your changes we have a chain of synchronised events, running in 
> separate queues. In theory it sounds ok.
> 
> Then, as you said in the commit message, sas_revalidate_domain() holds the 
> disco mutex while *all* these chained events occur; so we will continue to 
> hold the mutex until we have revalidated the domain, meaning until we have 
> finished destroying or probing new devices.
> 
> But in the domain revalidation when you discover a new ATA device, function 
> sas_probe_sata() wants to grab the disco mutex and you just temporarily 
> release it, even though adding a new ATA device kicks in EH. This defeats the 
> principal of using a mutex at all, which is (according to 87c8331 commit 
> message) to mutually exclude the domain re-discovery (which has not actually 
> finished) and the ATA EH (and device destruction).
> 
> Anyway, since we are synchronising this series of events (broadcast event, 
> domain rediscovery, and device destruction), surely it should be possible to 
> include the ATA EH as well, so we can actually get rid of the disco mutex 
> finally, right?

Yes, disco mutex make this issue complex, I checked the commit history, Dan 
introudce disco mutex and probe/destruct discovery event, so it seems to
need a big rework to the libsas process logic, I am so sorry that I have no 
more time to deal with it, I will leave today, if you like, you could
rework my patchset or add additional changes based this patchset.



> 
> Note: I think that there is a problem which you have not seen. Consider 
> removing a ATA disk with IO active conncted to an expander:
> - LLDD sends brodcast event
> - sas_revalidate_domain(), which grabs disco mutex
> - revalidate finds dev is gone
> - destruct device, which calls sas_rphy_delete
>     - this waits on command queue to drain
> - commands time out and EH thread kicks in
>     - sas_ata_strategy_handler() called
>     - domain revalidation disable attempted
>         - try to grab disco mutex = Deadlock.

Yes, it's a issue I haven't found.


Thanks!
Yijing.




Hi John, I also agree to rework disco mutex


> 
> Thanks,
> John
> 
>> Signed-off-by: Yijing Wang <wangyij...@huawei.com>
>> CC: John Garry <john.ga...@huawei.com>
>> CC: Johannes Thumshirn <jthumsh...@suse.de>
>> CC: Ewan Milne <emi...@redhat.com>
>> CC: Christoph Hellwig <h...@lst.de>
>> CC: Tomas Henzl <the...@redhat.com>
>> CC: Dan Williams <dan.j.willi...@intel.com>
>> ---
>>  drivers/scsi/libsas/sas_expander.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_expander.c 
>> b/drivers/scsi/libsas/sas_expander.c
>> index 9d26c28..077024e 100644
>> --- a/drivers/scsi/libsas/sas_expander.c
>> +++ b/drivers/scsi/libsas/sas_expander.c
>> @@ -776,6 +776,7 @@ static struct domain_device *sas_ex_discover_end_dev(
>>      struct ex_phy *phy = &parent_ex->ex_phy[phy_id];
>>      struct domain_device *child = NULL;
>>      struct sas_rphy *rphy;
>> +    bool prev_lock;
>>      int res;
>>
>>      if (phy->attached_sata_host || phy->attached_sata_ps)
>> @@ -803,6 +804,7 @@ static struct domain_device *sas_ex_discover_end_dev(
>>      sas_ex_get_linkrate(parent, child, phy);
>>      sas_device_set_phy(child, phy->port);
>>
>> +    prev_lock = mutex_is_locked(&child->port->ha->disco_mutex);
>>  #ifdef CONFIG_SCSI_SAS_ATA
>>      if ((phy->attached_tproto & SAS_PROTOCOL_STP) || 
>> phy->attached_sata_dev) {
>>          res = sas_get_ata_info(child, phy);
>> @@ -832,7 +834,11 @@ static struct domain_device *sas_ex_discover_end_dev(
>>                      SAS_ADDR(parent->sas_addr), phy_id, res);
>>              goto out_list_del;
>>          }
>> +        if (prev_lock)
>> +            mutex_unlock(&child->port->ha->disco_mutex);
>>          sas_disc_wait_completion(child->port, DISCE_PROBE);
>> +        if (prev_lock)
>> +            mutex_lock(&child->port->ha->disco_mutex);
>>
>>      } else
>>  #endif
>> @@ -861,7 +867,11 @@ static struct domain_device *sas_ex_discover_end_dev(
>>                      SAS_ADDR(parent->sas_addr), phy_id, res);
>>              goto out_list_del;
>>          }
>> +        if (prev_lock)
>> +            mutex_unlock(&child->port->ha->disco_mutex);
>>          sas_disc_wait_completion(child->port, DISCE_PROBE);
>> +        if (prev_lock)
>> +            mutex_lock(&child->port->ha->disco_mutex);
>>      } else {
>>          SAS_DPRINTK("target proto 0x%x at %016llx:0x%x not handled\n",
>>                  phy->attached_tproto, SAS_ADDR(parent->sas_addr),
>>
> 
> 
> 
> .
> 

Reply via email to