在 2017/7/14 0:10, John Garry 写道: > On 10/07/2017 08:06, Yijing Wang wrote: >> Disco mutex was introudced to prevent domain rediscovery competing >> with ata error handling(87c8331). If we have already hold the lock >> in sas_revalidate_domain and sync executing probe, deadlock caused, >> because, sas_probe_sata() also need hold disco_mutex. Since disco mutex >> use to prevent revalidata domain happen during ata error handler, >> it should be safe to release disco mutex when sync probe, because >> no new revalidate domain event would be process until the sync return, >> and the current sas revalidate domain finish. >> > > So with your changes we have a chain of synchronised events, running in > separate queues. In theory it sounds ok. > > Then, as you said in the commit message, sas_revalidate_domain() holds the > disco mutex while *all* these chained events occur; so we will continue to > hold the mutex until we have revalidated the domain, meaning until we have > finished destroying or probing new devices. > > But in the domain revalidation when you discover a new ATA device, function > sas_probe_sata() wants to grab the disco mutex and you just temporarily > release it, even though adding a new ATA device kicks in EH. This defeats the > principal of using a mutex at all, which is (according to 87c8331 commit > message) to mutually exclude the domain re-discovery (which has not actually > finished) and the ATA EH (and device destruction). > > Anyway, since we are synchronising this series of events (broadcast event, > domain rediscovery, and device destruction), surely it should be possible to > include the ATA EH as well, so we can actually get rid of the disco mutex > finally, right?
Yes, disco mutex make this issue complex, I checked the commit history, Dan introudce disco mutex and probe/destruct discovery event, so it seems to need a big rework to the libsas process logic, I am so sorry that I have no more time to deal with it, I will leave today, if you like, you could rework my patchset or add additional changes based this patchset. > > Note: I think that there is a problem which you have not seen. Consider > removing a ATA disk with IO active conncted to an expander: > - LLDD sends brodcast event > - sas_revalidate_domain(), which grabs disco mutex > - revalidate finds dev is gone > - destruct device, which calls sas_rphy_delete > - this waits on command queue to drain > - commands time out and EH thread kicks in > - sas_ata_strategy_handler() called > - domain revalidation disable attempted > - try to grab disco mutex = Deadlock. Yes, it's a issue I haven't found. Thanks! Yijing. Hi John, I also agree to rework disco mutex > > Thanks, > John > >> Signed-off-by: Yijing Wang <wangyij...@huawei.com> >> CC: John Garry <john.ga...@huawei.com> >> CC: Johannes Thumshirn <jthumsh...@suse.de> >> CC: Ewan Milne <emi...@redhat.com> >> CC: Christoph Hellwig <h...@lst.de> >> CC: Tomas Henzl <the...@redhat.com> >> CC: Dan Williams <dan.j.willi...@intel.com> >> --- >> drivers/scsi/libsas/sas_expander.c | 10 ++++++++++ >> 1 file changed, 10 insertions(+) >> >> diff --git a/drivers/scsi/libsas/sas_expander.c >> b/drivers/scsi/libsas/sas_expander.c >> index 9d26c28..077024e 100644 >> --- a/drivers/scsi/libsas/sas_expander.c >> +++ b/drivers/scsi/libsas/sas_expander.c >> @@ -776,6 +776,7 @@ static struct domain_device *sas_ex_discover_end_dev( >> struct ex_phy *phy = &parent_ex->ex_phy[phy_id]; >> struct domain_device *child = NULL; >> struct sas_rphy *rphy; >> + bool prev_lock; >> int res; >> >> if (phy->attached_sata_host || phy->attached_sata_ps) >> @@ -803,6 +804,7 @@ static struct domain_device *sas_ex_discover_end_dev( >> sas_ex_get_linkrate(parent, child, phy); >> sas_device_set_phy(child, phy->port); >> >> + prev_lock = mutex_is_locked(&child->port->ha->disco_mutex); >> #ifdef CONFIG_SCSI_SAS_ATA >> if ((phy->attached_tproto & SAS_PROTOCOL_STP) || >> phy->attached_sata_dev) { >> res = sas_get_ata_info(child, phy); >> @@ -832,7 +834,11 @@ static struct domain_device *sas_ex_discover_end_dev( >> SAS_ADDR(parent->sas_addr), phy_id, res); >> goto out_list_del; >> } >> + if (prev_lock) >> + mutex_unlock(&child->port->ha->disco_mutex); >> sas_disc_wait_completion(child->port, DISCE_PROBE); >> + if (prev_lock) >> + mutex_lock(&child->port->ha->disco_mutex); >> >> } else >> #endif >> @@ -861,7 +867,11 @@ static struct domain_device *sas_ex_discover_end_dev( >> SAS_ADDR(parent->sas_addr), phy_id, res); >> goto out_list_del; >> } >> + if (prev_lock) >> + mutex_unlock(&child->port->ha->disco_mutex); >> sas_disc_wait_completion(child->port, DISCE_PROBE); >> + if (prev_lock) >> + mutex_lock(&child->port->ha->disco_mutex); >> } else { >> SAS_DPRINTK("target proto 0x%x at %016llx:0x%x not handled\n", >> phy->attached_tproto, SAS_ADDR(parent->sas_addr), >> > > > > . >