On Jun 23, 2011, at 11:18 PM, Scott Long wrote: > > On Jun 23, 2011, at 9:01 PM, Justin T. Gibbs wrote: > >> On 6/22/11 4:09 PM, Kenneth D. Merry wrote: >>> On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote: >>>> On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote: >>>>> These two are interesting: >>>>> >>>>>> http://img825.imageshack.us/img825/1249/21062011014m.jpg >>>>>> http://img839.imageshack.us/img839/3791/21062011015.jpg >>>>> >>>>> It looks like the GEOM event thread is stuck inside the cd(4) >> driver. The >>>>> cd(4) driver is trying to acquire the peripheral lock, and is sleeping >>>>> until it gets it. >>>>> >>>>> What isn't clear is who is holding it. >> >> ... >> >>> The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So >>> that tells me that one of several things is going on: >>> >>> - There is a path in the cd(4) driver where it can call cam_periph_hold() >>> but not cam_periph_unhold(). >>> >>> - There is another thread in the system that has called cam_periph_hold(), >>> and has gotten stuck before it can call cam_periph_unhold(). >>> >>> - The hold/unhold logic is broken, and there is a case where a thread >>> waiting for the lock can miss the wakeup. After looking at the code, I >>> don't think this is the case, but I may have missed something. >>> >>> So it is probably one of the first two cases. >> >> ... >> >> I have a theory for the cause of this hang. >> >> The commit that triggers this problem added calls to g_access() during the >> geom_dev probe. I believe this hit a race in cdregister() where >> the periph hold lock is dropped around the changer probe code. Why the >> periph hold lock is dropped there, I do not know as I haven't fully >> reviewed the changer probe code. >> > > Are you talking about this? > > disk_create(softc->disk, DISK_VERSION); > cam_periph_lock(periph); > cam_periph_unhold(periph); > [...] > if (((cgd->ccb_h.target_lun > 0) > && ((softc->quirks & CD_Q_NO_CHANGER) == 0)) > || ((softc->quirks & CD_Q_CHANGER) != 0)) { > > The unhold there compliments the hold that was done prior to disk_create(). > The hold/unhold is done as a hack around the need to drop the periph/sim > mutex while calling disk_create(), due to the later's insistence on using > blocking calls. I've wanted to re-think how that pattern is done (it's the > same gross hack in nearly all of the periph drivers), but haven't gotten > around to it. If the 'hold' semaphore needs to be held longer to prevent the > race that you're theorizing, then it should be possible to simply extend its > coverage in the code block, but I'm not sure if it'll result in an unintended > deadlock with the changer enumeration/matching code. I _think_ that it'll be > ok, but the density of magic in the code is a bit overwhelming at this time > of night =-) >
Actually, what's probably happening is that the sim/periph lock is being dropped but the hold semaphore is held, disk_create() is called, which kicks off GEOM to do GEOM-ish things including the new g_access() call. It tries to call cdopen(), which grabs the periph/sim mutex in order to then get the hold semaphore, and winds up sleeping because the semaphore is already held in cdregister(). If/when disk_create() returns, cdregister() winds up waiting for the periph lock because it's held by cdopen(), and now you have a deadlock between the two. Again, this is my fault for being lazy and not re-organzing the periph drivers so that disk_create() is safely called without shady locking hacks. I don't think that extending the coverage of the hold semaphore is going to help in this case. The periphs need to adopt a new pattern where disk_create() isn't called until the periph is completely initialized and ready for periph_open() to be called without any locking gymnastics. Scott _______________________________________________ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"