On Mon, 3 May 2004, David Brownell wrote:
If all the locking obeys the same (correct, and presumably top-down) locking protocol, then deadlocks won't happen. That'd include the code doing the reset ... which would see -ENODEV faults starting at some point, and then fail cleanly.
No, you've missed a key point. Let's say disconnect and reset happen at nearly the same time -- not unlikely, since a user frustrated at the errors that the reset is trying to correct might very well unplug the device. khubd locks the hub, then the device, then calls the driver's
That was the example wherein I made that typo .. :)
Other than locking the hub of the device being disconnected, none of that should be happening before the tree of devices being unplugged is marked as USB_STATE_NOTATTACHED. (Often a degenerate tree: one device.)
disconnect(). Meanwhile the error handler part of the driver has called usb_reset_device() which is trying to lock the hub.
Whatever any driver tries to do with a USB_STATE_NOTATTACHED device should be producing -ENODEV responses. We could even extend that to acquiring its lock ... though I expect that should be necessary.
Your strictly top-down analysis would say: Okay, khubd will proceed, and when it's done the device reset will proceed and fail with -ENODEV or -ESHUTDOWN. But that's wrong -- khubd can't proceed because the driver _mustn't_ return from disconnect() until it is finished using the device, i.e, until usb_reset_device() returns. (In the case of usb-storage it _can't_ return; scsi_unregister_host() will block until the error-handler thread has finished with the device.)
That's not _my_ analysis ... :)
One or the other task will lock the hub first. Simple case: khubd wins, driver's top-down lock acquisition will first block (because it can't get past khubd) and then later fail (the device is gone, though that task has an old device pointer that it's still got to release).
The not-as-simple case is that the reset() gets in first, and then khubd marks the device as NOTATTACHED before it blocks (on the reset task). At some point the driver notices (starts getting -ENODEV), stops, gives up, drops its lock, then khubd finishes.
This disconnect() issue is a parallel of the open()/disconnect() issue. In both cases, there's state that must linger after disconnect() returns, and be cleaned up later. In one case it's what close() accesses, and it's associated with a user file handle. In the other case, it's what the SCSI EH task will have to work with as it's noticing -ENODEV.
What's needed is a way for usb_reset_device() to try to acquire the hub's
serialize semaphore in such a way that it can be interrupted and fail if
and when the device is marked DISCONNECTED. Normal semaphores won't
permit something like that. (I don't think sending a signal and using down_interruptible() is a good approach.)
Erm, I thought I'd described several parts of exactly that. The basic version: locktree() failing if the device has actually been removed from the tree by then. (Possibly khubd could do that bit rather early ...) The slightly fancier version, checking for NOTATTACHED as it acquires locks, just fails a wee bit faster. Or as above, later requests failing because state is NOTATTACHED.
// internal to usbcore void change_device_state(struct usb_device *dev, enum usb_device_state state) { static spinlock_t lock = SPIN_LOCK_UNLOCKED; unsigned long flags;
spin_lock_irqsave(&lock, flags); if (dev->state != USB_STATE_UNATTACHED) { dev->state = state; /* hey, why not recurse right here? */ } spin_unlock_irqrestore(&lock, flags); }
I'm not quite sure what your concern is with children[].
Same as before. If we recurse right there, we end up calling
change_device_state(dev->children[i], USB_STATE_DISCONNECTED);
How do we know that between the time we make that call and the time the
inner invocation looks at (dev->children[i])->state, the child device
structure hasn't been deallocated? Remember, we aren't allowed to acquire a topology semaphore.
What I've been talking about is a scheme where hub->serialize serves as the (physical) topology lock for that subtree ... and everyone uses the same simple locking convention. I'd require whoever changes any child's state to NOTATTACHED to hold that lock, then later to actually get rid of the device object.
The "why not recurse" stage would be an optimization, and you could be right that it's not easily worked.
- Dave
-------------------------------------------------------
This SF.Net email is sponsored by: Oracle 10g
Get certified on the hottest thing ever to hit the market... Oracle 10g. Take an Oracle 10g class now, and we'll give you the exam FREE. http://ads.osdn.com/?ad_id=3149&alloc_id=8166&op=click
_______________________________________________
[EMAIL PROTECTED]
To unsubscribe, use the last form field at:
https://lists.sourceforge.net/lists/listinfo/linux-usb-devel