Re: Lockdep problem: v3.18+ (with yesterday's Linus tip - 6f51ee709e4c)

2015-01-07 Thread Russell King - ARM Linux
On Wed, Jan 07, 2015 at 10:42:14AM -0500, Alan Stern wrote:
> On Thu, 18 Dec 2014, Greg Kroah-Hartman wrote:
> > That seems reasonable to me, unbinding when a reset is happening is
> > going to be a rare condition, but if we get rid of it, and we try to
> > queue a reset for a device that is gone, we will just fail the reset,
> > right?  If all should be fine, I have no objection to removing it.
> 
> Russell, can you reproduce that lockdep violation whenever you want?

I can certainly try it - I move the logitek receiver around between about
five machines depending on which I'm wanting to use.  Obviously, having
had the Christmas holidays recently, it hasn't been moved so much.

However, at the moment, I'm still not doing much in the way of kernel
work due to ongoing illness.

> If you can, does the following patch help?

I'll give it a go once I've worked out how reproducable it is, many
thanks for looking into this.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lockdep problem: v3.18+ (with yesterday's Linus tip - 6f51ee709e4c)

2015-01-07 Thread Alan Stern
On Thu, 18 Dec 2014, Greg Kroah-Hartman wrote:

> On Thu, Dec 18, 2014 at 02:40:48PM -0500, Alan Stern wrote:
> > On Thu, 18 Dec 2014, Russell King - ARM Linux wrote:
> > 
> > > While unplugging a Logitek Keyboard/mouse micro-receiver, I got the
> > > lockdep splat below.
> > > 
> > > However, I don't fully understand this splat - I see nothing in
> > > flush_work() nor process_one_work() making use of "intf->reset_ws" -
> > > which seems to be a USB thing.  I guess lockdep is being re-used to
> > > validate work stuff, and "lock" is just plain misleading.
> > 
> > That sounds right.  intf->reset_ws is a work_struct for a reset 
> > request.
> > 
> > > usb 2-1.1: USB disconnect, device number 3
> > > 
> > > =
> > > [ INFO: possible recursive locking detected ]
> > > 3.18.0+ #1459 Not tainted
> > > -
> > > kworker/0:1/2758 is trying to acquire lock:
> > >  ((&intf->reset_ws)){+.+.+.}, at: [] flush_work+0x0/0x264
> > > 
> > > but task is already holding lock:
> > >  ((&intf->reset_ws)){+.+.+.}, at: [] 
> > > process_one_work+0x130/0x4b4
> > 
> > I think what happened is that we called cancel_work_sync() for the 
> > reset_ws embedded in one intf structure while running in the workqueue 
> > routine for the reset_ws embedded in a different intf structure.
> > 
> > Assuming this interpretation is correct, I don't see how we can prevent 
> > lockdep from making this mistake.  Here's the problem:
> > 
> > An interface driver can queue a request for a reset.
> > 
> > A reset can cause interface drivers to be unbound from their
> > interfaces.
> > 
> > If an interface driver is unbound, any pending reset request 
> > that it queued has to be cancelled.  Otherwise the workqueue
> > would most likely end up carrying out the reset at a later time 
> > when nobody wants it.
> > 
> > (The code contains explicit protection for the case where the interface 
> > being unbound is the one whose reset request is currently in progress.)
> > 
> > Now perhaps we don't need that last step.  We could allow a queued
> > reset to take place even after the driver that requested it is gone.  
> > Most of the time these reset requests occur in response to I/O errors
> > when a USB device is unplugged -- as happened in this example -- in
> > which case it doesn't matter what we do.
> > 
> > So even though it's not entirely pleasant, my inclination is to solve
> > the problem by getting rid of usb_cancel_queued_reset() in
> > drivers/usb/core/driver.c.
> > 
> > Comments, anybody?
> 
> That seems reasonable to me, unbinding when a reset is happening is
> going to be a rare condition, but if we get rid of it, and we try to
> queue a reset for a device that is gone, we will just fail the reset,
> right?  If all should be fine, I have no objection to removing it.

Russell, can you reproduce that lockdep violation whenever you want?

If you can, does the following patch help?

Alan Stern

-

The USB stack provides a mechanism for drivers to request an
asynchronous device reset (usb_queue_reset_device()).  The mechanism
uses a work item (reset_ws) embedded in the usb_interface structure
used by the driver, and the reset is carried out by a work queue
routine.

The asynchronous reset can race with driver unbinding.  When this
happens, we try to cancel the queued reset before unbinding the
driver, on the theory that the driver won't care about any resets once
it is unbound.

However, thanks to the fact that lockdep now tracks work queue
accesses, this can provoke a lockdep warning in situations where the
device reset causes another interface's driver to be unbound; see

http://marc.info/?l=linux-usb&m=141893165203776&w=2

for an example.  The reason is that the work routine for reset_ws in
one interface calls cancel_queued_work() for the reset_ws in another
interface.  Lockdep thinks this might lead to a work routine trying to
cancel itself.  The simplest solution is not to cancel queued resets
when unbinding drivers.

This means we now need to acquire a reference to the usb_interface
when queuing a reset_ws work item and to drop the reference when the
work routine finishes.  We also need to make sure that the
usb_interface structure doesn't outlive its parent usb_device; this
means acquiring and dropping a reference when the interface is created
and destroyed.

Signed-off-by: Alan Stern 
Reported-by: Russell King - ARM Linux 

---

 drivers/usb/core/driver.c  |   17 -
 drivers/usb/core/hub.c |   25 +
 drivers/usb/core/message.c |   23 +++
 include/linux/usb.h|5 -
 4 files changed, 12 insertions(+), 58 deletions(-)

Index: usb-3.19/include/linux/usb.h
===
--- usb-3.19.orig/include/linux/usb.h
+++ usb-3.19/inclu

Re: Lockdep problem: v3.18+ (with yesterday's Linus tip - 6f51ee709e4c)

2014-12-18 Thread Greg Kroah-Hartman
On Thu, Dec 18, 2014 at 02:40:48PM -0500, Alan Stern wrote:
> On Thu, 18 Dec 2014, Russell King - ARM Linux wrote:
> 
> > While unplugging a Logitek Keyboard/mouse micro-receiver, I got the
> > lockdep splat below.
> > 
> > However, I don't fully understand this splat - I see nothing in
> > flush_work() nor process_one_work() making use of "intf->reset_ws" -
> > which seems to be a USB thing.  I guess lockdep is being re-used to
> > validate work stuff, and "lock" is just plain misleading.
> 
> That sounds right.  intf->reset_ws is a work_struct for a reset 
> request.
> 
> > usb 2-1.1: USB disconnect, device number 3
> > 
> > =
> > [ INFO: possible recursive locking detected ]
> > 3.18.0+ #1459 Not tainted
> > -
> > kworker/0:1/2758 is trying to acquire lock:
> >  ((&intf->reset_ws)){+.+.+.}, at: [] flush_work+0x0/0x264
> > 
> > but task is already holding lock:
> >  ((&intf->reset_ws)){+.+.+.}, at: [] process_one_work+0x130/0x4b4
> 
> I think what happened is that we called cancel_work_sync() for the 
> reset_ws embedded in one intf structure while running in the workqueue 
> routine for the reset_ws embedded in a different intf structure.
> 
> Assuming this interpretation is correct, I don't see how we can prevent 
> lockdep from making this mistake.  Here's the problem:
> 
>   An interface driver can queue a request for a reset.
> 
>   A reset can cause interface drivers to be unbound from their
>   interfaces.
> 
>   If an interface driver is unbound, any pending reset request 
>   that it queued has to be cancelled.  Otherwise the workqueue
>   would most likely end up carrying out the reset at a later time 
>   when nobody wants it.
> 
> (The code contains explicit protection for the case where the interface 
> being unbound is the one whose reset request is currently in progress.)
> 
> Now perhaps we don't need that last step.  We could allow a queued
> reset to take place even after the driver that requested it is gone.  
> Most of the time these reset requests occur in response to I/O errors
> when a USB device is unplugged -- as happened in this example -- in
> which case it doesn't matter what we do.
> 
> So even though it's not entirely pleasant, my inclination is to solve
> the problem by getting rid of usb_cancel_queued_reset() in
> drivers/usb/core/driver.c.
> 
> Comments, anybody?

That seems reasonable to me, unbinding when a reset is happening is
going to be a rare condition, but if we get rid of it, and we try to
queue a reset for a device that is gone, we will just fail the reset,
right?  If all should be fine, I have no objection to removing it.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lockdep problem: v3.18+ (with yesterday's Linus tip - 6f51ee709e4c)

2014-12-18 Thread Alan Stern
On Thu, 18 Dec 2014, Russell King - ARM Linux wrote:

> While unplugging a Logitek Keyboard/mouse micro-receiver, I got the
> lockdep splat below.
> 
> However, I don't fully understand this splat - I see nothing in
> flush_work() nor process_one_work() making use of "intf->reset_ws" -
> which seems to be a USB thing.  I guess lockdep is being re-used to
> validate work stuff, and "lock" is just plain misleading.

That sounds right.  intf->reset_ws is a work_struct for a reset 
request.

> usb 2-1.1: USB disconnect, device number 3
> 
> =
> [ INFO: possible recursive locking detected ]
> 3.18.0+ #1459 Not tainted
> -
> kworker/0:1/2758 is trying to acquire lock:
>  ((&intf->reset_ws)){+.+.+.}, at: [] flush_work+0x0/0x264
> 
> but task is already holding lock:
>  ((&intf->reset_ws)){+.+.+.}, at: [] process_one_work+0x130/0x4b4

I think what happened is that we called cancel_work_sync() for the 
reset_ws embedded in one intf structure while running in the workqueue 
routine for the reset_ws embedded in a different intf structure.

Assuming this interpretation is correct, I don't see how we can prevent 
lockdep from making this mistake.  Here's the problem:

An interface driver can queue a request for a reset.

A reset can cause interface drivers to be unbound from their
interfaces.

If an interface driver is unbound, any pending reset request 
that it queued has to be cancelled.  Otherwise the workqueue
would most likely end up carrying out the reset at a later time 
when nobody wants it.

(The code contains explicit protection for the case where the interface 
being unbound is the one whose reset request is currently in progress.)

Now perhaps we don't need that last step.  We could allow a queued
reset to take place even after the driver that requested it is gone.  
Most of the time these reset requests occur in response to I/O errors
when a USB device is unplugged -- as happened in this example -- in
which case it doesn't matter what we do.

So even though it's not entirely pleasant, my inclination is to solve
the problem by getting rid of usb_cancel_queued_reset() in
drivers/usb/core/driver.c.

Comments, anybody?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html