> The lockdep report I obtained this morning with a 2.6.30.4 kernel and
> the two patches applied has been attached to the kernel bugzilla
> entry. This lockdep report was generated while testing the SRPT target
> software. I have double checked that the SRPT target implementation
> does not hold any spinlocks or mutexes while calling functions in the
> IB core. This means that the SRPT target code cannot have caused any
> of the reported lock cycles.
Lockdep is not quite so simple as what you checked, but yes, in this
case it does appear to be pointing at a real (albeit spectacularly
unlikely) deadlock in the core IB stack:

 - ib_cm takes cm_id_priv->lock and calls ib_post_send_mad() from
   there, and ib_mad takes mad_agent_priv->lock in that context;

 - in another context, ib_mad takes mad_agent_priv->lock and does
   cancel_delayed_work(&mad_agent_priv->timed_work) (and internally
   cancel_delayed_work() does del_timer_sync());

 - finally, in another context, a communication established event can
   occur and generate a callback (in interrupt context) to ib_cm,
   where it takes cm_id_priv->lock.

So there can be a chain that deadlocks: if the timer for the
timed_work is running on a CPU, and the interrupt for the
communication established event occurs on that CPU while the timer is
running, then that interrupt handler can try to take cm_id_priv->lock.
Meanwhile, on another CPU, someone could already be holding
cm_id_priv->lock, have called into ib_post_send_mad(), and be spinning
on mad_agent_priv->lock, while on yet another CPU, someone could be
holding mad_agent_priv->lock and doing cancel_delayed_work(). That
last CPU will deadlock waiting in del_timer_sync(), since the timer it
is waiting for has been interrupted by an interrupt handler that will
spin on a spinlock that is part of this chain.

I'm not sure what the right fix is. It does seem to me that this
should be fixed within the ib_mad module, since doing del_timer_sync()
within a spinlocked region seems like the fundamental problem.
However, I'm not sure what the best way to rewrite the ib_mad usage
is.

> By the way, I noticed that while many subsystems in the Linux kernel
> use event queues to report information to higher software layers,
> the IB core makes extensive use of callback functions. The
> combination of nested locking and callback functions can easily lead
> to lock inversion. This effect is well known in the operating system
> world -- see e.g.
> the talk by John Ousterhout about multithreaded versus event-driven
> software (http://home.pacbell.net/ouster/threads.pdf, 1996).

I'm not sure what you mean by this. What would be an example of a
subsystem that uses event queues to report information? I think the
design of the RDMA stack is quite parallel to that of most other
Linux subsystems, and we don't have anything as deadlock-prone as,
say, the network stack's rtnl.

Trying to queue events up instead of calling back from interrupt
context is not all that simple, since one cannot reliably allocate
memory, one must deal with synchronization with the consuming
context, etc. It's probably at least as deadlock-prone to try to
queue as it is to just call back.

Ousterhout's talk certainly makes sense for a certain class of
userspace apps, but he explicitly says that event-driven programming
only uses one CPU, and of course userspace doesn't have hard
interrupt handlers or anything like that. So the kernel is more
complex just because the environment it runs in is a little trickier
than what the kernel provides for userspace.

 - R.

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general
