During IPMP test execution, we hit an interesting deadlock. It's a little
hard to explain without getting hip-deep in the details of the new
implementation, but basically:

    T1: running a timeout (mld_timeout_handler()), waiting to enter
        the IPSQ for bge3 via ipsq_enter().

    T2: finishing an IPMP "group leave" operation on bge3.
        Specifically, bge3 has been removed from the group, and the
        IPMP code has asked the IPSQ code to switch bge3's IPSQ xop
        from the group IPSQ to bge3's xop as part of exiting the IPSQ.
        In this case, we must actually exit two xop's: the original
        group xop, and bge3's xop (which we were implicitly also
        inside when the "group leave" operation started, since bge3
        was part of the group). To do this, ipsq_dq() calls
        ipsq_exit() on the group xop once the group xop has been
        drained. The catch here is that we're still exclusive on
        bge3's xop when ipsq_exit() runs, and ipsq_exit() may try to
        start the MLD/IGMP timers. If it does, it will cause a
        deadlock with T1 because it will try to untimeout() T1, which
        cannot complete until T1 completes, which cannot happen
        until T2 exits bge3's xop.

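
The circular wait above can be written down as a tiny wait-for check. This is only a userland toy and not the kernel code: the waits_for[] array, has_wait_cycle(), and the two scenario functions are hypothetical names made up for illustration. It assumes each thread blocks on at most one other thread.

```c
#include <assert.h>
#include <stdbool.h>

enum { T1, T2, NTHREADS };
#define NOBODY (-1)

/*
 * waits_for[t] is the thread that t is blocked on, or NOBODY.
 * Follow the chain from 'start'; if we get back to 'start',
 * that thread is part of a circular wait.
 */
bool
has_wait_cycle(const int waits_for[], int n, int start)
{
    int cur = waits_for[start];

    for (int steps = 0; steps < n && cur != NOBODY; steps++) {
        if (cur == start)
            return (true);
        cur = waits_for[cur];
    }
    return (false);
}

/* The situation described above. */
bool
deadlocks_before_fix(void)
{
    int waits_for[NTHREADS];

    waits_for[T1] = T2; /* ipsq_enter() waits for T2 to exit bge3's xop */
    waits_for[T2] = T1; /* untimeout() waits for T1's handler to finish */
    return (has_wait_cycle(waits_for, NTHREADS, T1));
}

/* Same situation if T2 never calls untimeout() while holding bge3's xop. */
bool
deadlocks_after_fix(void)
{
    int waits_for[NTHREADS];

    waits_for[T1] = T2;     /* T1 still waits for bge3's xop */
    waits_for[T2] = NOBODY; /* T2 is free to finish and exit the xop */
    return (has_wait_cycle(waits_for, NTHREADS, T1));
}
```

Breaking either edge removes the cycle; the sketch breaks the T2 edge, so that T2 no longer waits on T1 while it is still exclusive on bge3's xop.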
I explored a few fixes, and the simplest seems to be to refactor ipsq_exit()
into two functions, one of which (ipsq_drain()) doesn't start the timers.
That version is used from ipsq_dq(), which eliminates the possibility of
the deadlock. Note that the timers will still be started when bge3's
xop is exited. Please have a look. This is hairy stuff, so please ask
questions if it doesn't make sense.
http://zhadum.east.sun.com/ws/clearview/clearview-ipmpdev/webrev
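
For readers without access to the webrev, the shape of the refactoring can be sketched roughly as below. This is a simplified userland model under stated assumptions, not the code in the webrev: xop_t here is a made-up struct, the *_sketch() functions and start_timers() are hypothetical stand-ins, and the assumption that the full exit path is "drain, then start timers" is mine.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an exclusive operation (xop); not the real structure. */
typedef struct xop {
    bool held;    /* are we currently exclusive on this xop? */
    int  pending; /* queued operations still waiting to run */
} xop_t;

int timers_started; /* number of times the timers were (re)started */

/* Stand-in for starting the MLD/IGMP timers. */
void
start_timers(void)
{
    timers_started++;
}

/* Drain queued work and give up the xop WITHOUT touching the timers. */
void
ipsq_drain_sketch(xop_t *xop)
{
    assert(xop->held);
    xop->pending = 0;
    xop->held = false;
}

/* Full exit (assumed structure): drain, then start the timers. */
void
ipsq_exit_sketch(xop_t *xop)
{
    ipsq_drain_sketch(xop);
    start_timers();
}

/* Leave the group xop while still exclusive on the interface's own xop. */
void
ipsq_dq_sketch(xop_t *group, xop_t *own)
{
    assert(own->held);        /* still inside bge3's xop here */
    ipsq_drain_sketch(group); /* timer-free variant: nothing to wait on */
}

/* Walk through the "group leave" sequence; returns timer starts seen
 * while bge3's xop was still held (the fix wants this to be zero). */
int
group_leave_sketch(void)
{
    xop_t group = { true, 2 };
    xop_t own = { true, 1 };
    int during;

    timers_started = 0;
    ipsq_dq_sketch(&group, &own);
    during = timers_started;
    ipsq_exit_sketch(&own);   /* timers start here, with nothing held */
    return (during);
}
```

In this model the timers are started exactly once, and only after the last xop has been released, which matches the note above that the timers still get started when bge3's xop is exited.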
--
meem