During IPMP test execution, we hit an interesting deadlock.  It's a little
hard to explain without getting hip-deep in the details of the new
implementation, but basically:

        T1: running a timeout (mld_timeout_handler()), waiting to enter
            the IPSQ for bge3 via ipsq_enter().

        T2: finishing an IPMP "group leave" operation on bge3.

            Specifically, bge3 has been removed from the group, and the
            IPMP code has asked the IPSQ code to switch bge3's IPSQ xop
            from the group IPSQ to bge3's xop as part of exiting the IPSQ.
            In this case, we must actually exit two xops: the original
            group xop, and bge3's xop (which we were implicitly also
            inside when the "group leave" operation started, since bge3
            was part of the group).  To do this, ipsq_dq() calls
            ipsq_exit() on the group xop once the group xop has been
            drained.  The catch here is that we're still exclusive on
            bge3's xop when ipsq_exit() runs, and ipsq_exit() may try to
            start the MLD/IGMP timers.  If it does, it will cause a
            deadlock with T1 because it will try to untimeout() T1, which
            cannot complete until T1 completes, which cannot happen
            until T2 exits bge3's xop.
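
The cycle described above can be pictured as a tiny wait-for graph.  The
sketch below is a user-space illustration only, not code from the IP
module: the type wait_graph_t and the functions has_deadlock(),
graph_with_untimeout() and graph_without_untimeout() are hypothetical
names invented for this example.  It assumes each thread blocks on at
most one other thread.

```c
#include <assert.h>
#include <stdbool.h>

/* The two threads from the description above. */
enum { T1_TIMEOUT_HANDLER, T2_GROUP_LEAVE, NWAITERS };

#define NOT_BLOCKED (-1)

/* waits_for[i] is the thread that thread i is blocked on, if any. */
typedef struct wait_graph {
    int waits_for[NWAITERS];
} wait_graph_t;

/* Follow the wait-for chain from 'start'; a return to 'start' is a cycle. */
static bool
has_deadlock(const wait_graph_t *g, int start)
{
    assert(start >= 0 && start < NWAITERS);

    int cur = g->waits_for[start];
    for (int hops = 0; cur != NOT_BLOCKED && hops < NWAITERS; hops++) {
        if (cur == start)
            return (true);
        cur = g->waits_for[cur];
    }
    return (false);
}

/* T2 cancels T1's timeout while still exclusive on bge3's xop. */
static wait_graph_t
graph_with_untimeout(void)
{
    wait_graph_t g;

    /* T1 is in ipsq_enter(), waiting for T2 to exit bge3's xop. */
    g.waits_for[T1_TIMEOUT_HANDLER] = T2_GROUP_LEAVE;
    /* T2 is in untimeout(), waiting for T1's handler to complete. */
    g.waits_for[T2_GROUP_LEAVE] = T1_TIMEOUT_HANDLER;
    return (g);
}

/* T2 never touches the timers while draining the group xop. */
static wait_graph_t
graph_without_untimeout(void)
{
    wait_graph_t g;

    g.waits_for[T1_TIMEOUT_HANDLER] = T2_GROUP_LEAVE;
    g.waits_for[T2_GROUP_LEAVE] = NOT_BLOCKED;
    return (g);
}
```

In the first graph each thread waits on the other, so neither can make
progress.  In the second, T2 is never blocked, so it eventually exits
bge3's xop and T1 can proceed.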

I explored a few fixes, and the simplest seems to be to refactor ipsq_exit()
into two functions, one of which (ipsq_drain()) doesn't start the timers.
That version is used from ipsq_dq(), which eliminates the possibility of
the deadlock.  Note that the timers will still be started when bge3's
xop is exited.  Please have a look.  This is hairy stuff, so please ask
questions if it doesn't make sense.

  http://zhadum.east.sun.com/ws/clearview/clearview-ipmpdev/webrev
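
For readers without access to the webrev, the shape of the refactoring
can be sketched roughly as follows.  This is a simplified user-space
model under stated assumptions, not the actual change: xop_model_t,
drain_model(), exit_model(), dq_model() and the timers_started counter
are hypothetical stand-ins, and the real ipsq_exit(), ipsq_drain() and
ipsq_dq() take different arguments and do considerably more work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an exclusive operation (xop). */
typedef struct xop_model {
    bool xm_held;       /* are we exclusive on this xop? */
    int  xm_pending;    /* queued operations not yet drained */
} xop_model_t;

/* Counts how often the MLD/IGMP timers would have been (re)started. */
static int timers_started;

/*
 * Models the new drain-only function: process queued work and give up
 * exclusivity, but never touch the timers.
 */
static void
drain_model(xop_model_t *xop)
{
    assert(xop->xm_held);
    while (xop->xm_pending > 0)
        xop->xm_pending--;      /* "process" one queued operation */
    xop->xm_held = false;
}

/* Models the full exit path: drain, then start the timers. */
static void
exit_model(xop_model_t *xop)
{
    drain_model(xop);
    timers_started++;
}

/*
 * Models the switch away from the group xop.  It runs while the caller
 * is still exclusive on the interface's own xop, so it uses the
 * drain-only variant.
 */
static void
dq_model(xop_model_t *group_xop)
{
    drain_model(group_xop);
}
```

In this model, calling dq_model() on the group xop leaves
timers_started untouched, and the counter is only bumped by the later
exit_model() call on bge3's own xop, which mirrors the note above that
the timers are still started when bge3's xop is exited.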

-- 
meem
