We saw the bug on a uniprocessor system running the ipath driver, where the
consumer is ipoib and the group is the IPv4 broadcast group. When we take down
the link of the switch port connected to the device across the cable, ipoib
rushes to leave the group and then rejoin it. On this system the join "crosses
the leave" and the SA does not take the node into account when computing the
multicast routing of the group --> the node does not get the broadcast traffic.

I've read back over this description a few times, and I still don't fully grok the problem. Can you clarify if the following sequence is what's happening?

1. The node has joined the multicast group. Meaning that the SA has routed multicast traffic to the node.
2. You take down the link of the switch port that connects the node. Is this done via a program?
3. The port is brought back online. This generates a PORT_ACTIVE event, but the previous event was also PORT_ACTIVE.
4. ipoib leaves the group.
5. ipoib re-joins the group.
6. The multicast module isn't aware that any errors have occurred on the multicast group, so it simply completes the join request at step 5 without SA involvement.
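
To check that we mean the same thing, here is a toy Python model of the sequence above. It is only a sketch of my reading, not the real multicast code: ToySA, ToyMulticastModule, the reference count and the deferred leave are all hypothetical names and mechanisms, and it assumes the SA forgets the node when the link goes down, which is exactly the part I'm unsure about.

```python
from collections import deque


class ToySA:
    """Hypothetical stand-in for the SA: the set of nodes it routes to."""

    def __init__(self):
        self.routed = set()


class ToyMulticastModule:
    """Hypothetical local module that reference-counts group membership."""

    def __init__(self, sa, node):
        self.sa = sa
        self.node = node
        self.refs = 0           # local membership count
        self.pending = deque()  # leaves that have not been processed yet

    def join(self):
        self.refs += 1
        if self.refs == 1:
            # First local member: the request really goes to the SA.
            self.sa.routed.add(self.node)
            return "sent to SA"
        # The module still believes it is a member, so no SA traffic.
        return "completed locally"

    def leave_async(self):
        # Assumption: the leave is queued and handled later.
        self.pending.append("leave")

    def run_pending(self):
        while self.pending:
            self.pending.popleft()
            self.refs -= 1
            if self.refs == 0:
                self.sa.routed.discard(self.node)


def simulate(leave_runs_first):
    sa = ToySA()
    mc = ToyMulticastModule(sa, "node1")
    mc.join()                   # step 1: node joins, SA routes to it
    sa.routed.discard("node1")  # steps 2-3: ASSUMED effect of the link bounce
    mc.leave_async()            # step 4: ipoib leaves
    if leave_runs_first:
        mc.run_pending()        # leave is processed before the join
    result = mc.join()          # step 5: ipoib re-joins
    mc.run_pending()            # any leave still queued runs now
    return result, "node1" in sa.routed
```

In this model, simulate(True) sends the re-join to the SA and the node ends up routed, while simulate(False) has the join cross the leave: it completes locally and the SA never re-adds the node.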

If I'm understanding this, somewhere in the above sequence the multicast routing to this node is lost. Either the SA removed the node from the group, or the switch lost its routing tables, or ...?

I'm also trying to understand how the problem would apply to a different setup:

node 1 <-> switch A <-> switch B <-> switch C <-> SA

Suppose the same link down/up occurred between switch A and switch B. What happens to the multicast members to the left of switch B? Will node 1 see a PORT_ACTIVE event in this case as well?

- Sean