Sean Hefty wrote:
I've read back over this description a few times, and I still don't fully grok the problem. Can you clarify if the following sequence is what's happening?

1. The node has joined the multicast group. Meaning that the SA has routed multicast traffic to the node.
2. You take down the link of the switch port that connects the node. Is this done via a program?
3. The port is brought back online. This generates a PORT_ACTIVE event, but the previous event was also PORT_ACTIVE.
4. ipoib leaves the group.
5. ipoib re-joins the group.
6. The multicast module isn't aware that any errors have occurred on the multicast group, so simply completes the join request at step 5 without SA involvement.

If I'm understanding this, somewhere in the above sequence the multicast routing to this node is lost. Either the SA removed the node from the group, or the switch lost its routing tables, or ...?

Indeed, I am taking the switch link down via a program.

Now, in this case there was --no-- previous event; when the port was brought back online there was a PORT_ACTIVE event (it's a driver issue which we are looking into). However, from the viewpoint of the SA there was a "GID out" event, so the HCA port was dropped from the multicast group and the multicast routing (spanning tree, MFT configuration, etc.) was computed without this port being included. This is the ipoib logging of what happens from its perspective (I have added the event number to the "port state change event" print):

ib0: Port state change event 9
ib0: Flushing ib0
ib0: flushing
ib0: downing ib_dev
ib0: stopping multicast thread
ib0: flushing multicast list
ib0: leaving MGID ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: leaving MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff
ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:ffff:ffff
ib0: starting multicast thread
ib0: restarting multicast task
ib0: stopping multicast thread
ib0: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: starting multicast thread
ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff
ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0)
ib0: Created ah c504b7a0
ib0: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV c504b7a0, LID 0xc000, SL 0
ib0: joining MGID ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: join completion for ff12:401b:ffff:0000:0000:0000:0000:0001 (status 0)
ib0: Created ah c504ba20
ib0: MGID ff12:401b:ffff:0000:0000:0000:0000:0001 AV c504ba20, LID 0xc001, SL 0
ib0: successfully joined all multicast groups
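
Both joins complete with status 0 even though the SA has already dropped the port. A toy Python model of that sequence may make the mismatch easier to see. It is only a sketch: the class and function names are hypothetical, and the caching rule (a leave keeps the module's cached group state, which is cleared only by an explicit invalidation) is an assumption chosen to reproduce step 6, not a description of the actual multicast module code.

```python
class SubnetAdmin:
    """Hypothetical stand-in for the SA's view of group membership."""

    def __init__(self):
        self.members = {}  # mgid -> set of port names

    def join(self, mgid, port):
        self.members.setdefault(mgid, set()).add(port)

    def gid_out(self, port):
        # The port left the fabric: drop it from every group.
        for ports in self.members.values():
            ports.discard(port)

    def is_routed(self, mgid, port):
        return port in self.members.get(mgid, set())


class CachingMulticastModule:
    """Hypothetical module that answers joins from cached state when it can."""

    def __init__(self, sa, port):
        self.sa = sa
        self.port = port
        self.cached = set()  # MGIDs believed to be joined at the SA

    def join(self, mgid):
        if mgid in self.cached:
            return "local"  # step 6: completed without SA involvement
        self.sa.join(mgid, self.port)
        self.cached.add(mgid)
        return "sa"

    def leave(self, mgid):
        # Assumption: only the caller's reference is dropped; the cache survives.
        pass

    def invalidate(self):
        # What a port event handler could do to force real SA joins.
        self.cached.clear()


def bounce_link(invalidate_on_event):
    sa = SubnetAdmin()
    port = "node1-port1"
    mc = CachingMulticastModule(sa, port)
    mgid = "ff12:401b:ffff:0000:0000:0000:0000:0001"
    mc.join(mgid)              # step 1: initial join reaches the SA
    sa.gid_out(port)           # steps 2-3: link bounce, SA sees "GID out"
    if invalidate_on_event:
        mc.invalidate()
    mc.leave(mgid)             # step 4: ipoib leaves
    how = mc.join(mgid)        # step 5: ipoib re-joins
    return how, sa.is_routed(mgid, port)
```

In this model `bounce_link(False)` returns a join that was completed locally while the SA no longer lists the port, which matches the "status 0" completions above. With `bounce_link(True)` the re-join goes back to the SA and the port is a member again.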

I'm also trying to understand how the problem would apply to a different setup:

node 1 <-> switch A <-> switch B <-> switch C <-> SA

Suppose the same link down/up occurred between switch A and switch B. What happens to the multicast members to the left of switch B? Will node 1 see a PORT_ACTIVE event in this case as well?

The members of a multicast group are only HCA ports. Indeed, join/leave requests from members cause the SA to trigger the SM to recompute the multicast routing. However, there are other causes, such as a port going down anywhere in the fabric: if it's an HCA port, it is dropped from all the groups it is a member of, and if it's a switch port, all the affected unicast AND multicast routing must be recomputed by the SM.

The host would only see port up/down events as a result of link state changes on its local port or on the port connected to it through the cable.
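
Applied to the chain in the question, the two rules above can be sketched roughly as follows. This is an illustrative Python model only: the function names are hypothetical, and any device whose name does not start with "switch" is assumed to be an HCA.

```python
# The chain from the question, as a list of links (pairs of device names).
LINKS = [
    ("node 1", "switch A"),
    ("switch A", "switch B"),
    ("switch B", "switch C"),
    ("switch C", "SA"),
]


def devices_seeing_port_event(bounced_link):
    """Only the two ends of the bounced link observe a port up/down event."""
    if bounced_link not in LINKS:
        raise ValueError("unknown link: %r" % (bounced_link,))
    return set(bounced_link)


def sm_reaction(bounced_link):
    """What the SM has to do, following the HCA-port vs switch-port rule."""
    hca_ends = [d for d in bounced_link if not d.startswith("switch")]
    if hca_ends:
        return "drop %s from all its multicast groups" % hca_ends[0]
    return "recompute affected unicast and multicast routing"
```

For the link between switch A and switch B, `devices_seeing_port_event` contains only the two switches, so node 1 sees no PORT_ACTIVE event, and `sm_reaction` reports a routing recomputation rather than a membership change.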

Or.
