Sean Hefty wrote:
> I've read back over this description a few times, and I still don't
> fully grok the problem. Can you clarify if the following sequence is
> what's happening?
> 1. The node has joined the multicast group. Meaning that the SA has
> routed multicast traffic to the node.
> 2. You take down the link of the switch port that connects the node. Is
> this done via a program?
> 3. The port is brought back online. This generates a PORT_ACTIVE event,
> but the previous event was also PORT_ACTIVE.
> 4. ipoib leaves the group.
> 5. ipoib re-joins the group.
> 6. The multicast module isn't aware that any errors have occurred on the
> multicast group, so simply completes the join request at step 5 without
> SA involvement.
> If I'm understanding this, somewhere in the above sequence the multicast
> routing to this node is lost. Either the SA removed the node from the
> group, or the switch lost its routing tables, or ...?
Indeed, I am taking the switch link down via a program.
Now, in this case there was --no-- previous event; when the port was
brought back online there was a PORT_ACTIVE event (it's a driver issue
which we are looking into). However, from the viewpoint of the SA there
was a "GID out" event, so the HCA port was dropped from the multicast
group and the multicast routing (spanning tree, MFT configuration, etc.)
was computed without this port being included. This is the ipoib logging
of what happens from its perspective (I have added the event number to
the "port state change event" print):
ib0: Port state change event 9
ib0: Flushing ib0
ib0: flushing
ib0: downing ib_dev
ib0: stopping multicast thread
ib0: flushing multicast list
ib0: leaving MGID ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: leaving MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff
ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:ffff:ffff
ib0: starting multicast thread
ib0: restarting multicast task
ib0: stopping multicast thread
ib0: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: starting multicast thread
ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff
ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0)
ib0: Created ah c504b7a0
ib0: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV c504b7a0, LID 0xc000, SL 0
ib0: joining MGID ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: join completion for ff12:401b:ffff:0000:0000:0000:0000:0001 (status 0)
ib0: Created ah c504ba20
ib0: MGID ff12:401b:ffff:0000:0000:0000:0000:0001 AV c504ba20, LID 0xc001, SL 0
ib0: successfully joined all multicast groups
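To make the mismatch concrete, here is a toy Python model of the sequence. It is only a sketch, not ipoib or multicast module code: the classes `ToySA` and `ToyMulticastModule`, the label `port-1`, and the assumption that a leave followed by a re-join is absorbed by a host-side cache are all made up for illustration.

```python
class ToySA:
    """Toy stand-in for the SA's view of multicast membership."""

    def __init__(self):
        self.members = {}  # MGID -> set of member ports

    def join(self, mgid, port):
        self.members.setdefault(mgid, set()).add(port)

    def gid_out(self, port):
        # A port that disappears is dropped from every group it was in.
        for ports in self.members.values():
            ports.discard(port)

    def is_member(self, mgid, port):
        return port in self.members.get(mgid, set())


class ToyMulticastModule:
    """Toy host-side join cache (assumed behaviour, not the real module)."""

    def __init__(self, sa, port):
        self.sa = sa
        self.port = port
        self.cached = set()   # MGIDs the host believes are joined at the SA
        self.sa_requests = 0  # joins that were actually sent to the SA

    def join(self, mgid):
        if mgid not in self.cached:
            self.sa.join(mgid, self.port)
            self.sa_requests += 1
            self.cached.add(mgid)
        return 0  # status 0, like the join completions in the log

    def leave(self, mgid):
        # Assumption: the leave is absorbed locally and the cache entry stays.
        pass


def flap_and_rejoin(mgid="ff12:401b:ffff:0000:0000:0000:ffff:ffff",
                    port="port-1"):
    sa = ToySA()
    mod = ToyMulticastModule(sa, port)
    mod.join(mgid)           # initial join reaches the SA
    sa.gid_out(port)         # link flap: the SA sees "GID out"
    mod.leave(mgid)          # ipoib leaves the group
    status = mod.join(mgid)  # ipoib re-joins; served from the cache
    return status, mod.sa_requests, sa.is_member(mgid, port)
```

In this model `flap_and_rejoin()` reports a successful join (status 0) with only one request ever sent to the SA, while the SA no longer lists the port as a member of the group.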
> I'm also trying to understand how the problem would apply to a different
> setup:
> node 1 <-> switch A <-> switch B <-> switch C <-> SA
> Suppose the same link down/up occurred between switch A and switch B.
> What happens to the multicast members to the left of switch B? Will
> node 1 see a PORT_ACTIVE event in this case as well?
The members of a multicast group are only HCA ports. Indeed, join/leave
requests from members cause the SA to trigger the SM to recompute the
multicast routing. However, there are other causes, such as a port going
down anywhere in the fabric: if it's an HCA port, it is dropped from all
the groups it is a member of, and if it's a switch port, all the
affected unicast AND multicast routing must be recomputed by the SM.
The host only sees port up/down events caused by changes in the link
state of its local port or of the port connected to it through the
cable.
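The rule about who sees link events can be sketched in a few lines of Python for the chain in the question. This is only an illustration: the device names are plain labels, and `devices_seeing_event` and `host_sees_event` are hypothetical helpers, not part of any IB stack.

```python
# Chain from the question; each tuple is one cable between two devices.
CHAIN = [
    ("node 1", "switch A"),
    ("switch A", "switch B"),
    ("switch B", "switch C"),
    ("switch C", "SA"),
]


def devices_seeing_event(links, flapped):
    """Only the two ends of the flapped cable see a port up/down event."""
    a, b = flapped
    if (a, b) not in links and (b, a) not in links:
        raise ValueError(f"no such link: {flapped!r}")
    return {a, b}


def host_sees_event(host, links, flapped):
    """True if the host's own port is one end of the flapped cable."""
    return host in devices_seeing_event(links, flapped)
```

With the link between switch A and switch B flapping, `host_sees_event("node 1", CHAIN, ("switch A", "switch B"))` is false: node 1 gets no PORT_ACTIVE event, although the SM still has to recompute the affected routing.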
Or.