Hello,
I've been testing my LAG implemented with the DPDK eth_bond PMD. As part of my
fault tolerance testing, I want to ensure that if a link is flapping up and
down continuously, the impact to service is minimal. What I found is that the
LAG is rendered inoperable if a particular link (the current aggregator) is
flapping. Details below.
Setup:
- 4x10G X710 links in an 802.3ad LAG connected to a switch.
- Under normal operation, the LAG is steady and traffic is balanced across the links.
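For reference, an equivalent of this setup via the bonding API looks roughly
like this (error handling, device/queue configuration and start are omitted;
slave_ports[] stands in for the four X710 port ids):

#include <rte_ethdev.h>
#include <rte_eth_bond.h>

/* Sketch of an equivalent bond creation via the bonding API; error
 * handling, rte_eth_dev_configure()/queue setup and rte_eth_dev_start()
 * are omitted. */
int
create_lag(const uint16_t slave_ports[4], uint8_t socket_id)
{
    int bond_port;
    int i;

    bond_port = rte_eth_bond_create("eth_bond0", BONDING_MODE_8023AD,
                                    socket_id);
    if (bond_port < 0)
        return bond_port;

    for (i = 0; i < 4; i++)
        rte_eth_bond_slave_add(bond_port, slave_ports[i]);

    return bond_port; /* DPDK port id of the bonded device */
}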
Problem:
If I take down the switch-side link corresponding to the "aggregator" link in
the DPDK LAG, then bring it back up, every link in the LAG goes from
distributing to not distributing and back to distributing. This causes an
unnecessary loss of service.
A single link failure, regardless of whether or not it's the aggregator link,
should not change the state of the other links. Consider what would happen if
there were a hardware fault on that link, or its signal were bad: it's possible
for it to be stuck flapping up and down. This would lead to complete loss of
service on the LAG, despite there being three stable links remaining.
Analysis:
- The switch shows that the system id changes when the link flaps: it goes
from 00:00:00:00:00:00 to the aggregator's MAC. This is not good. Why is it
happening? Because by default we seem to be using the "AGG_BANDWIDTH"
selection algorithm, which is broken: it takes a slave index and uses it
directly as the index into the 802.3ad ports array, which is indexed by DPDK
port number. It should translate the slave index into a DPDK port number
using the slaves[] array (see the sketch after this list).
- Aside from the above, the default is supposed to be AGG_STABLE, according
to bond_mode_8023ad_conf_get_default. However, bond_mode_8023ad_conf_assign
does not actually copy the selection algorithm out of the conf, so the mode 4
state is left at 0, which happens to be AGG_BANDWIDTH.
- I fixed the above, but still faced two more issues:
1) The system id changes when the aggregator changes, which can trigger the
same disruption described above.
2) When the link fails, it is "deactivated" in the LAG via
bond_mode_8023ad_deactivate_slave. There is a block in there dedicated to the
case where the aggregator is disabled: it explicitly unselects each slave
sharing that aggregator. This causes them to fall back to the DETACHED state
in the mux machine -- i.e. they are no longer aggregating at all until the
state machine runs through the LACP exchange with the partner again.
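To make the indexing problem concrete, here is a toy standalone illustration
of the slave-index vs. DPDK-port-id confusion (this is not the PMD source,
just a model of it):

#include <stdint.h>
#include <stdio.h>

/* Toy model of the bug, not the PMD source: the 802.3ad per-port state
 * is kept in an array indexed by DPDK port id, while the selection
 * logic walks the active-slaves array, which is indexed
 * 0..slave_count-1.  Mixing the two up picks a bogus aggregator. */
int
main(void)
{
    /* The LAG's members are DPDK ports 4, 5, 6 and 7. */
    uint16_t slaves[] = { 4, 5, 6, 7 };

    /* Suppose the bandwidth comparison decides that slave index 2
     * should be the aggregator. */
    uint16_t best_idx = 2;

    /* Buggy: the slave index is used directly as a port id, i.e. we
     * end up pointing at port 2, which is not even in the LAG. */
    uint16_t wrong_agg_port = best_idx;

    /* Fixed: translate slave index -> DPDK port id via slaves[]. */
    uint16_t right_agg_port = slaves[best_idx];

    printf("buggy aggregator port id: %u, correct: %u\n",
           (unsigned)wrong_agg_port, (unsigned)right_agg_port);
    return 0;
}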
Possible fix:
1) Change bond_mode_8023ad_conf_assign to actually copy the selection
algorithm (sketched right after this list).
2) Ensure that all members of a LAG have the same system id (i.e. use the
LAG's MAC address).
3) Do not detach the other members when the aggregator's link state goes down.
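For (1), the change I have in mind is essentially a one-line addition to
bond_mode_8023ad_conf_assign. Roughly (existing body abbreviated and field
names from memory, so treat this as a sketch rather than a patch):

/* drivers/net/bonding/rte_eth_bond_8023ad.c (sketch, abbreviated) */
static void
bond_mode_8023ad_conf_assign(struct mode8023ad_private *mode4,
        struct rte_eth_bond_8023ad_conf *conf)
{
    mode4->fast_periodic_ms = conf->fast_periodic_ms;
    /* ... the rest of the existing timer/period copies ... */

    /* Proposed addition: without this, mode4->agg_selection stays 0,
     * i.e. AGG_BANDWIDTH, even though conf_get_default asked for
     * AGG_STABLE. */
    mode4->agg_selection = conf->agg_selection;
}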
Note:
1) We should fix AGG_BANDWIDTH and AGG_COUNT separately.
2) I can't see any reason why the system id should be equal to the MAC of the
aggregator. It's intended to represent the system to which the LAG belongs,
not the aggregator itself; the aggregator is represented by the operational
key. So it should be fine to use the LAG's MAC address, which is fixed at
init, as the system id for all possible aggregators.
3) I think not detaching is the correct approach (rough sketch below). Nothing
in my reading of 802.1Q or 802.1AX's LACP specification implies we should do
this. There is a blurb about changes in parameters that lead to a change in
aggregator forcing the unselected transition, but I don't think that needs to
apply here; I'm fairly certain it's talking about changing the operational
key, etc.
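To be clearer about what "not detaching" would look like, here is a rough
standalone sketch of the behaviour I have in mind when the aggregator's link
drops (illustrative types and names, not the PMD's actual ones):

#include <stdint.h>

/* Illustrative types and names only; in the PMD this state lives in the
 * 802.3ad per-port structures. */
enum selection { UNSELECTED = 0, SELECTED = 1 };

struct member {
    uint16_t port_id;            /* DPDK port id of this member       */
    uint16_t aggregator_port_id; /* port id of its current aggregator */
    enum selection selected;
};

struct lag {
    struct member members[8];
    uint16_t member_count;
};

/* Placeholder policy: pick any surviving member as the new aggregator. */
uint16_t
pick_new_aggregator(struct lag *lag, uint16_t failed_port)
{
    uint16_t i;

    for (i = 0; i < lag->member_count; i++)
        if (lag->members[i].port_id != failed_port)
            return lag->members[i].port_id;
    return failed_port; /* nothing left to choose */
}

/* What I'm proposing for the "aggregator's link went down" case. */
void
on_aggregator_link_down(struct lag *lag, uint16_t failed_port)
{
    uint16_t new_agg = pick_new_aggregator(lag, failed_port);
    uint16_t i;

    for (i = 0; i < lag->member_count; i++) {
        struct member *m = &lag->members[i];

        if (m->port_id == failed_port ||
            m->aggregator_port_id != failed_port)
            continue;

        /* Today the PMD effectively does
         *     m->selected = UNSELECTED;
         * which drops the mux machine to DETACHED and stops
         * distribution until LACP re-converges with the partner.
         *
         * Proposed: keep the member SELECTED and just move it to a
         * surviving aggregator.  With a fixed system id and an
         * unchanged operational key, the partner has no reason to
         * renegotiate. */
        m->aggregator_port_id = new_agg;
    }
}

The failed port itself would still go through the normal deactivate path; only
its surviving peers are treated differently.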
How does everyone feel about this? Am I crazy in requiring this
functionality? What about the proposed fix? Does it sound reasonable, or am I
going to break the state machine somehow?
Thanks,
Kyle