On 29.03.2025 01:41, Ilya Maximets wrote:
On 3/28/25 11:00, Vladislav Odintsov wrote:
On 26.03.2025 15:30, Ilya Maximets wrote:
On 3/25/25 00:21, Vladislav Odintsov wrote:

With this commit the rebalance of hash entries between bonding members
becomes less frequent.  If we know the speed of all bond members, we do
not move hash entries between them if the load difference is less than
1.5% of the minimum link speed of the bond members.

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422028.html
Suggested-by: Ilya Maximets <[email protected]>
Signed-off-by: Vladislav Odintsov <[email protected]>
---
v1 -> v2:
  - Addressed Ilya's, Eelco's, Mike's review comments.
  - Docs updated.
  - NEWS entry added.
---
 Documentation/topics/bonding.rst |  3 ++-
 NEWS                             |  4 ++++
 ofproto/bond.c                   | 25 +++++++++++++++++++------
 3 files changed, 25 insertions(+), 7 deletions(-)

Hi, Vladislav.  Thanks for v2!

Have you seen my reply to your observation about comparing speed to raw
byte counters here:
https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422255.html

TL;DR: I think you're right and we need to change the way we count the
load in order to be able to compare the load to a link speed.

Best regards, Ilya Maximets.

Hi Ilya,

Thanks for the review!  I changed the condition before calling the
bond_shift_load() function, so now Bps is compared with Bps instead of
B with Bps.  It seems to me that this is good enough.

Do I understand you correctly that you proposed to change the movement
logic inside the bond_shift_load() function?  If yes, I'm wondering why
we care whether we compare the total bytes carried during the last
rebalance interval or the average bytes per second over the same
interval.  In both cases we seem to be comparing the same units fairly:
B vs B or Bps vs Bps, so the current logic of this function looks
correct to me, and in both cases the result should be the same.  Am I
missing something?

Hmm.  It's complicated.  Yes, you're comparing apples to apples here,
i.e. this version compares speed to speed.  The problem, I think, is
that our total amount of bytes divided by the rebalance interval doesn't
give a correct or accurate enough approximation of the average speed.

Let's say we have constant 100 Bps traffic on the port, and a standard
10 sec rebalance interval.  After the first 10 seconds, our tx_bytes on
this port will be 1000 B.  Then we'll divide it by 2 for our ad-hoc
exponentially weighted moving average implementation, so it becomes
500 B.  After the next interval we'll have another 1000 B plus our 500 B
from the previous interval, so 1500 B in total.  That, divided by the
rebalancing interval, gives us a 150 Bps "average speed", which is 50%
higher than the actual average speed on this interface.  Next time it
will be 175, and so on.  In the long run this converges to
100 * (1 + 1/2 + 1/4 + ...) = 200, which is 2x our actual average speed.
So, on a constant packet rate, this way of measuring is not accurate at
all.  I'm not doing the math for a non-constant speed, but I'd guess it
will not be accurate enough either; even if it averages better in some
scenarios, it might be much worse in others.  And I'm not sure we can
make this style of calculation accurate enough in all cases just by
choosing a different divisor.

All in all, I think we need to change the way we calculate the load in
order to have a more or less accurate idea of the average speed.  The
Exponential Moving Average from lib/mov-avg.h should give us a much more
accurate result.  The problem, however, is that we can't easily shift
the load if we calculate the average on a sum.
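To make the arithmetic above concrete, here is a minimal standalone
sketch (not OVS code; the constants and variable names are made up for
illustration) that simulates just the "divide by two" step on a constant
100 Bps flow with a 10 sec rebalance interval:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    const uint64_t rate_bps = 100;      /* Constant traffic, bytes per second. */
    const uint64_t interval_sec = 10;   /* Rebalance interval. */
    uint64_t tx_bytes = 0;              /* Per-hash counter, as in struct bond_entry. */

    for (int run = 1; run <= 10; run++) {
        tx_bytes += rate_bps * interval_sec;    /* Bytes accumulated this interval. */
        printf("run %2d: estimated %3"PRIu64" Bps, actual %"PRIu64" Bps\n",
               run, tx_bytes / interval_sec, rate_bps);
        tx_bytes /= 2;                          /* The ad-hoc "EWMA" halving step. */
    }
    return 0;
}

On a constant rate the printed estimate climbs 100, 150, 175, ... towards
200 Bps, i.e. twice the real rate, which is exactly the drift described
above.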
So, we'll need to calculate an average speed per hash and then compare
sums of averages with the link speed.  Does that make sense?

Best regards, Ilya Maximets.

Hi, Ilya.

I need to clarify: do we divide each hash by 2 in order to:

1. keep tx_bytes decaying so that it does not overflow uint64_t;
2. at the same time "save" the knowledge of the previous load in order
   to avoid frequent unnecessary shifts.

Is that correct?

Also, I'm not sure I fully understood your proposed approach...  Do you
want to switch from tx_bytes as a cumulative metric to a new "avg tx per
second" metric?  I guess this can be done approximately in the following
order:

1. calculate the average speed per second for each hash
   (tx_bytes / rebalance_interval) on each rebalance run;
2. store this average speed inside the hash entry;
3. sum the hash entries' speeds into a "member average speed";
4. do all calculations in bond_rebalance() in "avg bps" terms instead of
   "total bytes": here we can use the same algorithm in bond_shift_load(),
   just re-orienting it to the bps variable.

Something like this (note: the diff is based on the current patch; not
tested, only as an example of the idea; a small standalone sketch of the
same idea also follows after the diff):

diff --git a/ofproto/bond.c b/ofproto/bond.c
index e2cbdd5ec..119baef4e 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -63,6 +63,8 @@ struct bond_entry {
     struct bond_member *member; /* Assigned member, NULL if unassigned. */
     uint64_t tx_bytes           /* Count of bytes recently transmitted. */
         OVS_GUARDED_BY(rwlock);
+    /* Average bytes per second recently transmitted. */
+    uint64_t tx_bps_avg OVS_GUARDED_BY(rwlock);
     struct ovs_list list_node;  /* In bond_member's 'entries' list. */
 
     /* Recirculation.
@@ -95,7 +97,7 @@ struct bond_member {
     /* Rebalancing info.  Used only by bond_rebalance(). */
     struct ovs_list bal_node;   /* In bond_rebalance()'s 'bals' list. */
     struct ovs_list entries;    /* 'struct bond_entry's assigned here. */
-    uint64_t tx_bytes;          /* Sum across 'tx_bytes' of entries. */
+    uint64_t tx_bps_avg;        /* Sum across 'tx_bps_avg' of entries. */
 };
 
 /* A bond, that is, a set of network devices grouped to improve performance or
@@ -1163,8 +1165,8 @@ log_bals(struct bond *bond, const struct ovs_list *bals)
             if (ds.length) {
                 ds_put_char(&ds, ',');
             }
-            ds_put_format(&ds, " %s %"PRIu64"kB",
-                          member->name, member->tx_bytes / 1024);
+            ds_put_format(&ds, " %s %"PRIu64"kB/s",
+                          member->name, member->tx_bps_avg / 1024);
 
             if (!member->enabled) {
                 ds_put_cstr(&ds, " (disabled)");
@@ -1195,19 +1197,19 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to)
 {
     struct bond_member *from = hash->member;
     struct bond *bond = from->bond;
-    uint64_t delta = hash->tx_bytes;
+    uint64_t delta = hash->tx_bps_avg;
 
     VLOG_DBG("bond %s: shift %"PRIu64"kB of load (with hash %"PRIdPTR") "
              "from %s to %s (now carrying %"PRIu64"kB and "
              "%"PRIu64"kB load, respectively)",
              bond->name, delta / 1024, hash - bond->hash,
              from->name, to->name,
-             (from->tx_bytes - delta) / 1024,
-             (to->tx_bytes + delta) / 1024);
+             (from->tx_bps_avg - delta) / 1024,
+             (to->tx_bps_avg + delta) / 1024);
 
     /* Shift load away from 'from' to 'to'. */
-    from->tx_bytes -= delta;
-    to->tx_bytes += delta;
+    from->tx_bps_avg -= delta;
+    to->tx_bps_avg += delta;
 
     /* Arrange for flows to be revalidated. */
     hash->member = to;
@@ -1222,15 +1224,16 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to)
  * The list of entries is sorted in descending order of load.  This allows us
  * to collect subset of entries with accumulated load close to ideal.
  */
 static size_t
-choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
+choose_entries_to_migrate(const struct bond_member *from,
+                          uint64_t to_tx_bps_avg,
                           struct bond_entry **to_migrate)
     OVS_REQ_WRLOCK(rwlock)
 {
     struct bond_entry *e;
     /* Note, the ideal traffic is the mid point between 'from' and 'to'.
      * This value does not change by rebalancing. */
-    uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes) / 2;
-    uint64_t ideal_delta = ideal_tx_bytes - to_tx_bytes;
+    uint64_t ideal_tx_bps_avg = (from->tx_bps_avg + to_tx_bps_avg) / 2;
+    uint64_t ideal_delta = ideal_tx_bps_avg - to_tx_bps_avg;
     uint64_t delta = 0;         /* The amount to rebalance. */
     uint64_t new_low;           /* The lower bandwidth between 'to' and 'from'
                                  * after rebalancing. */
@@ -1244,10 +1247,10 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
     }
 
     LIST_FOR_EACH (e, list_node, &from->entries) {
-        if (delta + e->tx_bytes <= ideal_delta) {
+        if (delta + e->tx_bps_avg <= ideal_delta) {
             /* Take next entry if amount to rebalance will not exceed ideal. */
             to_migrate[cnt++] = e;
-            delta += e->tx_bytes;
+            delta += e->tx_bps_avg;
         }
         if (ideal_delta - delta < migration_threshold) {
             /* Stop collecting hashes if we're close enough to the ideal value
@@ -1263,13 +1266,13 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
         ASSIGN_CONTAINER(closest, ovs_list_back(&from->entries), list_node);
 
-        delta = closest->tx_bytes;
+        delta = closest->tx_bps_avg;
         to_migrate[cnt++] = closest;
     }
 
-    new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
-    if ((new_low > to_tx_bytes) &&
-        (new_low - to_tx_bytes >= migration_threshold)) {
+    new_low = MIN(from->tx_bps_avg - delta, to_tx_bps_avg + delta);
+    if ((new_low > to_tx_bps_avg) &&
+        (new_low - to_tx_bps_avg >= migration_threshold)) {
         /* Only rebalance if the new 'low' is closer to the mid point and the
          * improvement of traffic deviation from the ideal split exceeds 10%
          * (migration threshold).
@@ -1290,7 +1293,7 @@ insert_bal(struct ovs_list *bals, struct bond_member *member)
     struct bond_member *pos;
 
     LIST_FOR_EACH (pos, bal_node, bals) {
-        if (member->tx_bytes > pos->tx_bytes) {
+        if (member->tx_bps_avg > pos->tx_bps_avg) {
             break;
         }
     }
@@ -1356,7 +1359,7 @@ bond_rebalance(struct bond *bond)
     /* Add each bond_entry to its member's 'entries' list.
      * Compute each member's tx_bytes as the sum of its entries' tx_bytes. */
     HMAP_FOR_EACH (member, hmap_node, &bond->members) {
-        member->tx_bytes = 0;
+        member->tx_bps_avg = 0;
         ovs_list_init(&member->entries);
     }
 
@@ -1368,10 +1371,12 @@ bond_rebalance(struct bond *bond)
     /* Iteration over sorted bond hashes will give us sorted 'entries'. */
     for (int i = 0; i < BOND_BUCKETS; i++) {
         e = hashes[i];
+        uint64_t tx_bps_avg = e->tx_bytes / rebalance_interval_sec;
         if (e->member && e->tx_bytes) {
-            e->member->tx_bytes += e->tx_bytes;
+            e->member->tx_bps_avg += tx_bps_avg + e->tx_bps_avg;
             ovs_list_push_back(&e->member->entries, &e->list_node);
         }
+        e->tx_bps_avg = tx_bps_avg;
     }
 
     /* Add enabled members to 'bals' in descending order of tx_bytes.
@@ -1402,8 +1407,8 @@ bond_rebalance(struct bond *bond)
         struct bond_member *to
             = bond_member_from_bal_node(ovs_list_back(&bals));
 
-        overload = (from->tx_bytes - to->tx_bytes) / rebalance_interval_sec;
-        if (overload < (to->tx_bytes / rebalance_interval_sec) >> 5 ||
+        overload = from->tx_bps_avg - to->tx_bps_avg;
+        if (overload < to->tx_bps_avg >> 5 ||
             overload * 8 / 1000000 < overload_threshold) {
             /* The extra average load per second on 'from' (and all less-loaded
              * members), compared to that of 'to' (the least-loaded member), is
@@ -1414,7 +1419,7 @@ bond_rebalance(struct bond *bond)
 
         /* 'from' is carrying significantly more load than 'to'.  Pick hashes
          * to move from 'from' to 'to'. */
-        size_t cnt = choose_entries_to_migrate(from, to->tx_bytes, hashes);
+        size_t cnt = choose_entries_to_migrate(from, to->tx_bps_avg, hashes);
         if (!cnt) {
             /* Can't usefully migrate anything away from 'from'.
              * Don't reconsider it. */
@@ -1440,11 +1445,9 @@ bond_rebalance(struct bond *bond)
         rebalanced = true;
     }
 
-    /* Implement exponentially weighted moving average.  A weight of 1/2 causes
-     * historical data to decay to <1% in 7 rebalancing runs.  1,000,000 bytes
-     * take 20 rebalancing runs to decay to 0 and get deleted entirely. */
+    /* Zeroize tx_bytes so hash AVG speed can be calculated accurately. */
     for (e = &bond->hash[0]; e <= &bond->hash[BOND_MASK]; e++) {
-        e->tx_bytes /= 2;
+        e->tx_bytes = 0;
     }
 
     if (rebalanced) {
@@ -1455,8 +1458,8 @@ bond_rebalance(struct bond *bond)
             if (++i > 1) {
                 ds_put_cstr(&member_stats, " and ");
             }
-            ds_put_format(&member_stats, "%s:%"PRIu64"kB", member->name,
-                          member->tx_bytes / 1024);
+            ds_put_format(&member_stats, "%s:%"PRIu64"kB/s", member->name,
+                          member->tx_bps_avg / 1024);
         }
         VLOG_INFO("bond %s: rebalanced (now carrying: %s)", bond->name,
                   ds_cstr(&member_stats));

--
regards,
Vladislav Odintsov
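As a self-contained illustration of the idea discussed above (an average
speed kept per hash, with each member's load taken as the sum of its
hashes' averages), here is a minimal sketch.  It is not OVS code and does
not use the lib/mov-avg.h API; all names and constants are hypothetical,
and the smoothing is a plain exponential moving average:

#include <stdint.h>
#include <stdio.h>

#define N_HASHES 4

struct hash_entry {
    uint64_t tx_bytes;   /* Bytes since the last rebalance run. */
    double ema_bps;      /* Smoothed average rate of this hash. */
    int member;          /* Index of the member currently carrying it. */
};

/* Update one hash's average from the bytes seen during 'interval_sec'
 * seconds.  'alpha' controls how quickly old history decays. */
static void
hash_update_ema(struct hash_entry *e, double alpha, double interval_sec)
{
    double sample_bps = e->tx_bytes / interval_sec;

    e->ema_bps = alpha * sample_bps + (1.0 - alpha) * e->ema_bps;
    e->tx_bytes = 0;   /* Start counting the next interval from zero. */
}

/* A member's load is the sum of the averages of its hashes, so shifting a
 * hash moves its whole average from one sum to the other. */
static double
member_load_bps(const struct hash_entry *hashes, int member)
{
    double sum = 0.0;

    for (int i = 0; i < N_HASHES; i++) {
        if (hashes[i].member == member) {
            sum += hashes[i].ema_bps;
        }
    }
    return sum;
}

int main(void)
{
    struct hash_entry hashes[N_HASHES] = {
        { .tx_bytes = 4000, .member = 0 },
        { .tx_bytes = 1000, .member = 0 },
        { .tx_bytes =  500, .member = 1 },
        { .tx_bytes =  500, .member = 1 },
    };

    for (int i = 0; i < N_HASHES; i++) {
        hash_update_ema(&hashes[i], 0.5, 10.0);
    }
    printf("member 0: %.0f Bps, member 1: %.0f Bps\n",
           member_load_bps(hashes, 0), member_load_bps(hashes, 1));

    hashes[0].member = 1;   /* Shift the busiest hash; the sums follow it. */
    printf("member 0: %.0f Bps, member 1: %.0f Bps\n",
           member_load_bps(hashes, 0), member_load_bps(hashes, 1));
    return 0;
}

Because a member's load here is just a sum of per-hash averages, shifting
a hash moves its whole average between the two sums, and the sums are in
Bps, so they can be compared directly against the link speed.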
