On 29.03.2025 01:41, Ilya Maximets wrote:

On 3/28/25 11:00, Vladislav Odintsov wrote:


On 26.03.2025 15:30, Ilya Maximets wrote:


On 3/25/25 00:21, Vladislav Odintsov wrote:


With this commit, the rebalancing of hash entries between bond members
becomes less frequent.  If we know the speed of all bond members, we do
not move hash entries between them when the load difference is less than
1.5% of the minimum link speed among the bond members.

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422028.html
Suggested-by: Ilya Maximets <[email protected]>
Signed-off-by: Vladislav Odintsov <[email protected]>
---
v1 -> v2:
   - Addressed Ilya's, Eelco's, Mike's review comments.
   - Docs updated.
   - NEWS entry added.
---
  Documentation/topics/bonding.rst |  3 ++-
  NEWS                             |  4 ++++
  ofproto/bond.c                   | 25 +++++++++++++++++++------
  3 files changed, 25 insertions(+), 7 deletions(-)


Hi, Vladislav.  Thanks for v2!

Have you seen my reply to your observation about comparing speed to raw
byte counters here:
   https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422255.html

TL;DR: I think you're right, and we need to change the way we count the
load in order to be able to compare it to a link speed.

Best regards, Ilya Maximets.



Hi Ilya,

Thanks for the review!

I changed the condition before calling the bond_shift_load() function, so
now Bps is compared with Bps instead of B with Bps.  It seems to me that
this is good enough.

Do I understand you correctly that you proposed to change the movement
logic inside the bond_shift_load() function?  If so, I'm wondering why we
care whether we compare the total bytes carried during the last rebalance
interval or the average bytes per second during the same interval.  In
both cases we seem to fairly compare the same units, B vs B or Bps vs Bps,
so the current logic of this function looks correct to me.  And in both
cases the result should be the same.  Am I missing something?




Hmm.  It's complicated.  Yes, you're comparing apples to apples here, i.e.
this version compares speed to speed.  The problem, I think, is that our
total amount of bytes divided by the rebalance interval doesn't give a
correct or accurate enough approximation of the average speed.

Let's say we have a constant 100Bps traffic on the port.  And let's say
we have a standard 10sec rebalance interval.

After the first 10 seconds, our tx_bytes on this port will be 1000B.
Then we'll divide it by 2 for our ad-hoc exponentially weighted moving
average implementation.  So, it becomes 500B.  After the next interval
we'll have another 1000B plus our 500B from the previous interval, so
1500B in total.  That, divided by the rebalancing interval, gives us
150Bps "average speed", which is 50% higher than the actual average speed
on this interface.  Next time it will be 175, and so on.  In the long run
this converges to 100 * (1 + 1/2 + 1/4 + ...) = 200, which is 2x our actual
average speed.  So, at a constant packet rate, this way of measuring is not
accurate at all.
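
A minimal standalone sketch of that arithmetic (assuming a constant 100 B/s
load and a 10 s rebalance interval; the program and its variable names are
illustrative, not part of bond.c):

/* Simulates the ad-hoc "divide by 2" accumulation described above and
 * prints the apparent average speed at each rebalance run.  The value
 * starts at the real 100 B/s and converges to 200 B/s, i.e. 2x. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const uint64_t rate_bps = 100;       /* Real, constant load. */
    const uint64_t interval_sec = 10;    /* Rebalance interval. */
    uint64_t tx_bytes = 0;               /* Per-hash byte counter. */

    for (int run = 1; run <= 10; run++) {
        tx_bytes += rate_bps * interval_sec;   /* Traffic since last run. */
        printf("run %2d: apparent speed = %3" PRIu64 " B/s\n",
               run, tx_bytes / interval_sec);
        tx_bytes /= 2;                         /* Ad-hoc EWMA decay. */
    }
    return 0;
}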

I'm not doing the math for a non-constant speed, but I'd guess it will
not be accurate enough either; even if it averages better in some
scenarios, it might be much worse in others.  And I'm not sure we can
make this style of calculation accurate enough in all cases just by
choosing a different divisor.

All in all, I think, we need to change the way we calculate the load in
order to have a more or less accurate idea about the average speed.

The Exponential Moving Average from lib/mov-avg.h should give us a much
more accurate result.  The problem, however, is that we can't easily
shift the load if we calculate the average on a sum.  So, we'll need to
calculate an average speed per hash and then compare sums of averages
with the link speed.
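
For illustration only, a generic per-hash EMA could look like the sketch
below.  This is just the textbook formula, not the actual lib/mov-avg.h
API, and 'alpha' is an arbitrary example weight:

/* Generic exponential moving average of a hash's transmit speed.
 * Each rebalance run feeds the per-interval speed sample into the average;
 * a member's load would then be the sum of its entries' averages. */
#include <stdint.h>

struct hash_load {
    double alpha;     /* Smoothing factor, 0 < alpha <= 1 (example value). */
    double avg_bps;   /* Smoothed average transmit speed of this hash. */
};

static void
hash_load_update(struct hash_load *h, uint64_t tx_bytes, uint64_t interval_sec)
{
    double sample_bps = (double) tx_bytes / interval_sec;

    /* Textbook EMA step: new = alpha * sample + (1 - alpha) * old. */
    h->avg_bps = h->alpha * sample_bps + (1 - h->alpha) * h->avg_bps;
}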

Does that make sense?

Best regards, Ilya Maximets.

Hi, Ilya.

I need to clarify: do we divide each hash's tx_bytes by 2 to achieve the
following:

1. keep tx_bytes from growing without bound and overflowing uint64_t;
2. at the same time "save" the knowledge of the previous load in order to
   avoid frequent unnecessary shifts.

Is that correct?

Also, I'm not sure I fully understood your proposed approach...  Do you
want to switch from tx_bytes as a cumulative metric to a new "avg tx per
second" metric?  I guess this can be done approximately in the following
order:

1. calculate the average speed per second for each hash
   (tx_bytes / rebalance_interval) on each rebalance run;
2. store this average speed inside the hash_entry;
3. sum the hash_entries' speeds into a "member average speed";
4. do all calculations in bond_rebalance() in an "avg bps" manner instead
   of "total bytes": here we can use the same algorithm in
   bond_shift_load(), just re-orienting it to the bps variable.

Something like this (note: the diff is based on the current patch; it is
not tested and is only an example of the idea):

diff --git a/ofproto/bond.c b/ofproto/bond.c
index e2cbdd5ec..119baef4e 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -63,6 +63,8 @@ struct bond_entry {
     struct bond_member *member; /* Assigned member, NULL if unassigned. */
     uint64_t tx_bytes           /* Count of bytes recently transmitted. */
         OVS_GUARDED_BY(rwlock);
+    /* Average bytes per second recently transmitted. */
+    uint64_t tx_bps_avg OVS_GUARDED_BY(rwlock);
     struct ovs_list list_node;  /* In bond_member's 'entries' list. */
 
     /* Recirculation.
@@ -95,7 +97,7 @@ struct bond_member {
     /* Rebalancing info.  Used only by bond_rebalance(). */
     struct ovs_list bal_node;   /* In bond_rebalance()'s 'bals' list. */
     struct ovs_list entries;    /* 'struct bond_entry's assigned here. */
-    uint64_t tx_bytes;          /* Sum across 'tx_bytes' of entries. */
+    uint64_t tx_bps_avg;        /* Sum across 'tx_bps_avg' of entries. */
 };
 
 /* A bond, that is, a set of network devices grouped to improve performance or
@@ -1163,8 +1165,8 @@ log_bals(struct bond *bond, const struct ovs_list *bals)
             if (ds.length) {
                 ds_put_char(&ds, ',');
             }
-            ds_put_format(&ds, " %s %"PRIu64"kB",
-                          member->name, member->tx_bytes / 1024);
+            ds_put_format(&ds, " %s %"PRIu64"kB/s",
+                          member->name, member->tx_bps_avg / 1024);
             if (!member->enabled) {
                 ds_put_cstr(&ds, " (disabled)");
@@ -1195,19 +1197,19 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to)
 {
     struct bond_member *from = hash->member;
     struct bond *bond = from->bond;
-    uint64_t delta = hash->tx_bytes;
+    uint64_t delta = hash->tx_bps_avg;
 
     VLOG_DBG("bond %s: shift %"PRIu64"kB of load (with hash %"PRIdPTR") "
              "from %s to %s (now carrying %"PRIu64"kB and "
              "%"PRIu64"kB load, respectively)",
              bond->name, delta / 1024, hash - bond->hash,
             from->name, to->name,
-             (from->tx_bytes - delta) / 1024,
-             (to->tx_bytes + delta) / 1024);
+             (from->tx_bps_avg - delta) / 1024,
+             (to->tx_bps_avg + delta) / 1024);
 
     /* Shift load away from 'from' to 'to'. */
-    from->tx_bytes -= delta;
-    to->tx_bytes += delta;
+    from->tx_bps_avg -= delta;
+    to->tx_bps_avg += delta;
 
     /* Arrange for flows to be revalidated. */
     hash->member = to;
@@ -1222,15 +1224,16 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to)
  * The list of entries is sorted in descending order of load.  This allows us
  * to collect subset of entries with accumulated load close to ideal. */
 static size_t
-choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
+choose_entries_to_migrate(const struct bond_member *from,
+                          uint64_t to_tx_bps_avg,
                           struct bond_entry **to_migrate)
     OVS_REQ_WRLOCK(rwlock)
 {
     struct bond_entry *e;
     /* Note, the ideal traffic is the mid point between 'from' and 'to'.
      * This value does not change by rebalancing. */
-    uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes) / 2;
-    uint64_t ideal_delta = ideal_tx_bytes - to_tx_bytes;
+    uint64_t ideal_tx_bps_avg = (from->tx_bps_avg + to_tx_bps_avg) / 2;
+    uint64_t ideal_delta = ideal_tx_bps_avg - to_tx_bps_avg;
     uint64_t delta = 0;         /* The amount to rebalance. */
     uint64_t new_low;           /* The lower bandwidth between 'to' and 'from'
                                  * after rebalancing. */
@@ -1244,10 +1247,10 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
     }
 
     LIST_FOR_EACH (e, list_node, &from->entries) {
-        if (delta + e->tx_bytes <= ideal_delta) {
+        if (delta + e->tx_bps_avg <= ideal_delta) {
             /* Take next entry if amount to rebalance will not exceed ideal. */
             to_migrate[cnt++] = e;
-            delta += e->tx_bytes;
+            delta += e->tx_bps_avg;
         }
         if (ideal_delta - delta < migration_threshold) {
             /* Stop collecting hashes if we're close enough to the ideal value
@@ -1263,13 +1266,13 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
         ASSIGN_CONTAINER(closest, ovs_list_back(&from->entries), list_node);
-        delta = closest->tx_bytes;
+        delta = closest->tx_bps_avg;
         to_migrate[cnt++] = closest;
     }
 
-    new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
-    if ((new_low > to_tx_bytes) &&
-        (new_low - to_tx_bytes >= migration_threshold)) {
+    new_low = MIN(from->tx_bps_avg - delta, to_tx_bps_avg + delta);
+    if ((new_low > to_tx_bps_avg) &&
+        (new_low - to_tx_bps_avg >= migration_threshold)) {
        /* Only rebalance if the new 'low' is closer to to the mid point and the
         * improvement of traffic deviation from the ideal split exceeds 10%
         * (migration threshold).
@@ -1290,7 +1293,7 @@ insert_bal(struct ovs_list *bals, struct bond_member *member)
     struct bond_member *pos;
 
     LIST_FOR_EACH (pos, bal_node, bals) {
-        if (member->tx_bytes > pos->tx_bytes) {
+        if (member->tx_bps_avg > pos->tx_bps_avg) {
             break;
         }
     }
@@ -1356,7 +1359,7 @@ bond_rebalance(struct bond *bond)
     /* Add each bond_entry to its member's 'entries' list.
      * Compute each member's tx_bytes as the sum of its entries' tx_bytes. */
     HMAP_FOR_EACH (member, hmap_node, &bond->members) {
-        member->tx_bytes = 0;
+        member->tx_bps_avg = 0;
         ovs_list_init(&member->entries);
     }
 
@@ -1368,10 +1371,12 @@ bond_rebalance(struct bond *bond)
     /* Iteration over sorted bond hashes will give us sorted 'entries'. */
     for (int i = 0; i < BOND_BUCKETS; i++) {
         e = hashes[i];
+        uint64_t tx_bps_avg = e->tx_bytes / rebalance_interval_sec;
         if (e->member && e->tx_bytes) {
-            e->member->tx_bytes += e->tx_bytes;
+            e->member->tx_bps_avg += tx_bps_avg + e->tx_bps_avg;
             ovs_list_push_back(&e->member->entries, &e->list_node);
         }
+        e->tx_bps_avg = tx_bps_avg;
     }
 
     /* Add enabled members to 'bals' in descending order of tx_bytes.
@@ -1402,8 +1407,8 @@ bond_rebalance(struct bond *bond)
         struct bond_member *to
            = bond_member_from_bal_node(ovs_list_back(&bals));
 
-        overload = (from->tx_bytes - to->tx_bytes) / rebalance_interval_sec;
-        if (overload < (to->tx_bytes / rebalance_interval_sec) >> 5 ||
+        overload = from->tx_bps_avg - to->tx_bps_avg;
+        if (overload < to->tx_bps_avg >> 5 ||
             overload * 8 / 1000000 < overload_threshold) {
             /* The extra average load per second on 'from' (and all less-loaded
              * members), compared to that of 'to' (the least-loaded member), is
@@ -1414,7 +1419,7 @@ bond_rebalance(struct bond *bond)
 
         /* 'from' is carrying significantly more load than 'to'.  Pick hashes
          * to move from 'from' to 'to'. */
-        size_t cnt = choose_entries_to_migrate(from, to->tx_bytes, hashes);
+        size_t cnt = choose_entries_to_migrate(from, to->tx_bps_avg, hashes);
         if (!cnt) {
             /* Can't usefully migrate anything away from 'from'.
              * Don't reconsider it. */
@@ -1440,11 +1445,9 @@ bond_rebalance(struct bond *bond)
         rebalanced = true;
     }
 
-    /* Implement exponentially weighted moving average.  A weight of 1/2 causes
-     * historical data to decay to <1% in 7 rebalancing runs.  1,000,000 bytes
-     * take 20 rebalancing runs to decay to 0 and get deleted entirely. */
+    /* Zeroize tx_bytes so hash AVG speed can be calculated accurately. */
    for (e = &bond->hash[0]; e <= &bond->hash[BOND_MASK]; e++) {
-        e->tx_bytes /= 2;
+        e->tx_bytes = 0;
     }
 
     if (rebalanced) {
@@ -1455,8 +1458,8 @@ bond_rebalance(struct bond *bond)
             if (++i > 1) {
                 ds_put_cstr(&member_stats, " and ");
             }
-            ds_put_format(&member_stats, "%s:%"PRIu64"kB", member->name,
-                          member->tx_bytes / 1024);
+            ds_put_format(&member_stats, "%s:%"PRIu64"kB/s", member->name,
+                          member->tx_bps_avg / 1024);
         }
         VLOG_INFO("bond %s: rebalanced (now carrying: %s)", bond->name,
                   ds_cstr(&member_stats));

--
regards,
Vladislav Odintsov