On 8/1/25 4:26 AM, chenyongch...@cestc.cn wrote: > Hi all: > Thank you for your attention to this issue. > > Initially, we noticed that bond_shift_load triggered > bond->bond_revalidate = true, which then led to bond_reconfigure -> > bond_entry_reset being executed. This ultimately caused > USERSPACE_INVALID_PORT_DROP to occur. > We believe that no packet loss should happen during traffic load > balancing. > > Later, we discovered that changes to the bond-rebalance-interval > setting could also lead to the same issue, which is why we reported the > problem. > > In fact, we are more concerned about the packet drops caused by > bond_shift_load. If the packet loss due to bond-rebalance-interval changes is > truly unavoidable, we can accept that. > Apologies for not describing the issue clearly enough earlier.
No worries. Thanks for the clarification! > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > chenyongch...@cestc.cn > > > *发件人:* Ilya Maximets <mailto:i.maxim...@ovn.org> > *发送时间:* 2025-08-01 04:48 > *收件人:* Aaron Conole <mailto:acon...@redhat.com> > *抄送:* i.maximets <mailto:i.maxim...@ovn.org>; dev > <mailto:d...@openvswitch.org>; mhou <mailto:m...@redhat.com>; chenyongchang > <mailto:chenyongch...@cestc.cn>; Adrian Moreno <mailto:amore...@redhat.com> > *主题:* Re: [ovs-dev] [PATCH] bond: Do not flag a rebalance when adjusting > time.【请注意,邮件由i.maximets....@gmail.com代发】 > On 7/31/25 8:41 PM, Aaron Conole wrote: > > Ilya Maximets <i.maxim...@ovn.org> writes: > > > >> On 7/30/25 4:47 PM, Aaron Conole via dev wrote: > >>> When adjusting bond parameters, any adjustment is considered > sufficient > >>> for triggering a rebalance. This is a very simplistic config update > >>> scheme that triggers complete rebalancing even if the time adjustment > >>> would move the next expiration out beyond the last calculated > expiration. > >>> > >>> For the interval parameter only, we can simply recalculate the expiry > >>> deadline and let the next bond_run() event do the rebalance if needed. > >>> Even if the recalculation would cause the deadline to have occurred in > >>> the past, it should execute on the next bond_run() anyway. This is > >>> still okay, as the rebalance interval timeout may not result in a > >>> full rebalance anyway. > >>> > >>> Reported-at: > https://www.mail-archive.com/ovs-discuss@openvswitch.org/msg10409.html > >>> Signed-off-by: Aaron Conole <acon...@redhat.com> > >>> --- > >>> ofproto/bond.c | 8 +++++++- > >>> 1 file changed, 7 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/ofproto/bond.c b/ofproto/bond.c > >>> index 3859ddca08..86e21607e5 100644 > >>> --- a/ofproto/bond.c > >>> +++ b/ofproto/bond.c > >>> @@ -459,8 +459,14 @@ bond_reconfigure(struct bond *bond, const struct > bond_settings *s) > >>> } > >>> > >>> if (bond->rebalance_interval != s->rebalance_interval) { > >>> + /* Recompute the next rebalance interval by moving the > next_rebalance > >>> + * to be offset by the new interval. Then let the rebalance > code > >>> + * trigger a rebalance based on the new details. > >> > >> This sentence doesn't seem necessary. > > > > Removed. > > > >>> In this case, if > >>> + * all that was updated is the rebalance interval, we can > skip > >>> + * triggering the rest of the port reconfigure mechanism. */ > >> > >> s/reconfigure mechanism/reconfiguration/ > > > > Ack. > > > >>> + int old_start_time = bond->next_rebalance - > bond->rebalance_interval; > >>> bond->rebalance_interval = s->rebalance_interval; > >>> - revalidate = true; > >>> + bond->next_rebalance = old_start_time + > bond->rebalance_interval; > >>> } > >> > >> We should likely still revalidate when the value goes to or from zero, > >> because that impacts if the bond considered balanced or not. Even if > >> that doesn't cause any problems today, it will inevitably be forgotten > >> in the future. > > > > That makes sense to me - I'll add a case for it. > > > >> Also, the subtraction may technically overflow if the next_rebalance is > >> not initialized. E.g. if it is not initialized yet when the bond is > >> going from active-backup to balance-tcp. Which may be fine, given > those > >> are signed integers, but still doesn't feel right. > > > > I'm trying to envision a scenario that would make this happen. For > > example, it gets reset when going from ab -> balance-{tcp,slb}. And > > that will be reconfigured later in this very code path as well (since we > > will call the bond_entry_reset if the config got changed for any config > > option other than the interval). I can put a guard on !next_rebalance > > and tell it to trigger revalidation then (since the very first execution > > or something should be initialized to '0' and obviously that could > > probably find a trigger). But still struggling to see a place where > > changing bond mode won't also force the whole thing to be reinitialized > > in bond_entry_reset() since revalidate will be 'true' via other means. > > It doesn't cause any issues, I agree, but it just didn't feel right. :) > Feel free to keep this code as-is. > > > > >> And this patch needs a Fixes tag. > > > > I didn't put one because I'm not sure if it's really a bugfix or an > > improvement. If you consider it one, I'll add the fixes tag. > > Looking closer to the code, there are two separate issues here: > > 1. Traffic is re-shuffled among bond members on the rebalance interval > change. > > 2. Traffic is being dropped for some time after the bond is reconfigured > until the next bond_run(). This happens if other config options > are changed as well. > > And, I think, the latter is what the original reporter was concerned > about. > This is a behavior introduced in: > > 586adfd047cb ("bond: Avoid deadlock while updating post recirculation > rules.") > > Before that change we would not update post-recirc rules while the > configuration is cleared and not re-constructed yet. But after the > 586adfd047cb fix, we're now calling the update after we zeroed out the > configuration, effectively just removing all the post recirculation rules > or the lb-output mappings. > > Saying that, maybe the real fix is to use bond_update_post_recirc_rules__ > with force instead of direct call to update_recirc_rules(). That will > ensure that the configuration is not blank and we actually have real > post-recirc rules / lb-output mapping at all times. I do not remember > why I went with the direct call in that fix though. > > What do you think? > > Not re-setting the config on timeout change is still needed, as that > avoids re-shuffling the traffic among bond members (issue 1). > > The behavior in point 1 was always this way, AFAIU, so that can be treated > as just an enhancement, I suppose. > > Best regards, Ilya Maximets. > > > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev