On 3/3/25 22:37, Ilya Maximets wrote:
> On 3/3/25 10:19, Martin Morgenstern via dev wrote:
>> This is a robustness improvement for a specific case where very long
>> ovn-controller iterations (about 20s) and long JSONRPC message queues
>> in the ovsdb client synchronization layer lead to unanswered echo
>> requests that in turn lead to connections being dropped.
>>
>> In such a case, the echo request is "stuck" in the incoming message
>> queue and might not be processed in time, because we process everything
>> in small batches.
>>
>> Thus, instead of waiting until we can process an incoming echo request,
>> we remember our last send activity and preemptively send an echo reply
>> when needed.
>>
>> Signed-off-by: Martin Morgenstern <[email protected]>
>> ---
> 
> Hi, Martin.  Thanks for the set!
> 

Hi Ilya,

thanks for the feedback!

> Regarding this particular change though, I don't think we should do that.
> Generating unsolicited echo replies defeats one of the reasons those probes
> exist in the first place.  We need to be able to check that the other
> side receives our messages, and if the other side just generates replies
> periodically, we lose that ability.  So, if the connection is half-fenced
> (packets can go one way, but not the other), we'll be sending echo requests
> and will receive echo replies even though the request or any other data
> is not able to reach the client.  I've seen such issues in real-world OVN
> setups.  This condition must be detectable with the probe.

I agree with you. Half-fencing is something we didn't consider when
writing this; your hint is very much appreciated. I'll drop this change
from v2 of the patch set.
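
To make the half-fencing argument concrete, here is a toy model in Python.
It is only a sketch: the function and parameter names are made up for
illustration and do not correspond to anything in the OVS or OVN code.

```python
def probe_detects_failure(peer_receives_requests: bool,
                          peer_sends_unsolicited_replies: bool) -> bool:
    """Return True if the probing side notices its requests are not arriving.

    Hypothetical model: the prober declares the connection dead only when
    no echo reply shows up at all.
    """
    # A reply reaches the prober either because the peer answered a request
    # it actually received, or because the peer emits replies on its own.
    reply_seen = peer_receives_requests or peer_sends_unsolicited_replies
    return not reply_seen


# Half-fenced link: our requests never reach the peer, but its packets
# still reach us.
print(probe_detects_failure(False, False))  # replies only on request
print(probe_detects_failure(False, True))   # peer also sends unsolicited replies
```

With replies sent only on request, the missing reply exposes the broken
direction. With unsolicited replies, the prober keeps seeing replies and
the failure is masked, which is the problem described above.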

> If the application can't process all the messages in time, it should
> set higher probe intervals, so it can reply in time.  If it cannot
> keep up with the messages and the incoming queue is ever-growing, that
> needs to be fixed on the application side as well, as it will never be up
> to date and will fall behind more and more over time.

We have been using rather high intervals already (60s) and were
initially reluctant to raise them even further, partly because we
couldn't set the probe intervals individually in our OpenStack LCM,
which meant we always had to raise them on both sides of the
connection.

As an alternative to the gratuitous echoes, we are now evaluating
asymmetric probe intervals, i.e., only raise the interval on the
southbound relay side (or even disable it) and leave the ovn-controller
side as-is (external_ids:ovn-remote-probe-interval).
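
The reasoning behind the asymmetric setup can be sketched in Python as
follows. This is a deliberately simplified model, not the actual
reconnect logic: it assumes a probe succeeds whenever the peer's reply
delay is shorter than the prober's interval, and the 90s and 1s reply
delays and the 180s relay interval are hypothetical numbers. Only the
60s interval comes from our setup.

```python
from typing import Optional


def reply_in_time(probe_interval_s: Optional[float],
                  peer_reply_delay_s: float) -> bool:
    """Simplified check: does the peer answer before the prober gives up?

    probe_interval_s=None models a disabled probe, which never times out.
    """
    if probe_interval_s is None:
        return True
    return peer_reply_delay_s < probe_interval_s


# Hypothetical worst-case reply delays, in seconds.
CONTROLLER_REPLY_DELAY = 90.0  # busy ovn-controller with a long queue
RELAY_REPLY_DELAY = 1.0        # southbound relay answers quickly

# Symmetric 60s: the relay's probe of the controller times out.
symmetric_ok = (reply_in_time(60.0, CONTROLLER_REPLY_DELAY)
                and reply_in_time(60.0, RELAY_REPLY_DELAY))

# Asymmetric: relay side raised to 180s, controller side stays at 60s.
asymmetric_ok = (reply_in_time(180.0, CONTROLLER_REPLY_DELAY)
                 and reply_in_time(60.0, RELAY_REPLY_DELAY))

print(symmetric_ok, asymmetric_ok)
```

The point is that only the side probing the slow peer needs the longer
interval. The ovn-controller side can keep its shorter interval and
still notice quickly when the relay goes away.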

The other patches in the set, together with the ones from [1], already
helped us to keep the queue from getting too big.

> I'll try to take a look at the other patches in the set later this week.
> 
> Best regards, Ilya Maximets.

Thanks a lot,
Martin

[1] <https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/421715.html>