On 14 Nov 2025, at 13:24, Ilya Maximets wrote:
> If a daemon has multiple remotes to connect to and it already reached
> pretty high backoff interval on those connections, it is possible that
> it will never be able to connect, if the DNS TTL value is too low.
>
> For example, if ovn-controller has 3 remotes in the ovn-remote
> configuration, where each is presented as a host name, the following
> is happening when it reaches default max backoff of 8 seconds:
>
> 1. Tries to connect to the first remote - sends async DNS request.
> 2. Since jsonrpc/reconnect modules are not really aware of this, they
> just treat connection as temporarily failed - 8 second backoff.
> 3. After the backoff - switching to a new remote - sending DNS request.
> 4. Temporarily failing - 8 second backoff.
> 5. Switching to the third remote - sending DNS request.
> 6. Temporarily failing - 8 second backoff.
> 7. Finally trying the first remote again - checking DNS.
> 8. There is a processed response, but it is already 24 seconds old.
> 9. If DNS TTL is lower than 24 seconds - consider expired - send
> a new DNS request.
> 10. Go to step 2.
>
> With that, if DNS TTL is lower than 3x of the backoff interval, the
> process will never be able to connect without some external help to
> break the loop.
>
> A proper solution for this should include:
>
> 1. Making jsonrpc and reconnect and all the users of these modules
> aware of the DNS request being made. This means introduction of
> a new RECONNECT state for DNS request and not switching to a new
> target while we're still in this state.
>
> 2. Making the poll loop state machine properly react to DNS responses
> by waiting on the file descriptor provided by the unbound library.
>
> However, such solution will be very invasive to the code structure
> and all the involved libraries, so it may not be something that we
> would want to backport as a bug fix to stable branches.
>
> Instead, making a much simpler change to allow use of never previously
> accessed DNS replies for a short period of time, so the loop can be
> broken. It's not caching if we just requested the value, but didn't
> use it yet, it's a "transaction in progress" situation in which we can
> use the response even if TTL is zero. Without a proper solution though
> we can't be sure that the process will ever look at the result of
> asynchronous request, so we need to have an upper limit for such
> "transactions in progress". Limiting them to a fairly arbitrary, but
> big enough, value of 5 minutes. In the worst case where the address
> actually goes stale in between our request and the first access, we'll
> try to use the stale value once and then re-request right away on
> failure to connect.
>
> This solution seems reasonable and simple enough to backport to stable
> branches while working on the proper solution on main.
>
> Reported-at:
> https://mail.openvswitch.org/pipermail/ovs-discuss/2025-June/053738.html
> Signed-off-by: Ilya Maximets <[email protected]>
The approach seems fine to me. I only did a code review, no actual testing.
Acked-by: Eelco Chaudron <[email protected]>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev