On 11/14/25 5:22 PM, Mike Pattrick wrote:
> On Fri, Nov 14, 2025 at 7:25 AM Ilya Maximets <[email protected]> wrote:
> 
>     If a daemon has multiple remotes to connect to and it has already
>     reached a fairly high backoff interval on those connections, it may
>     never be able to connect if the DNS TTL value is too low.
> 
>     For example, if ovn-controller has 3 remotes in the ovn-remote
>     configuration, each specified as a host name, the following
>     happens once it reaches the default max backoff of 8 seconds:
> 
>     1. Tries to connect to the first remote - sends async DNS request.
>     2. Since jsonrpc/reconnect modules are not really aware of this, they
>        just treat connection as temporarily failed - 8 second backoff.
>     3. After the backoff - switching to a new remote - sending DNS request.
>     4. Temporarily failing - 8 second backoff.
>     5. Switching to the third remote - sending DNS request.
>     6. Temporarily failing - 8 second backoff.
>     7. Finally trying the first remote again - checking DNS.
>     8. There is a processed response, but it is already 24 seconds old.
>     9. If DNS TTL is lower than 24 seconds - consider expired - send
>        a new DNS request.
>     10. Go to step 2.
> 
>     With that, if the DNS TTL is lower than 3x the backoff interval
>     (3 remotes x 8 seconds = 24 seconds in this example), the process
>     will never be able to connect without some external help to break
>     the loop.
> 
>     A proper solution for this should include:
> 
>     1. Making jsonrpc, reconnect, and all the users of these modules
>        aware of the DNS request being made.  This means introducing
>        a new RECONNECT state for the DNS request and not switching to
>        a new target while we're still in this state.
> 
>     2. Making the poll loop state machine properly react to DNS responses
>        by waiting on the file descriptor provided by the unbound library.
> 
>     However, such a solution would be very invasive to the code structure
>     and all the involved libraries, so it may not be something that we
>     would want to backport as a bug fix to stable branches.
> 
>     Instead, make a much simpler change: allow the use of DNS replies
>     that were never previously accessed for a short period of time, so
>     the loop can be broken.  It's not caching if we just requested the
>     value but didn't use it yet; it's a "transaction in progress"
>     situation in which we can use the response even if TTL is zero.
>     Without a proper solution, though, we can't be sure that the process
>     will ever look at the result of an asynchronous request, so we need
>     an upper limit for such "transactions in progress".  Limit them to a
>     fairly arbitrary, but big enough, value of 5 minutes.  In the worst
>     case, where the address actually goes stale between our request and
>     the first access, we'll try to use the stale value once and then
>     re-request right away on failure to connect.
> 
>     This solution seems reasonable and simple enough to backport to stable
>     branches while working on the proper solution on main.
> 
>     Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2025-June/053738.html
>     Signed-off-by: Ilya Maximets <[email protected]>
> 
> 
> This looks reasonable to me.
> 
> Acked-by: Mike Pattrick <[email protected]>
>  
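
For readers following the loop described in the quoted commit message, here
is a rough toy model of it in Python.  This is only an illustrative sketch,
not the OVS code: the names (simulate, Entry, UNUSED_GRACE) are hypothetical,
and it assumes the DNS response arrives right after the request is sent and
that a connection succeeds as soon as a usable address is available.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Upper limit for a never-accessed reply: the 5 minutes from the commit message.
UNUSED_GRACE = 300.0


@dataclass
class Entry:
    """A resolved DNS answer (hypothetical model, not an OVS structure)."""
    resolved_at: float   # time the answer was obtained, in seconds
    used: bool = False   # whether the answer has ever been accessed


def simulate(n_remotes: int, backoff: float, ttl: float,
             allow_unused: bool, max_attempts: int = 50) -> Optional[float]:
    """Return the time of the first successful connect, or None if the
    daemon is still looping after max_attempts connection attempts."""
    cache: Dict[int, Entry] = {}
    now = 0.0
    for attempt in range(max_attempts):
        remote = attempt % n_remotes          # round-robin over the remotes
        entry = cache.get(remote)
        if entry is not None:
            age = now - entry.resolved_at
            fresh = age <= ttl
            # "Transaction in progress": requested, but never looked at yet.
            in_progress = (allow_unused and not entry.used
                           and age <= UNUSED_GRACE)
            if fresh or in_progress:
                entry.used = True
                return now                    # usable address: connect
        # No usable answer: send a (new) async request and treat the
        # attempt as temporarily failed, waiting out the backoff.
        cache[remote] = Entry(resolved_at=now)
        now += backoff
    return None


if __name__ == "__main__":
    # 3 remotes, 8 s backoff, 10 s TTL: loops forever without the change.
    print(simulate(3, 8.0, 10.0, allow_unused=False))
    # Same setup, but never-accessed replies may be used once.
    print(simulate(3, 8.0, 10.0, allow_unused=True))
```

In this model, a TTL below 3 x 8 = 24 seconds never yields a connection
unless never-accessed replies are allowed, which matches the condition
stated in the commit message.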

Thanks, Mike and Eelco!  Applied and backported down to 3.3.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
