On Fri, Nov 14, 2025 at 7:25 AM Ilya Maximets <[email protected]> wrote:
> If a daemon has multiple remotes to connect to and it already reached
> pretty high backoff interval on those connections, it is possible that
> it will never be able to connect, if the DNS TTL value is too low.
>
> For example, if ovn-controller has 3 remotes in the ovn-remote
> configuration, where each is presented as a host name, the following
> is happening when it reaches default max backoff of 8 seconds:
>
>  1. Tries to connect to the first remote - sends async DNS request.
>  2. Since jsonrpc/reconnect modules are not really aware of this, they
>     just treat connection as temporarily failed - 8 second backoff.
>  3. After the backoff - switching to a new remote - sending DNS request.
>  4. Temporarily failing - 8 second backoff.
>  5. Switching to the third remote - sending DNS request.
>  6. Temporarily failing - 8 second backoff.
>  7. Finally trying the first remote again - checking DNS.
>  8. There is a processed response, but it is already 24 seconds old.
>  9. If DNS TTL is lower than 24 seconds - consider expired - send
>     a new DNS request.
> 10. Go to step 2.
>
> With that, if DNS TTL is lower than 3x of the backoff interval, the
> process will never be able to connect without some external help to
> break the loop.
>
> A proper solution for this should include:
>
> 1. Making jsonrpc and reconnect and all the users of these modules
>    aware of the DNS request being made.  This means introduction of
>    a new RECONNECT state for DNS request and not switching to a new
>    target while we're still in this state.
>
> 2. Making the poll loop state machine properly react to DNS responses
>    by waiting on the file descriptor provided by the unbound library.
>
> However, such solution will be very invasive to the code structure
> and all the involved libraries, so it may not be something that we
> would want to backport as a bug fix to stable branches.
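
For anyone following along, the loop quoted above can be reproduced with a
toy model. This is only an illustrative sketch, not OVS code: the function
name and parameters are made up, and it assumes the DNS reply arrives right
after the request is sent, so a reply's age equals the time since the
request.

```python
def first_connect_time(n_remotes, backoff, ttl, max_attempts=100):
    """Toy model of the reconnect loop (hypothetical, not OVS code).

    Returns the time in seconds at which an attempt finds a usable DNS
    reply, or None if the loop never breaks within max_attempts.
    """
    requested_at = {}  # remote index -> time of the last DNS request
    now = 0
    for attempt in range(max_attempts):
        remote = attempt % n_remotes  # round-robin over the remotes
        sent = requested_at.get(remote)
        # A cached reply is usable only while its age does not exceed TTL.
        if sent is not None and now - sent <= ttl:
            return now
        # No reply yet, or it expired: (re)send the async request and
        # treat the attempt as temporarily failed.
        requested_at[remote] = now
        now += backoff
    return None
```

With 3 remotes and an 8 second backoff, each remote is revisited 24 seconds
after its request, so in this model any TTL below 24 seconds never connects.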
>
> Instead, making a much simpler change to allow use of never previously
> accessed DNS replies for a short period of time, so the loop can be
> broken.  It's not caching if we just requested the value, but didn't
> use it yet, it's a "transaction in progress" situation in which we can
> use the response even if TTL is zero.  Without a proper solution though
> we can't be sure that the process will ever look at the result of
> asynchronous request, so we need to have an upper limit for such
> "transactions in progress".  Limiting them to a fairly arbitrary, but
> big enough, value of 5 minutes.  In the worst case where the address
> actually goes stale in between our request and the first access, we'll
> try to use the stale value once and then re-request right away on
> failure to connect.
>
> This solution seems reasonable and simple enough to backport to stable
> branches while working on the proper solution on main.
>
> Reported-at:
> https://mail.openvswitch.org/pipermail/ovs-discuss/2025-June/053738.html
> Signed-off-by: Ilya Maximets <[email protected]>
>

This looks reasonable to me.

Acked-by: Mike Pattrick <[email protected]>
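
For reference, the "transaction in progress" rule from the commit message
could be sketched roughly as follows. This is a simplified model and not the
actual patch: DnsEntry, lookup and GRACE_SECONDS are hypothetical names, and
it assumes the age is measured from the moment the reply was received.

```python
from dataclasses import dataclass

GRACE_SECONDS = 5 * 60  # upper limit for a "transaction in progress"


@dataclass
class DnsEntry:
    """Hypothetical cache entry for one resolved host name."""
    address: str
    resolved_at: float      # time the reply was received, in seconds
    ttl: float              # TTL from the DNS reply, in seconds
    accessed: bool = False  # set once a caller has used the reply


def lookup(entry, now):
    """Return the address if usable, or None if a new request is needed."""
    age = now - entry.resolved_at
    fresh = age <= entry.ttl
    # A reply that was never looked at may be used once even if its TTL
    # has already expired, but only within the grace period.
    in_progress = not entry.accessed and age <= GRACE_SECONDS
    if fresh or in_progress:
        entry.accessed = True
        return entry.address
    return None
```

In this sketch a reply with TTL 0 that is first checked 24 seconds later is
still returned once; the next lookup returns None, which matches the
"use the stale value once and then re-request" worst case described above.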
