Re: [ovs-dev] [PATCH] reconnect: Fix broken inactivity probe if there is no other reason to wake up.

Dumitru Ceara Tue, 22 Feb 2022 08:00:36 -0800

On 2/22/22 16:45, Ilya Maximets wrote:
> On 2/22/22 13:05, Dumitru Ceara wrote:
>> On 2/21/22 15:16, Ilya Maximets wrote:
>>> The purpose of reconnect_deadline__() function is twofold:
>>>
>>> 1. Its result is used to tell if the state has to be changed right now
>>>    in reconnect_run().
>>> 2. Its result also used to determine when the process need to wake up
>>>    and call reconnect_run() for a next time, i.e. when the state may
>>>    need to be changed next time.
>>>
>>> Since introduction of the 'receive-attempted' feature, the function
>>> returns LLONG_MAX if the deadline is in the future.  That works for
>>> the first case, but doesn't for the second one, because we don't
>>> really know when we need to call reconnect_run().
>>>
>>> This is the problem for applications where jsonrpc connection is the
>>> only source of wake ups, e.g. ovn-northd.  When the network goes down
>>> silently, e.g. server looses IP address due to DHCP failure, ovn-northd
>>> will sleep in the poll loop indefinitely after being told that it
>>> doesn't need to call reconnect_run() (deadline == LLONG_MAX).
>>>
>>> Fixing that by actually returning the expected time if it is in the
>>> future, so we will know when to wake up.  In order to keep the
>>> 'receive-attempted' feature, returning 'now + 1' in case where the
>>> time has already passed, but receive wasn't attempted.  That will
>>> trigger a fast wake up, so the application will be able to attempt the
>>> receive even if there was no real events.  In a correctly written
>>> application we should not fall into this case more than once in a row.
>>> '+ 1' ensures that we will not transition into a different state
>>> prematurely, i.e. before the receive is actually attempted.
>>>
>>> Fixes: 4241d652e465 ("jsonrpc: Avoid disconnecting prematurely due to long 
>>> poll intervals.")
>>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>>> ---
>>
>> Hi Ilya,
>>
>> This change looks good to me, I just have a couple minor comment-related
>> remarks; in any case:
>>
>> Acked-by: Dumitru Ceara <dce...@redhat.com>
>>
>> Regards,
>> Dumitru
>>
>>>
>>> CC: Wentao Jia
>>> This might be the problem you tried to fix in northd by setting a probe
>>> interval.  Would be great, if you can test your case with this patch.
>>>
>>>  lib/reconnect.c         |  29 +++++++++---
>>>  python/ovs/reconnect.py |  40 +++++++++++-----
>>>  tests/reconnect.at      | 102 +++++++++++++++++++++++++++++++---------
>>>  3 files changed, 133 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>>> index a929ddfd2..f6fd165e4 100644
>>> --- a/lib/reconnect.c
>>> +++ b/lib/reconnect.c
>>> @@ -75,7 +75,8 @@ struct reconnect {
>>>  
>>>  static void reconnect_transition__(struct reconnect *, long long int now,
>>>                                     enum state state);
>>> -static long long int reconnect_deadline__(const struct reconnect *);
>>> +static long long int reconnect_deadline__(const struct reconnect *,
>>> +                                          long long int now);
>>>  static bool reconnect_may_retry(struct reconnect *);
>>>  
>>>  static const char *
>>> @@ -539,7 +540,7 @@ reconnect_transition__(struct reconnect *fsm, long long 
>>> int now,
>>>  }
>>>  
>>>  static long long int
>>> -reconnect_deadline__(const struct reconnect *fsm)
>>> +reconnect_deadline__(const struct reconnect *fsm, long long int now)
>>>  {
>>>      ovs_assert(fsm->state_entered != LLONG_MIN);
>>>      switch (fsm->state) {
>>> @@ -557,8 +558,18 @@ reconnect_deadline__(const struct reconnect *fsm)
>>>          if (fsm->probe_interval) {
>>>              long long int base = MAX(fsm->last_activity, 
>>> fsm->state_entered);
>>>              long long int expiration = base + fsm->probe_interval;
>>> -            if (fsm->last_receive_attempt >= expiration) {
>>> +            if (now < expiration || fsm->last_receive_attempt >= 
>>> expiration) {
>>> +                /* We still have time before the expiration or the time has
>>> +                 * already passed and there was no activity.  In the first 
>>> case
>>> +                 * we need to wait for the expiration, in the second - 
>>> we're
>>> +                 * already past the deadline. */
>>>                  return expiration;
>>> +            } else {
>>> +                /* Time has already passed, but we didn't attempt to 
>>> receive
>>> +                 * anything.  We need to wake up and try to receive even if
>>> +                 * nothing is pending, so we can update the expiration time
>>> +                 * or transition into S_IDLE state. */
>>> +                return now + 1;
>>>              }
>>>          }
>>>          return LLONG_MAX;
>>> @@ -566,8 +577,14 @@ reconnect_deadline__(const struct reconnect *fsm)
>>>      case S_IDLE:
>>>          if (fsm->probe_interval) {
>>>              long long int expiration = fsm->state_entered + 
>>> fsm->probe_interval;
>>> -            if (fsm->last_receive_attempt >= expiration) {
>>> +            if (now < expiration || fsm->last_receive_attempt >= 
>>> expiration) {
>>
>> Should we duplicate or add a similar comment to what we have for S_ACTIVE?
> 
> I thought about that, but these 'if's are very close to each other and the
> comment will be exactly the same, so I didn't want to just copy-paste it.
> 
> Does that make sense?
>


I see but then the second comment, for the 'else' in state S_IDLE, also
seems very similar to the one in state S_ACTIVE.  Can we do the same
thing for it too?  E.g., change the one in S_ACTIVE to:

"Time has already passed, but we didn't attempt to receive anything.  We
need to wake up and try to receive even if nothing is pending, so we can
update the expiration time or transition to a different state."

Disconnect happens on a state transition too so the comment should apply
to both cases.

In any case, I'm OK with the original patch too, this is just nit
picking. :)

> Best regards, Ilya Maximets.
> 

Regards,
Dumitru

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH] reconnect: Fix broken inactivity probe if there is no other reason to wake up.

Reply via email to