Re: [ovs-dev] [PATCH] reconnect: Fix broken inactivity probe if there is no other reason to wake up.

Ilya Maximets Tue, 22 Feb 2022 07:46:03 -0800

On 2/22/22 13:05, Dumitru Ceara wrote:
> On 2/21/22 15:16, Ilya Maximets wrote:
>> The purpose of reconnect_deadline__() function is twofold:
>>
>> 1. Its result is used to tell if the state has to be changed right now
>>    in reconnect_run().
>> 2. Its result also used to determine when the process need to wake up
>>    and call reconnect_run() for a next time, i.e. when the state may
>>    need to be changed next time.
>>
>> Since introduction of the 'receive-attempted' feature, the function
>> returns LLONG_MAX if the deadline is in the future.  That works for
>> the first case, but doesn't for the second one, because we don't
>> really know when we need to call reconnect_run().
>>
>> This is the problem for applications where jsonrpc connection is the
>> only source of wake ups, e.g. ovn-northd.  When the network goes down
>> silently, e.g. server looses IP address due to DHCP failure, ovn-northd
>> will sleep in the poll loop indefinitely after being told that it
>> doesn't need to call reconnect_run() (deadline == LLONG_MAX).
>>
>> Fixing that by actually returning the expected time if it is in the
>> future, so we will know when to wake up.  In order to keep the
>> 'receive-attempted' feature, returning 'now + 1' in case where the
>> time has already passed, but receive wasn't attempted.  That will
>> trigger a fast wake up, so the application will be able to attempt the
>> receive even if there was no real events.  In a correctly written
>> application we should not fall into this case more than once in a row.
>> '+ 1' ensures that we will not transition into a different state
>> prematurely, i.e. before the receive is actually attempted.
>>
>> Fixes: 4241d652e465 ("jsonrpc: Avoid disconnecting prematurely due to long 
>> poll intervals.")
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>> ---
> 
> Hi Ilya,
> 
> This change looks good to me, I just have a couple minor comment-related
> remarks; in any case:
> 
> Acked-by: Dumitru Ceara <dce...@redhat.com>
> 
> Regards,
> Dumitru
> 
>>
>> CC: Wentao Jia
>> This might be the problem you tried to fix in northd by setting a probe
>> interval.  Would be great, if you can test your case with this patch.
>>
>>  lib/reconnect.c         |  29 +++++++++---
>>  python/ovs/reconnect.py |  40 +++++++++++-----
>>  tests/reconnect.at      | 102 +++++++++++++++++++++++++++++++---------
>>  3 files changed, 133 insertions(+), 38 deletions(-)
>>
>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>> index a929ddfd2..f6fd165e4 100644
>> --- a/lib/reconnect.c
>> +++ b/lib/reconnect.c
>> @@ -75,7 +75,8 @@ struct reconnect {
>>  
>>  static void reconnect_transition__(struct reconnect *, long long int now,
>>                                     enum state state);
>> -static long long int reconnect_deadline__(const struct reconnect *);
>> +static long long int reconnect_deadline__(const struct reconnect *,
>> +                                          long long int now);
>>  static bool reconnect_may_retry(struct reconnect *);
>>  
>>  static const char *
>> @@ -539,7 +540,7 @@ reconnect_transition__(struct reconnect *fsm, long long 
>> int now,
>>  }
>>  
>>  static long long int
>> -reconnect_deadline__(const struct reconnect *fsm)
>> +reconnect_deadline__(const struct reconnect *fsm, long long int now)
>>  {
>>      ovs_assert(fsm->state_entered != LLONG_MIN);
>>      switch (fsm->state) {
>> @@ -557,8 +558,18 @@ reconnect_deadline__(const struct reconnect *fsm)
>>          if (fsm->probe_interval) {
>>              long long int base = MAX(fsm->last_activity, 
>> fsm->state_entered);
>>              long long int expiration = base + fsm->probe_interval;
>> -            if (fsm->last_receive_attempt >= expiration) {
>> +            if (now < expiration || fsm->last_receive_attempt >= 
>> expiration) {
>> +                /* We still have time before the expiration or the time has
>> +                 * already passed and there was no activity.  In the first 
>> case
>> +                 * we need to wait for the expiration, in the second - we're
>> +                 * already past the deadline. */
>>                  return expiration;
>> +            } else {
>> +                /* Time has already passed, but we didn't attempt to receive
>> +                 * anything.  We need to wake up and try to receive even if
>> +                 * nothing is pending, so we can update the expiration time
>> +                 * or transition into S_IDLE state. */
>> +                return now + 1;
>>              }
>>          }
>>          return LLONG_MAX;
>> @@ -566,8 +577,14 @@ reconnect_deadline__(const struct reconnect *fsm)
>>      case S_IDLE:
>>          if (fsm->probe_interval) {
>>              long long int expiration = fsm->state_entered + 
>> fsm->probe_interval;
>> -            if (fsm->last_receive_attempt >= expiration) {
>> +            if (now < expiration || fsm->last_receive_attempt >= 
>> expiration) {
> 
> Should we duplicate or add a similar comment to what we have for S_ACTIVE?


I thought about that, but these 'if's are very close to each other and the
comment will be exactly the same, so I didn't want to just copy-paste it.

Does that make sense?

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH] reconnect: Fix broken inactivity probe if there is no other reason to wake up.

Reply via email to