On 2/22/22 13:05, Dumitru Ceara wrote: > On 2/21/22 15:16, Ilya Maximets wrote: >> The purpose of reconnect_deadline__() function is twofold: >> >> 1. Its result is used to tell if the state has to be changed right now >> in reconnect_run(). >> 2. Its result also used to determine when the process need to wake up >> and call reconnect_run() for a next time, i.e. when the state may >> need to be changed next time. >> >> Since introduction of the 'receive-attempted' feature, the function >> returns LLONG_MAX if the deadline is in the future. That works for >> the first case, but doesn't for the second one, because we don't >> really know when we need to call reconnect_run(). >> >> This is the problem for applications where jsonrpc connection is the >> only source of wake ups, e.g. ovn-northd. When the network goes down >> silently, e.g. server looses IP address due to DHCP failure, ovn-northd >> will sleep in the poll loop indefinitely after being told that it >> doesn't need to call reconnect_run() (deadline == LLONG_MAX). >> >> Fixing that by actually returning the expected time if it is in the >> future, so we will know when to wake up. In order to keep the >> 'receive-attempted' feature, returning 'now + 1' in case where the >> time has already passed, but receive wasn't attempted. That will >> trigger a fast wake up, so the application will be able to attempt the >> receive even if there was no real events. In a correctly written >> application we should not fall into this case more than once in a row. >> '+ 1' ensures that we will not transition into a different state >> prematurely, i.e. before the receive is actually attempted. >> >> Fixes: 4241d652e465 ("jsonrpc: Avoid disconnecting prematurely due to long >> poll intervals.") >> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> >> --- > > Hi Ilya, > > This change looks good to me, I just have a couple minor comment-related > remarks; in any case: > > Acked-by: Dumitru Ceara <dce...@redhat.com> > > Regards, > Dumitru > >> >> CC: Wentao Jia >> This might be the problem you tried to fix in northd by setting a probe >> interval. Would be great, if you can test your case with this patch. >> >> lib/reconnect.c | 29 +++++++++--- >> python/ovs/reconnect.py | 40 +++++++++++----- >> tests/reconnect.at | 102 +++++++++++++++++++++++++++++++--------- >> 3 files changed, 133 insertions(+), 38 deletions(-) >> >> diff --git a/lib/reconnect.c b/lib/reconnect.c >> index a929ddfd2..f6fd165e4 100644 >> --- a/lib/reconnect.c >> +++ b/lib/reconnect.c >> @@ -75,7 +75,8 @@ struct reconnect { >> >> static void reconnect_transition__(struct reconnect *, long long int now, >> enum state state); >> -static long long int reconnect_deadline__(const struct reconnect *); >> +static long long int reconnect_deadline__(const struct reconnect *, >> + long long int now); >> static bool reconnect_may_retry(struct reconnect *); >> >> static const char * >> @@ -539,7 +540,7 @@ reconnect_transition__(struct reconnect *fsm, long long >> int now, >> } >> >> static long long int >> -reconnect_deadline__(const struct reconnect *fsm) >> +reconnect_deadline__(const struct reconnect *fsm, long long int now) >> { >> ovs_assert(fsm->state_entered != LLONG_MIN); >> switch (fsm->state) { >> @@ -557,8 +558,18 @@ reconnect_deadline__(const struct reconnect *fsm) >> if (fsm->probe_interval) { >> long long int base = MAX(fsm->last_activity, >> fsm->state_entered); >> long long int expiration = base + fsm->probe_interval; >> - if (fsm->last_receive_attempt >= expiration) { >> + if (now < expiration || fsm->last_receive_attempt >= >> expiration) { >> + /* We still have time before the expiration or the time has >> + * already passed and there was no activity. In the first >> case >> + * we need to wait for the expiration, in the second - we're >> + * already past the deadline. */ >> return expiration; >> + } else { >> + /* Time has already passed, but we didn't attempt to receive >> + * anything. We need to wake up and try to receive even if >> + * nothing is pending, so we can update the expiration time >> + * or transition into S_IDLE state. */ >> + return now + 1; >> } >> } >> return LLONG_MAX; >> @@ -566,8 +577,14 @@ reconnect_deadline__(const struct reconnect *fsm) >> case S_IDLE: >> if (fsm->probe_interval) { >> long long int expiration = fsm->state_entered + >> fsm->probe_interval; >> - if (fsm->last_receive_attempt >= expiration) { >> + if (now < expiration || fsm->last_receive_attempt >= >> expiration) { > > Should we duplicate or add a similar comment to what we have for S_ACTIVE?
I thought about that, but these 'if's are very close to each other and the comment will be exactly the same, so I didn't want to just copy-paste it. Does that make sense? Best regards, Ilya Maximets. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev