Hi Minh,

Thanks for the late comments. See my replies inline.
Best Regards,
ThuanTr

-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Friday, March 20, 2020 9:27 AM
To: Thuan Tran <thuan.t...@dektech.com.au>; Thang Duc Nguyen <thang.d.ngu...@dektech.com.au>; Gary Lee <gary....@dektech.com.au>
Cc: opensaf-devel@lists.sourceforge.net; Thanh Nguyen <thanh.ngu...@dektech.com.au>
Subject: Re: [PATCH 1/1] osaf: enhance vm frozen detection in tcp.plugin [#3164]

Hi Thuan,

I'm adding Thanh since he's looking at the patch as well.
I see you pushed the patch; here are some late comments.

Thanks
Minh

On 9/3/20 4:49 pm, thuan.tran wrote:
> - Active SC will reboot if arb time somehow has big gap b/w heartbeats
>   in watch takeover request. Active SC may still OK but be rebooted
>   unexpectedly.
> - Enhance VM was frozen detection base on arb time and local time counter.
[M]: The patch is a general solution for both VMs and containers, and it runs a counter thread instead of reading time.time(). We need to explain that in a bit more detail.
[T]: Sorry, that commit is already merged, so I cannot update the commit message. I have explained it in the docstring of time_counting(); I hope that is still enough info.
> ---
>  src/osaf/consensus/plugins/tcp/tcp.plugin | 43 ++++++++++++++++++-----
>  1 file changed, 35 insertions(+), 8 deletions(-)
>
> diff --git a/src/osaf/consensus/plugins/tcp/tcp.plugin b/src/osaf/consensus/plugins/tcp/tcp.plugin
> index 0be20fcee..aaa1c1c3f 100755
> --- a/src/osaf/consensus/plugins/tcp/tcp.plugin
> +++ b/src/osaf/consensus/plugins/tcp/tcp.plugin
> @@ -23,8 +23,24 @@ import sys
>  import time
>  import xmlrpc.client
>  import syslog
> +import threading
>
>
> +counter_run = False
> +counter_time = 0.0
> +
> +def time_counting(hb_interval):
> +    '''
> +    When node is frozen, if it is VM, clock time not jump
> +    but if it is container, clock time still jump.
> +    This function to help know node is frozen or arbitrator server issue
> +    '''
> +    global counter_run, counter_time
> +    counter_time = 0.0
> +    while (counter_run):
> +        time.sleep(hb_interval)
> +        counter_time += hb_interval
> +
>  class ArbitratorPlugin(object):
>      """ This class represents a TCP Plugin """
>
> @@ -478,6 +494,8 @@ class ArbitratorPlugin(object):
>              return ret
>
>          last_arb_timestamp = 0
> +        global counter_run, counter_time
> +        counter = None
>          while True:
>              if key == self.takeover_request:
>                  if self.is_active() is False:
> @@ -486,15 +504,24 @@ class ArbitratorPlugin(object):
>                  while True:
>                      try:
>                          time_at_arb = self.proxy.heartbeat(self.hostname)
> -                        if last_arb_timestamp == 0:
> -                            last_arb_timestamp = time_at_arb
> -                            break
> -                        elif (time_at_arb - last_arb_timestamp) > self.timeout:
> -                            # VM was frozen?
> -                            syslog.syslog('VM was frozen!')
> -                            ret['code'] = 126
> -                            return ret
> +                        if counter is not None:
> +                            counter_run = False
> +                            counter.join()
> +                        if (last_arb_timestamp != 0) and \
> +                           (time_at_arb - last_arb_timestamp > self.timeout):
> +                            if counter_time < self.timeout:
> +                                syslog.syslog('VM was frozen!')
> +                                ret['code'] = 126
> +                                return ret
> +                            syslog.syslog('Arb server issue?')
> +                            raise socket.error('Arb server issue?')
>                          else:
> +                            counter = threading.Thread(
> +                                target=time_counting,
> +                                args=(self.heartbeat_interval,))
> +                            counter_run = True
> +                            counter.setDaemon(True)
> +                            counter.start()
[M]: This means we are going to start the thread and wait for it to join() back multiple times in this while loop.
[T]: Yes, that's true. If you have a better idea, I will create another ticket for the update, since the commit for this ticket is already merged.
>                              last_arb_timestamp = time_at_arb
>                              break
>                      except socket.error:

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
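
For reference, the decision the patch makes, and one possible way to avoid starting and joining a thread on every heartbeat, can be sketched as below. This is only an illustration and not the merged code: classify_gap() and LocalCounter are hypothetical names that do not exist in tcp.plugin, and the sketch assumes that one long-lived counter thread, reset after each heartbeat, would be acceptable. Unlike the patch, stop() here does not wait for the current sleep to finish, so the last partial interval is not counted.

```python
import threading
import time


def classify_gap(arb_gap, local_elapsed, timeout):
    """Classify the gap between two arbitrator timestamps.

    Mirrors the check in the patch: a large gap in arbitrator time
    with a small local count suggests this node was frozen; a large
    gap with a full local count suggests an arbitrator-side issue.
    """
    if arb_gap <= timeout:
        return 'ok'
    if local_elapsed < timeout:
        return 'frozen'
    return 'arb_issue'


class LocalCounter(object):
    """Hypothetical helper: one long-lived thread that counts ticks."""

    def __init__(self, interval):
        self._interval = interval
        self._elapsed = 0.0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait() returns False on timeout and True once set,
        # so the loop ticks every interval until stop() is called.
        while not self._stop.wait(self._interval):
            with self._lock:
                self._elapsed += self._interval

    def reset(self):
        # Call after each successful heartbeat instead of join()/start().
        with self._lock:
            self._elapsed = 0.0

    def elapsed(self):
        with self._lock:
            return self._elapsed

    def stop(self):
        self._stop.set()
        self._thread.join()
```

With such a helper, the watch loop would call reset() after each heartbeat and pass elapsed() to classify_gap(), so the thread is created once per watch rather than once per heartbeat.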