Hi,

There is one way to improve the detection time: change the value of "net.ipv4.tcp_retries2" to 3. The default value of "net.ipv4.tcp_retries2" is 15.
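For a rough sense of what the tcp_retries2 change buys, the sketch below estimates how long TCP keeps retransmitting unacknowledged data before it gives up on the connection. It assumes the usual Linux bounds of a 200 ms minimum and a 120 s maximum retransmission timeout, doubling between attempts. The real figure depends on the measured round-trip time, and the function name is purely illustrative.

```python
def retransmit_give_up_seconds(retries2, rto_min=0.2, rto_max=120.0):
    """Estimate the time until TCP abandons a connection with unacked data.

    Assumption: the retransmission timeout starts at rto_min, doubles after
    every attempt and is capped at rto_max. The connection is dropped when
    the timer following the last of `retries2` retransmissions expires.
    """
    total = 0.0
    for attempt in range(retries2 + 1):
        total += min(rto_min * 2 ** attempt, rto_max)
    return total


print(round(retransmit_give_up_seconds(15), 1))  # 924.6 (default setting)
print(round(retransmit_give_up_seconds(3), 1))   # 3.0
```

The value can be tried at runtime with `sysctl -w net.ipv4.tcp_retries2=3`. Note that it is a system-wide setting, so it affects every TCP connection on the node, not only the OpenSAF ones.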
Thanks,
Nivrutti

-----Original Message-----
From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
Sent: Thursday, September 15, 2016 6:38 PM
To: Shu Wang <[email protected]>; [email protected]
Subject: Re: [users] how long it takes to detect node sudden power

Hi,

You could try the fix in this ticket and see if the scenario is the same:
https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_p_opensaf_tickets_2014_&d=DQICAg&c=IL_XqQWOjubgfqINi2jTzg&r=8oj2Tn7_JuMy90N67rXExkWsx29-JTWbXUkT3IIi99w&m=DetywC0rOBBSwA5PRfrcpfRXAyGliPduaCiI-fnO-gw&s=gSGrK2pteB9mnPgovHNo3qsOXF0w9s77wt4nUXOHt4o&e=

The patch is in:
https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_p_opensaf_staging_ci_b30d5e33e50c7eea8cc1730cbe0a0dde572621f0_&d=DQICAg&c=IL_XqQWOjubgfqINi2jTzg&r=8oj2Tn7_JuMy90N67rXExkWsx29-JTWbXUkT3IIi99w&m=DetywC0rOBBSwA5PRfrcpfRXAyGliPduaCiI-fnO-gw&s=UTa3tlpHkkLFWQGUlegcxS3Y6JFlHiW2Yfx1bCbKcTM&e=

Thanks,
Mathi.

> -----Original Message-----
> From: Shu Wang [mailto:[email protected]]
> Sent: Saturday, June 20, 2015 1:50 AM
> To: [email protected]
> Subject: Re: [users] how long it takes to detect node sudden power
>
> We have a similar scenario. One of our payload nodes rebooted, and it
> took anywhere from a few seconds to a few minutes for the other nodes to
> detect the node loss. Because the master controller took a few minutes
> to detect the node loss and react to it, this caused serious problems
> and many service units went bad. Is there any way to improve the
> detection time?
>
> Thank you!
>
> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917 |
> www.NetCracker.com Proven Partner to Communications Service Providers
>
> -----Original Message-----
> Message: 3
> Date: Tue, 14 Apr 2015 09:58:51 +0000
> From: Yao Cheng LIANG <[email protected]>
> Subject: Re: [users] how long it takes to detect node sudden power loss
> To: 'A V Mahesh' <[email protected]>, Mathivanan Naickan
>     Palanivelu <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Message-ID: <285F6C4AD3FBC04EBAE1D68203EA87F20B037F25@asdag1>
> Content-Type: text/plain; charset="windows-1255"
>
> Let me give more info about my setup:
>
> 1. I have two nodes, both running as controllers.
> 2. Besides the OpenSAF services, I have another service unit with three
>    components in it.
> 3. These components use the Checkpoint service for data synchronization.
>
> My dtmd.conf is as below:
>
> DTM_INI_DIS_TIMEOUT_SECS=5
> DTM_TCP_KEEPIDLE_TIME=2
> DTM_TCP_KEEPALIVE_INTVL=1
> DTM_TCP_KEEPALIVE_PROBES=2
>
> I read the code and found that it uses TCP keepalive to detect failure
> of the peer node. However, keepalive packets are not sent until the link
> has been idle for some time, and I think that is where the issue lies.
> Suppose the "standby" node is sending something to the "active" node and
> the "active" node is rebooted at that moment. The "standby" node will
> keep retransmitting until it reaches the maximum number of retries.
> During this period the link is not idle, so the keepalive mechanism does
> not start. This may cause the "standby" node to take a long time to
> detect the failure of the "active" node.
>
> Thanks.
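The gap described above (keepalive probes only start on an idle link, so unacknowledged data in flight delays detection) can be illustrated at the socket level. The sketch below is a generic Linux example and not OpenSAF code. It sets the three keepalive options using the values from the quoted dtmd.conf and additionally sets TCP_USER_TIMEOUT, a Linux socket option that bounds how long transmitted data may stay unacknowledged. Whether the OpenSAF fix referenced in this thread uses that option is not stated here, and the 4000 ms value and the helper name are assumptions chosen for illustration.

```python
import socket


def configure_failure_detection(sock, idle_s=2, interval_s=1, probes=2,
                                user_timeout_ms=4000):
    """Apply keepalive and (where available) TCP_USER_TIMEOUT to a socket.

    Returns a dict of the options that were actually applied, because not
    every platform exposes all of these constants.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}

    wanted = [
        ("TCP_KEEPIDLE", idle_s),             # idle time before first probe
        ("TCP_KEEPINTVL", interval_s),        # gap between probes
        ("TCP_KEEPCNT", probes),              # unanswered probes tolerated
        ("TCP_USER_TIMEOUT", user_timeout_ms) # max time data may stay unacked
    ]
    for name, value in wanted:
        option = getattr(socket, name, None)
        if option is None:
            continue  # constant not available on this platform
        sock.setsockopt(socket.IPPROTO_TCP, option, value)
        applied[name] = value
    return applied


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(configure_failure_detection(s))
    s.close()
```

The keepalive options cover the idle case, while the user timeout covers the case where a send is already in flight when the peer disappears. Unlike net.ipv4.tcp_retries2, these are per-socket settings.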
>
> Ted
>
> From: A V Mahesh [mailto:[email protected]]
> Sent: Monday, April 13, 2015 10:06 PM
> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
> Cc: [email protected]
> Subject: Re: [users] how long it takes to detect node sudden power loss
>
> Hi,
>
> Un-comment the line below in /etc/opensaf/dtmd.conf to enable osafdtm
> tracing:
>
> #args="--tracemask=0xffffffff" ------> args="--tracemask=0xffffffff"
>
> Then run `export MDS_LOG_LEVEL=5` on both node consoles before
> `/etc/init.d/opensafd restart` to get debug MDS logs.
>
> -AVM
>
> On 4/13/2015 11:52 AM, Yao Cheng LIANG wrote:
> Dear AVM,
>
> Thanks. But I need to add 'args="--loglevel=info"' to dtmd.conf so
> that /var/log/opensaf/osafdtm and /var/log/opensaf/mds.log can be seen,
> right?
>
> Ted
>
> From: A V Mahesh [mailto:[email protected]]
> Sent: Monday, April 13, 2015 1:03 PM
> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
> Cc: [email protected]
> Subject: Re: [users] how long it takes to detect node sudden power loss
>
> Hi Ted,
>
> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
> I rebooted the "standby" node 30 times, and found that twice it took
> 1~2 minutes for the "active" node to detect it.
>
> Can you please share the following data from both nodes for a case where
> the "active" node took 1~2 minutes to detect the loss of the standby?
>
> 1) #/var/log/opensaf/osafdtm
> 2) #/var/log/opensaf/mds.log
> 3) #/var/log/messages ( syslog )
> 4) #top (output at the time of detection)
> 5) /etc/opensaf/dtmd.conf
>
> -AVM
>
> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
> I did some tests recently. I have two controllers; I reboot one and
> measure how long the second takes to detect the failure of its peer. I
> rebooted the "standby" node 30 times, and found that twice it took 1~2
> minutes for the "active" node to detect it. Could anyone tell me the
> reason and the solution?
>
> Thanks.
>
> Ted
>
> Sent from Windows Mail
>
> From: Mathivanan Naickan Palanivelu <[email protected]>
> Sent: Thursday, April 9, 2015 7:39 PM
> To: Yao Cheng LIANG <[email protected]>
> Cc: [email protected], 'A V Mahesh' <[email protected]>
>
> I think that, since these are TCP keepalive configuration values, the
> connection loss would be detected immediately in the case of an abrupt
> power shutdown or cable unplug.
>
> Thanks,
> Mathi.
>
> ----- [email protected] wrote:
>
> > Is there any approach to hasten this detection, because 4 seconds is
> > too long for some use cases?
> >
> > Br,
> >
> > Ted
> >
> > -----Original Message-----
> > From: A V Mahesh [mailto:[email protected]]
> > Sent: Monday, March 30, 2015 12:29 PM
> > To: [email protected]
> > Subject: Re: [users] how long it takes to detect node sudden power
> > loss
> >
> > Hi,
> >
> > >>Does that mean it needs 2 + 2*1 = 4s before the peer can detect
> > the node connection loss if I suddenly unplug power supply of one node?
> >
> > Yes, when the connection goes down (cable disconnected or power supply
> > unplugged), the peer detects within 4 seconds that the connection has
> > been lost.
> >
> > -AVM
> >
> > On 3/29/2015 7:11 PM, Yao Cheng LIANG wrote:
> > > Dear all,
> > >
> > > When using TCP, the underlying DTM uses TCP keepalive to detect
> > > connection loss. If my dtmd.conf is as below:
> > >
> > > DTM_TCP_KEEPIDLE_TIME=2
> > > DTM_TCP_KEEPALIVE_INTVL=1
> > > DTM_TCP_KEEPALIVE_PROBES=2
> > >
> > > Does that mean it needs 2 + 2*1 = 4s before the peer can detect
> > > the node connection loss if I suddenly unplug power supply of one
> > > node?
> > >
> > > Thanks.
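The 2 + 2*1 = 4 s estimate discussed in this exchange can be written out as a small calculation. The sketch assumes that the three dtmd.conf values map directly onto the idle time, probe interval and probe count of TCP keepalive, which is inferred from their names, and that the link is idle when the peer disappears. The function name is illustrative.

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Worst-case time for keepalive to declare an idle connection dead.

    The first probe is sent after `idle` seconds without traffic; the
    connection is dropped once `probes` consecutive probes, sent `interval`
    seconds apart, have gone unanswered.
    """
    return idle + interval * probes


# Values from the dtmd.conf quoted in the thread.
print(keepalive_detection_seconds(2, 1, 2))        # 4
# Linux system defaults (7200 s idle, 75 s interval, 9 probes).
print(keepalive_detection_seconds(7200, 75, 9))    # 7875
```

This bound only holds for an idle link. If data is in flight when the peer goes away, retransmission timers govern the detection time instead.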
> > >
> > > Ted
> > >
> > > ------------------------------------------------------------------------------
> > > Dive into the World of Parallel Programming The Go Parallel
> > > Website, sponsored by Intel and developed in partnership with
> > > Slashdot Media, is your hub for all things parallel software
> > > development, from weekly thought leadership blogs to news, videos,
> > > case studies, tutorials and more. Take a look and join the
> > > conversation now.
> > > https://urldefense.proofpoint.com/v2/url?u=http-3A__goparallel.sourceforge.net_&d=DQICAg&c=IL_XqQWOjubgfqINi2jTzg&r=8oj2Tn7_JuMy90N67rXExkWsx29-JTWbXUkT3IIi99w&m=DetywC0rOBBSwA5PRfrcpfRXAyGliPduaCiI-fnO-gw&s=Zs7HfD3qAmpaCItVfMRUxsDZoQG2omqLC_2-ifs5Kxw&e=
> > > _______________________________________________
> > > Opensaf-users mailing list
> > > [email protected]
> > > https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.sourceforge.net_lists_listinfo_opensaf-2Dusers&d=DQICAg&c=IL_XqQWOjubgfqINi2jTzg&r=8oj2Tn7_JuMy90N67rXExkWsx29-JTWbXUkT3IIi99w&m=DetywC0rOBBSwA5PRfrcpfRXAyGliPduaCiI-fnO-gw&s=7eqTbeBNi29xHoYbFSxSInV7UyTiDfhJtPItghKLab0&e=
