Hi Nivrutti,

Please take the following change-set and test it:
============================================================
branch:  opensaf-4.7.x
parent:  8043:4a8a00097561
user:    A V Mahesh <[email protected]>
date:    Thu Sep 15 10:50:31 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
============================================================

-AVM

On 9/15/2016 7:51 PM, Anders Widell wrote:
> Yes we were experimenting with the tcp_retries2 option, but the solution
> we ended up with was to use the TCP_USER_TIMEOUT socket option.
>
> regards,
>
> Anders Widell
>
>
> On 09/15/2016 03:13 PM, Nivrutti Kale wrote:
>> Hi,
>>
>> There is one way to improve the detection time. You can change the
>> "net.ipv4.tcp_retries2" value to 3.
>> The default value of "net.ipv4.tcp_retries2" is 15.
>>
>> Thanks,
>> Nivrutti
>>
>> -----Original Message-----
>> From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
>> Sent: Thursday, September 15, 2016 6:38 PM
>> To: Shu Wang <[email protected]>; [email protected]
>> Subject: Re: [users] how long it takes to detect node sudden power
>>
>> Hi,
>>
>> You could try the fix in this ticket
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_p_opensaf_tickets_2014_&d=DQICAg&c=IL_XqQWOjubgfqINi2jTzg&r=8oj2Tn7_JuMy90N67rXExkWsx29-JTWbXUkT3IIi99w&m=DetywC0rOBBSwA5PRfrcpfRXAyGliPduaCiI-fnO-gw&s=gSGrK2pteB9mnPgovHNo3qsOXF0w9s77wt4nUXOHt4o&e=
>> and see if the scenario is the same. The patch is in
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_p_opensaf_staging_ci_b30d5e33e50c7eea8cc1730cbe0a0dde572621f0_&d=DQICAg&c=IL_XqQWOjubgfqINi2jTzg&r=8oj2Tn7_JuMy90N67rXExkWsx29-JTWbXUkT3IIi99w&m=DetywC0rOBBSwA5PRfrcpfRXAyGliPduaCiI-fnO-gw&s=UTa3tlpHkkLFWQGUlegcxS3Y6JFlHiW2Yfx1bCbKcTM&e=
>>
>> Thanks,
>> Mathi.
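
For reference, here is a minimal sketch of what setting the TCP_USER_TIMEOUT socket option mentioned above can look like. It is written in Python rather than the C used by dtm, and it assumes a Linux host. The helper name apply_failfast_options and the choice of a user timeout equal to the keepalive budget (idle + interval * probes) are illustrative assumptions, not taken from the change-set.

```python
import socket

# TCP_USER_TIMEOUT is Linux-specific. Python exposes it as socket.TCP_USER_TIMEOUT
# on Linux builds; the numeric fallback 18 is the Linux value (assumption: Linux host).
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def apply_failfast_options(sock, idle_s=2, intvl_s=1, probes=2):
    """Enable keepalive and TCP_USER_TIMEOUT on a TCP socket (hypothetical helper).

    Returns the user timeout, in milliseconds, that was applied.
    """
    # Keepalive covers an idle link: first probe after idle_s seconds,
    # then one probe every intvl_s seconds, giving up after `probes` failures.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, intvl_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    # TCP_USER_TIMEOUT covers a busy link: it bounds how long transmitted data
    # may remain unacknowledged before the kernel closes the connection.
    # Using the keepalive budget here is an illustrative choice only.
    timeout_ms = (idle_s + intvl_s * probes) * 1000
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return timeout_ms


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print("user timeout (ms):", apply_failfast_options(s))
    finally:
        s.close()
```

The point of combining the two is that keepalive only helps while the link is idle, whereas the user timeout also applies when data is in flight and unacknowledged.
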
>>
>>
>>> -----Original Message-----
>>> From: Shu Wang [mailto:[email protected]]
>>> Sent: Saturday, June 20, 2015 1:50 AM
>>> To: [email protected]
>>> Subject: Re: [users] how long it takes to detect node sudden power
>>>
>>> We have a similar scenario. One of our payload nodes rebooted, and it
>>> took from a few seconds to a few minutes for the other nodes to detect
>>> the node loss. Because the master controller took a few minutes to
>>> detect the node loss and react to it, this caused serious problems and
>>> many service units went bad. Is there any way to improve the detection
>>> time?
>>>
>>> Thank you!
>>>
>>> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917|
>>> www.NetCracker.com Proven Partner to Communications Service Providers
>>>
>>> -----Original Message-----
>>> Message: 3
>>> Date: Tue, 14 Apr 2015 09:58:51 +0000
>>> From: Yao Cheng LIANG <[email protected]>
>>> Subject: Re: [users] how long it takes to detect node sudden power loss
>>> To: 'A V Mahesh' <[email protected]>, Mathivanan Naickan
>>>     Palanivelu <[email protected]>
>>> Cc: "[email protected]"
>>>     <[email protected]>
>>> Message-ID: <285F6C4AD3FBC04EBAE1D68203EA87F20B037F25@asdag1>
>>> Content-Type: text/plain; charset="windows-1255"
>>>
>>> Let me give more info about my setup:
>>>
>>> 1. I have two nodes, both running as controllers
>>> 2. Besides the OpenSAF services, I have another service unit with
>>>    three components in it
>>> 3. These components use the Checkpoint service for data synchronization
>>>
>>> My dtmd.conf is as below:
>>>
>>> DTM_INI_DIS_TIMEOUT_SECS=5
>>> DTM_TCP_KEEPIDLE_TIME=2
>>> DTM_TCP_KEEPALIVE_INTVL=1
>>> DTM_TCP_KEEPALIVE_PROBES=2
>>>
>>> I read the code and found it is using TCP keepalive to detect failure
>>> of the peer node. However, keepalive packets are not sent until the
>>> link has been idle for some time. I think the issue is here. Suppose
>>> the "standby" node is sending something to the "active" node, and at
>>> this time the "active" node is rebooted. The "standby" node will keep
>>> retransmitting until it reaches the maximum number of retries. During
>>> this period the link is not idle, so the keepalive mechanism does not
>>> start to work. This may cause the "standby" node to take a long time
>>> to detect the failure of the "active" node.
>>>
>>> Thanks.
>>>
>>> Ted
>>>
>>> From: A V Mahesh [mailto:[email protected]]
>>> Sent: Monday, April 13, 2015 10:06 PM
>>> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
>>> Cc: [email protected]
>>> Subject: Re: [users] how long it takes to detect node sudden power loss
>>>
>>> Hi,
>>>
>>> Un-comment the below line to enable trace of osafdtm in
>>> /etc/opensaf/dtmd.conf
>>>
>>> #args="--tracemask=0xffffffff" ------> args="--tracemask=0xffffffff"
>>>
>>> And do `export MDS_LOG_LEVEL=5` on both node consoles before
>>> `/etc/init.d/opensafd restart` to get debug MDS logs.
>>>
>>>
>>> -AVM
>>>
>>> On 4/13/2015 11:52 AM, Yao Cheng LIANG wrote:
>>> Dear AVM,
>>>
>>> Thanks. But I need to add args="--loglevel=info" to dtmd.conf so
>>> that /var/log/opensaf/osafdtm and /var/log/opensaf/mds.log can be seen,
>>> right?
>>>
>>> Ted
>>>
>>> From: A V Mahesh [mailto:[email protected]]
>>> Sent: Monday, April 13, 2015 1:03 PM
>>> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
>>> Cc: [email protected]
>>> Subject: Re: [users] how long it takes to detect node sudden power loss
>>>
>>> Hi Ted,
>>>
>>> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
>>> I rebooted the "standby" node 30 times, and found that two times it
>>> took 1~2 minutes for the "active" node to detect it
>>>
>>> Can you please share the following data from both nodes for the cases
>>> where the "active" node took 1~2 minutes to detect the loss of the
>>> standby.
>>>
>>> 1) #/var/log/opensaf/osafdtm
>>> 2) #/var/log/opensaf/mds.log
>>> 3) #/var/log/messages ( syslog )
>>>
>>> 4) #top (output at the time of detection)
>>> 5) /etc/opensaf/dtmd.conf
>>>
>>> -AVM
>>>
>>> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
>>> I did some tests recently. I have two controllers, and I reboot one
>>> to see how long the second takes to detect the failure of its peer.
>>> I rebooted the "standby" node 30 times, and found that two times it
>>> took 1~2 minutes for the "active" node to detect it. Could anyone
>>> tell me the reason and the solution?
>>>
>>> Thanks.
>>>
>>> Ted
>>>
>>> Sent from Windows Mail
>>>
>>> From: Mathivanan Naickan Palanivelu <[email protected]>
>>> Sent: Thursday, April 9, 2015 7:39 PM
>>> To: Yao Cheng LIANG <[email protected]>
>>> Cc: [email protected], 'A V Mahesh' <[email protected]>
>>>
>>> I think since these are TCP keepalive configuration values, the
>>> connection loss would be detected immediately in the cases of abrupt
>>> power shutdown or cable unplug.
>>>
>>> Thanks,
>>> Mathi.
>>>
>>> ----- [email protected] wrote:
>>>
>>>> Is there any approach to hasten this detection, because 4 seconds is
>>>> too long for some use cases?
>>>>
>>>> Br,
>>>>
>>>> Ted
>>>>
>>>> -----Original Message-----
>>>> From: A V Mahesh [mailto:[email protected]]
>>>> Sent: Monday, March 30, 2015 12:29 PM
>>>> To: [email protected]
>>>> Subject: Re: [users] how long it takes to detect node sudden power
>>>> loss
>>>>
>>>> Hi,
>>>>
>>>> >>Does that mean it needs 2 + 2*1 = 4s before the peer can detect
>>>> the node connection loss if I suddenly unplug power supply of one node?
>>>> Yes, when the connection goes down (cable disconnected or power
>>>> supply unplugged), the loss of the connection is detected in
>>>> 4 seconds.
>>>>
>>>> -AVM
>>>>
>>>> On 3/29/2015 7:11 PM, Yao Cheng LIANG wrote:
>>>>> Dear all,
>>>>>
>>>>> When using TCP, the underlying DTM uses TCP keepalive to detect
>>>>> connection loss. If my dtmd.conf is as below:
>>>>>
>>>>> DTM_TCP_KEEPIDLE_TIME=2
>>>>> DTM_TCP_KEEPALIVE_INTVL=1
>>>>> DTM_TCP_KEEPALIVE_PROBES=2
>>>>>
>>>>> Does that mean it needs 2 + 2*1 = 4s before the peer can detect
>>>>> the node connection loss if I suddenly unplug power supply of one
>>>>> node?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Ted
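
The arithmetic discussed in this thread can be sketched as follows. The keepalive figure follows the 2 + 2*1 formula quoted above and applies only while the link is idle. The retransmission figure is an idealized estimate for the case where data is in flight, assuming the Linux minimum and maximum retransmission timeouts of 200 ms and 120 s and pure exponential backoff. The function names are hypothetical, and the real RTO depends on the measured round-trip time, so the retransmission numbers should be read as lower bounds rather than exact values.

```python
def keepalive_detection_s(idle_s, intvl_s, probes):
    """Detection time on an idle link: idle wait plus all unanswered probes."""
    return idle_s + intvl_s * probes


def retransmit_giveup_s(retries, rto_min_s=0.2, rto_max_s=120.0):
    """Idealized time until TCP gives up on unacknowledged data.

    Models the first transmission plus `retries` retransmissions, each waiting
    one RTO that doubles every time and is capped at rto_max_s.
    The 200 ms / 120 s defaults are assumptions about a stock Linux kernel.
    """
    total = 0.0
    rto = rto_min_s
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max_s)
    return total


if __name__ == "__main__":
    # Values from the dtmd.conf quoted above: KEEPIDLE=2, INTVL=1, PROBES=2.
    print("keepalive, idle link:", keepalive_detection_s(2, 1, 2), "s")
    # Default and suggested values of net.ipv4.tcp_retries2.
    print("tcp_retries2=15:", round(retransmit_giveup_s(15), 1), "s")
    print("tcp_retries2=3:", round(retransmit_giveup_s(3), 1), "s")
```

Under these assumptions the idle-link case gives 4 s, while a link with unacknowledged data gives roughly 924.6 s for the default tcp_retries2 of 15 and roughly 3 s for tcp_retries2=3. That gap is consistent with the explanation that keepalive does not help while retransmissions are still pending.
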

------------------------------------------------------------------------------
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
