Hi Nivrutti,

Please take the following change-set and test it:

============================================================

branch:      opensaf-4.7.x
parent:      8043:4a8a00097561
user:        A V Mahesh <[email protected]>
date:        Thu Sep 15 10:50:31 2016 +0530
summary:     dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]

============================================================
-AVM
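
For readers who want to see what the TCP_USER_TIMEOUT approach named in the
change-set summary amounts to at the socket level, here is a minimal Python
sketch. It is not the code from the OpenSAF patch: the helper name
enable_fail_fast and the 4000 ms value are assumptions chosen for
illustration. On Linux, TCP_USER_TIMEOUT bounds how long transmitted data may
stay unacknowledged before the kernel closes the connection.

```python
import socket


def enable_fail_fast(sock, user_timeout_ms):
    """Set TCP_USER_TIMEOUT on a TCP socket (hypothetical helper).

    Returns True if the option was applied, False if this platform's
    socket module does not expose TCP_USER_TIMEOUT (it is Linux-specific).
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return False
    # The value is in milliseconds; 0 would mean "use the system default".
    sock.setsockopt(socket.IPPROTO_TCP, opt, user_timeout_ms)
    return True


# Usage: 4000 ms is an illustrative value, not one taken from the patch.
demo_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied = enable_fail_fast(demo_sock, 4000)
demo_sock.close()
```

Unlike lowering net.ipv4.tcp_retries2, which is system-wide, this option is
set per socket, so only the connections that need fast failure detection are
affected.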


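The keepalive figure quoted further down the thread (2 + 2*1 = 4s) follows
from the three dtmd.conf settings. Below is a small sketch of that arithmetic,
assuming the usual Linux keepalive behaviour (first probe after the idle time,
then one probe per interval). The function name is hypothetical, and the
result only applies while the link is idle, as Ted points out later in the
thread.

```python
def keepalive_detection_secs(keepidle, keepintvl, probes):
    """Estimate seconds until an idle TCP link is declared dead.

    keepidle  -> DTM_TCP_KEEPIDLE_TIME   (idle seconds before the first probe)
    keepintvl -> DTM_TCP_KEEPALIVE_INTVL (seconds between probes)
    probes    -> DTM_TCP_KEEPALIVE_PROBES (unanswered probes before giving up)
    """
    return keepidle + keepintvl * probes


# Values from the dtmd.conf quoted in the thread.
estimate = keepalive_detection_secs(2, 1, 2)
print(estimate)  # 4
```

If data is in flight when the peer dies, the retransmission timer governs
instead, so this estimate is a lower bound rather than a guarantee.
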
On 9/15/2016 7:51 PM, Anders Widell wrote:
> Yes we were experimenting with the tcp_retries2 option, but the solution
> we ended up with was to use the TCP_USER_TIMEOUT socket option.
>
> regards,
>
> Anders Widell
>
>
> On 09/15/2016 03:13 PM, Nivrutti Kale wrote:
>> Hi,
>>
>> There is one way to improve the detection time: change the
>> net.ipv4.tcp_retries2 value to 3. The default value of
>> net.ipv4.tcp_retries2 is 15.
>>
>> Thanks,
>> Nivrutti
>>
>> -----Original Message-----
>> From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
>> Sent: Thursday, September 15, 2016 6:38 PM
>> To: Shu Wang <[email protected]>; [email protected]
>> Subject: Re: [users] how long it takes to detect node sudden power
>>
>> Hi,
>>
>> You could try the fix in this ticket
>> https://sourceforge.net/p/opensaf/tickets/2014/
>> and see if the scenario is the same. The patch is in
>> https://sourceforge.net/p/opensaf/staging/ci/b30d5e33e50c7eea8cc1730cbe0a0dde572621f0/
>>
>> Thanks,
>> Mathi.
>>
>>
>>> -----Original Message-----
>>> From: Shu Wang [mailto:[email protected]]
>>> Sent: Saturday, June 20, 2015 1:50 AM
>>> To: [email protected]
>>> Subject: Re: [users] how long it takes to detect node sudden power
>>>
>>> We have a similar scenario. One of our payload nodes rebooted, and it
>>> took anywhere from a few seconds to a few minutes for the other nodes to
>>> detect the node loss. Because the master controller took a few minutes to
>>> detect and react to the loss, this caused serious problems and many
>>> service units went bad. Is there any way to improve the detection time?
>>>
>>> Thank you!
>>>
>>> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917|
>>> www.NetCracker.com Proven Partner to Communications Service Providers
>>>
>>> -----Original Message-----
>>> Message: 3
>>> Date: Tue, 14 Apr 2015 09:58:51 +0000
>>> From: Yao Cheng LIANG <[email protected]>
>>> Subject: Re: [users] how long it takes to detect node sudden power
>>>           loss
>>> To: 'A V Mahesh' <[email protected]>, Mathivanan Naickan
>>>           Palanivelu      <[email protected]>
>>> Cc: "[email protected]"
>>>           <[email protected]>
>>> Message-ID: <285F6C4AD3FBC04EBAE1D68203EA87F20B037F25@asdag1>
>>> Content-Type: text/plain; charset="windows-1255"
>>>
>>> Let me give more info about my setup:
>>>
>>>
>>> 1. I have two nodes, both running as controllers.
>>>
>>> 2. Besides the OpenSAF services, I have another service unit with three
>>>    components in it.
>>>
>>> 3. These components use the Checkpoint service for data synchronization.
>>>
>>>
>>>
>>> My dtmd.conf is as below:
>>>
>>> ...
>>> DTM_INI_DIS_TIMEOUT_SECS=5
>>> DTM_TCP_KEEPIDLE_TIME=2
>>> DTM_TCP_KEEPALIVE_INTVL=1
>>> DTM_TCP_KEEPALIVE_PROBES=2
>>>
>>> I read the code and found that it uses TCP keepalive to detect failure
>>> of the peer node. However, keepalive packets are not sent until the
>>> link has been idle for some time. I think the issue is here. Suppose the
>>> "standby" node is sending something to the "active" node, and at that
>>> moment the "active" node is rebooted. The "standby" node will keep
>>> retransmitting until it reaches the maximum number of retries. During
>>> this period the link is not idle, so the keepalive mechanism does not
>>> start. This may cause the "standby" node to take a long time to detect
>>> the failure of the "active" node.
>>>
>>> Thanks.
>>>
>>>
>>>
>>> Ted
>>>
>>>
>>>
>>>
>>>
>>> From: A V Mahesh [mailto:[email protected]]
>>> Sent: Monday, April 13, 2015 10:06 PM
>>> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
>>> Cc: [email protected]
>>> Subject: Re: [users] how long it takes to detect node sudden power
>>> loss
>>>
>>> Hi,
>>>
>>> Un-comment the below line to enable trace of osafdtm in
>>> /etc/opensaf/dtmd.conf
>>>
>>> #args="--tracemask=0xffffffff"   ------>  args="--tracemask=0xffffffff"
>>>
>>> And do `export MDS_LOG_LEVEL=5` on both node consoles before
>>> `/etc/init.d/opensafd restart` to get debug MDS logs.
>>>
>>>
>>> -AVM
>>>
>>> On 4/13/2015 11:52 AM, Yao Cheng LIANG wrote:
>>> Dear AVM,
>>>
>>> Thanks. But I need to add args="--loglevel=info" to dtmd.conf so
>>> that /var/log/opensaf/osafdtm and /var/log/opensaf/mds.log can be seen,
>>> right?
>>>
>>> Ted
>>>
>>> From: A V Mahesh [mailto:[email protected]]
>>> Sent: Monday, April 13, 2015 1:03 PM
>>> To: Yao Cheng LIANG; Mathivanan Naickan Palanivelu
>>> Cc: [email protected]<mailto:opensaf-
>>> [email protected]>
>>> Subject: Re: [users] how long it takes to detect node sudden power
>>> loss
>>>
>>> Hi Ted,
>>>
>>> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
>>> I rebooted the "standby" node 30 times and found that twice it took
>>> 1~2 minutes for the "active" node to detect it
>>>
>>> Can you please share the following data from both nodes for a case where
>>> the "active" node took 1~2 minutes to detect the loss of the standby?
>>>
>>> 1) #/var/log/opensaf/osafdtm
>>> 2) #/var/log/opensaf/mds.log
>>> 3) #/var/log/messages ( syslog )
>>>
>>> 4) #top    (output at the time of detection)
>>> 5) /etc/opensaf/dtmd.conf
>>>
>>> -AVM
>>>
>>> On 4/10/2015 3:54 PM, Yao Cheng LIANG wrote:
>>> I did some tests recently. I have two controllers; I reboot one and
>>> measure how long the second takes to detect the failure of its peer. I
>>> rebooted the "standby" node 30 times and found that twice it took 1~2
>>> minutes for the "active" node to detect it. Could anyone tell me the
>>> reason and the solution?
>>>
>>> Thanks.
>>>
>>> Ted
>>>
>>> Sent from Windows Mail
>>>
>>> From: Mathivanan Naickan Palanivelu<mailto:[email protected]>
>>> Sent: Thursday, April 9, 2015 7:39 PM
>>> To: Yao Cheng LIANG<mailto:[email protected]>
>>> Cc: [email protected]<mailto:opensaf-
>>> [email protected]>, 'A V
>>> Mahesh'<mailto:[email protected]>
>>>
>>> I think since these are TCP keepalive configuration values, the
>>> connection loss would be detected immediately in the cases of abrupt
>>> power shutdown or cable unplug.
>>>
>>> Thanks,
>>> Mathi.
>>>
>>> ----- [email protected]<mailto:[email protected]> wrote:
>>>
>>>> Is there any approach to hasten this detection, because 4 seconds is
>>>> too long for some use cases?
>>>>
>>>> Br,
>>>>
>>>> Ted
>>>>
>>>> -----Original Message-----
>>>> From: A V Mahesh [mailto:[email protected]]
>>>> Sent: Monday, March 30, 2015 12:29 PM
>>>> To: [email protected]
>>>> Subject: Re: [users] how long it takes to detect node sudden power
>>>> loss
>>>>
>>>> Hi,
>>>>
>>>> >> Does that mean it needs 2 + 2*1 = 4s before the peer can detect
>>>> >> the node connection loss if I suddenly unplug power supply of one node?
>>>>
>>>> Yes, when the connection goes down (cable disconnected or power supply
>>>> unplugged), the peer detects within 4 seconds that the connection has
>>>> been lost.
>>>>
>>>>     -AVM
>>>>
>>>> On 3/29/2015 7:11 PM, Yao Cheng LIANG wrote:
>>>>> Dear all,
>>>>>
>>>>> If using TCP, the underlying DTM uses TCP keepalive to detect
>>>>> connection loss. If my dtmd.conf is as below:
>>>>>
>>>>> DTM_TCP_KEEPIDLE_TIME=2
>>>>>
>>>>> DTM_TCP_KEEPALIVE_INTVL=1
>>>>>
>>>>> DTM_TCP_KEEPALIVE_PROBES=2
>>>>>
>>>>> Does that mean it needs 2 + 2*1 = 4s before the peer can detect the
>>>>> node connection loss if I suddenly unplug the power supply of one node?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Ted
>>>>>
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>> Dive into the World of Parallel Programming The Go Parallel
>>>>> Website, sponsored by Intel and developed in partnership with
>>>>> Slashdot Media, is your hub for all things parallel software
>>>>> development, from weekly thought leadership blogs to news, videos,
>>>>> case studies, tutorials and more. Take a look and join the
>>>>> conversation now.
>>>>> http://goparallel.sourceforge.net/
>>>>> _______________________________________________
>>>>> Opensaf-users mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>>> ------------------------------
>>>
>>>
>>>
>>> ________________________________
>>> The information transmitted herein is intended only for the person or
>>> entity to which it is addressed and may contain confidential,
>>> proprietary and/or privileged material. Any review, retransmission,
>>> dissemination or other use of, or taking of any action in reliance
>>> upon, this information by persons or entities other than the intended
>>> recipient is prohibited. If you received this in error, please contact the 
>>> sender and delete the material from any computer.
>>>


------------------------------------------------------------------------------
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
