Hi Jonas,

OK, I just pushed it; please test once on 4.7:

============================================================

branch:      opensaf-4.7.x
parent:      8043:4a8a00097561
user:        A V Mahesh <mahesh.va...@oracle.com>
date:        Thu Sep 15 10:50:31 2016 +0530
summary:     dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]

============================================================

-AVM

On 9/15/2016 12:08 AM, Jonas Arndt wrote:

Mahesh,

Can we get this back-ported to 4.7.x as well?

Cheers,

// Jonas

------------------------------------------------------------------------

*[tickets:#2014] <https://sourceforge.net/p/opensaf/tickets/2014/> Rebooted controller not detected in TCP*

*Status:* review
*Milestone:* 5.0.1
*Created:* Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
*Last Updated:* Wed Sep 14, 2016 04:51 AM UTC
*Owner:* A V Mahesh (AVM)
*Attachments:*

  * logs.tgz
    <https://sourceforge.net/p/opensaf/tickets/2014/attachment/logs.tgz>
    (84.1 kB; application/x-compressed-tar)
  * tcp_user_timeout_2014.patch
    <https://sourceforge.net/p/opensaf/tickets/2014/attachment/tcp_user_timeout_2014.patch>
    (5.5 kB; application/octet-stream)

OS environment:

Debian Jessie (OpenSAF is running on bare metal, no containers or VMs)
4.4.7 kernel
Network: eth0, bonded, OVS (I have tried all of them and the problem occurs in all configurations)

In 20% of cases, a "reboot -f" on controller2 is not detected or acted on. The mds.log shows:

Sep 7 6:44:23.918566 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:23.918595 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:34.018662 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:34.018751 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:34.018789 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:34.018818 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:44.118832 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:44.118919 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:44.118955 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:44.118984 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:54.218987 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:54.219085 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:54.219139 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:54.219168 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>

Still, nothing in the syslog indicates that controller2 has left the cluster. This is with TCP. When the node comes back online (without OpenSAF being started), controller1 finally notices and fails over the apps.

When the reboot is not detected, the TCP keepalives stop and the connection goes into retransmits instead. I have attached two tshark sessions captured on controller1, covering traffic between controller1 and controller2. The failed reboot detection is captured in "ctrl2_failed_detection.trc", and a working detection is in "ctrl2_working.trc". I have also attached all logs in /var/log/opensaf and the syslog (all from controller1).

It appears to me that we are hitting something similar to "http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect".
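
For reference, the TCP_USER_TIMEOUT option named in the commit summary can be set per socket on Linux. According to tcp(7), it caps how long transmitted data may remain unacknowledged before the kernel closes the connection, and when used together with keepalive it overrides the keepalive settings in deciding when to close. The following is a minimal sketch only, not the code from the attached patch: the helper name set_failfast_opts and all timer values are assumptions chosen for illustration.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not taken from the OpenSAF patch).
 * Enables keepalive probing and additionally sets TCP_USER_TIMEOUT so
 * that a peer which vanished while data was in flight is detected
 * within a bounded time instead of waiting out retransmission backoff.
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
static int set_failfast_opts(int fd, int idle_s, int intvl_s, int cnt,
                             unsigned int user_timeout_ms)
{
    int on = 1;

    /* Turn on keepalive probes for an otherwise idle connection. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    /* Idle seconds before the first probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    /* Seconds between successive probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
        return -1;
    /* Number of unanswered probes before the connection is dropped. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
        return -1;
    /* Upper bound, in milliseconds, on unacknowledged transmitted data. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_timeout_ms,
                   sizeof(user_timeout_ms)) < 0)
        return -1;

    return 0;
}
```

A caller could invoke, for example, set_failfast_opts(fd, 2, 1, 3, 5000) on each TCP socket. With these illustrative values, data left unacknowledged for roughly 5 seconds would cause the kernel to close the connection, so the next send or receive on that socket reports an error.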

// Jonas

------------------------------------------------------------------------

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.



------------------------------------------------------------------------------


_______________________________________________
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
