The constant TCP_USER_TIMEOUT is not part of LSB, so in any case we will need
to add the following to our code:
~~~
#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18
#endif
~~~
We should bump the minimum required Linux version to 3.18 after introducing
this fix, since the TCP_USER_TIMEOUT feature didn't work properly in earlier
Linux versions according to the article *Improving HA Failures with TCP
Timeouts* you referred to.
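
To make the use of the constant concrete, here is a minimal sketch of how the option could be applied to a connected socket. The helper name `set_tcp_user_timeout` is hypothetical and not taken from the attached patch, and the timeout value would have to be chosen by us; this only illustrates the `setsockopt()` call together with the fallback define.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Fallback for build environments whose headers lack the constant. */
#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18
#endif

/*
 * Hypothetical helper (not taken from the patch): ask the kernel to abort
 * the connection if transmitted data stays unacknowledged for longer than
 * timeout_ms milliseconds. Returns 0 on success, -1 with errno set on error.
 */
int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}
```

A caller would invoke it once per TCP socket, for example `set_tcp_user_timeout(fd, 10000)` for an (arbitrarily chosen) 10 second limit, and treat a return value of -1 as "option not supported or not applied".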
---
**[tickets:#2014] Rebooted controller not detected in TCP**
**Status:** assigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Mon Sep 12, 2016 07:10 AM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**
- [logs.tgz](https://sourceforge.net/p/opensaf/tickets/2014/attachment/logs.tgz) (84.1 kB; application/x-compressed-tar)
- [tcp_user_timeout_2014.patch](https://sourceforge.net/p/opensaf/tickets/2014/attachment/tcp_user_timeout_2014.patch) (5.5 kB; application/octet-stream)

In 20% of the cases a "reboot -f" on controller2 is not detected and acted on.
The mds.log contains the following:
Sep 7 6:44:23.918566 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:23.918595 osafamfd[41365] ERR |MDS_SND_RCV:
Anchor=<0x0002020f,1790>
Sep 7 6:44:34.018662 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error
occured
Sep 7 6:44:34.018751 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on
red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:34.018789 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:34.018818 osafamfd[41365] ERR |MDS_SND_RCV:
Anchor=<0x0002020f,1790>
Sep 7 6:44:44.118832 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error
occured
Sep 7 6:44:44.118919 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on
red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:44.118955 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:44.118984 osafamfd[41365] ERR |MDS_SND_RCV:
Anchor=<0x0002020f,1790>
Sep 7 6:44:54.218987 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error
occured
Sep 7 6:44:54.219085 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on
red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:54.219139 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:54.219168 osafamfd[41365] ERR |MDS_SND_RCV:
Anchor=<0x0002020f,1790>
Still, there is nothing in the syslog indicating that controller2 has left the
cluster. This is for TCP.
When the node comes back online (without OpenSAF being started), controller1
finally notices and fails over the apps.
When the reboot is not detected, the TCP keepalives stop and the connection
goes into retransmits instead. I have attached two tshark sessions captured on
controller1, capturing traffic between controller1 and controller2. The failed
reboot detection is captured in "ctrl2_failed_detection.trc", and the working
detection is in "ctrl2_working.trc". I have also attached all logs in
/var/log/opensaf and the syslog (all from controller1).
It appears to me that we are hitting something similar to
"http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect"
// Jonas
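
To give a feel for how long the retransmit phase described above can mask a dead peer, here is a rough sketch of the arithmetic. It assumes the usual Linux defaults, which are not stated in the ticket: a minimum retransmission timeout of 200 ms, a maximum of 120 s, exponential backoff, and `tcp_retries2` = 15. The function name is hypothetical, and the result is only the theoretical figure, not a measurement from this cluster.

```c
/* Assumed Linux defaults, in milliseconds (not taken from the ticket). */
#define ASSUMED_RTO_MIN_MS 200UL
#define ASSUMED_RTO_MAX_MS 120000UL

/*
 * Hypothetical helper: sum the backoff intervals for the initial
 * transmission plus `retries` retransmissions, doubling the timeout
 * each time and capping it at the assumed maximum.
 */
unsigned long worst_case_retransmit_ms(unsigned int retries)
{
    unsigned long rto = ASSUMED_RTO_MIN_MS;
    unsigned long total = 0;

    for (unsigned int i = 0; i <= retries; i++) {
        total += rto;
        rto *= 2;
        if (rto > ASSUMED_RTO_MAX_MS)
            rto = ASSUMED_RTO_MAX_MS;
    }
    return total;
}
```

Under these assumptions `worst_case_retransmit_ms(15)` evaluates to 924600, i.e. about 924.6 seconds, which is why a connection stuck in retransmits can take many minutes to be declared dead unless something like TCP_USER_TIMEOUT bounds it.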