- **status**: review --> fixed
- **Milestone**: 5.0.1 --> 4.7.2
- **Comment**:
changeset: 8066:afddc603adcb
branch: opensaf-4.7.x
parent: 8043:4a8a00097561
user: A V Mahesh <mahesh.va...@oracle.com>
date: Thu Sep 15 10:50:31 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
changeset: 8067:efeaffca9483
branch: opensaf-5.0.x
parent: 8049:28129451fd38
user: A V Mahesh <mahesh.va...@oracle.com>
date: Thu Sep 15 10:52:03 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
changeset: 8068:87a09d9164d3
branch: opensaf-5.1.x
parent: 8065:019e617955ef
user: A V Mahesh <mahesh.va...@oracle.com>
date: Thu Sep 15 10:52:32 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
changeset: 8069:b30d5e33e50c
tag: tip
parent: 8064:99410ba8cc21
user: A V Mahesh <mahesh.va...@oracle.com>
date: Thu Sep 15 10:52:49 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
---
**[tickets:#2014] Rebooted controller not detected in TCP**
**Status:** fixed
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Wed Sep 14, 2016 06:38 PM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**
- [logs.tgz](https://sourceforge.net/p/opensaf/tickets/2014/attachment/logs.tgz) (84.1 kB; application/x-compressed-tar)
- [tcp_user_timeout_2014.patch](https://sourceforge.net/p/opensaf/tickets/2014/attachment/tcp_user_timeout_2014.patch) (5.5 kB; application/octet-stream)
OS environment:
Debian Jessie (OpenSAF is running on bare metal, no containers or VMs)
4.4.7 kernel
Network: eth0, bonded, OVS (I have tried all of them and the problem occurs in every configuration)
In 20% of the cases a "reboot -f" on controller2 is neither detected nor acted on.
This is what mds.log shows .....
Sep 7 6:44:23.918566 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:23.918595 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:34.018662 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:34.018751 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:34.018789 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:34.018818 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:44.118832 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:44.118919 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:44.118955 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:44.118984 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:54.218987 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:54.219085 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:54.219139 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:54.219168 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Still, nothing in the syslog indicates that controller2 has left the
cluster. This is with TCP.
When the node comes back online (without OpenSAF being started), controller1
finally notices and fails over the apps.
When the reboot is not detected, the TCP keepalives stop and the connection
goes into retransmits instead. I have attached two tshark sessions captured on
controller1, covering the traffic between controller1 and controller2. The
failed reboot detection is captured in "ctrl2_failed_detection.trc", and a
working detection is in "ctrl2_working.trc". I have also attached all logs from
/var/log/opensaf and the syslog (all from controller1).
It appears to me that we are hitting something similar to
"http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect"
// Jonas
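The changeset summaries above name TCP_USER_TIMEOUT as the remedy. The following is a minimal sketch of that idea on Linux, not the attached patch: while unacknowledged data is being retransmitted, keepalive probes are not sent, so a cap on how long data may stay unacknowledged is what lets the connection fail quickly. The helper name `enable_fail_fast` and all timer values are assumptions chosen for illustration; they are not taken from the OpenSAF source.

```python
import socket

# Linux value of TCP_USER_TIMEOUT from <linux/tcp.h>; used only as a fallback
# if this Python build does not export the constant.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def enable_fail_fast(sock, idle_s=2, interval_s=1, probes=2, user_timeout_ms=4000):
    """Hypothetical helper: keepalive plus TCP_USER_TIMEOUT on a TCP socket.

    All defaults are illustrative assumptions, not OpenSAF settings.
    """
    # Keepalive detects a dead peer while the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    # TCP_USER_TIMEOUT caps, in milliseconds, how long sent data may remain
    # unacknowledged before the kernel drops the connection. This covers the
    # case where retransmits are pending and keepalive probes are not sent.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, user_timeout_ms)
    return sock


if __name__ == "__main__":
    s = enable_fail_fast(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print(s.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT))
    s.close()
```

With a setup like this, a send to a peer that was rebooted with "reboot -f" would be expected to fail with a timeout error once the user timeout expires, instead of waiting for the full retransmission backoff.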