tcp_retries2 is a global configuration parameter that affects the whole node. 
We shouldn't assume that OpenSAF is the only user of TCP on the node, so we 
should not rely on changing this parameter. TCP_USER_TIMEOUT can be set per 
socket, so if it works it would be the preferred solution.


---

** [tickets:#2014] Rebooted controller not detected in TCP**

**Status:** assigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Fri Sep 09, 2016 11:33 AM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**

- 
[logs.tgz](https://sourceforge.net/p/opensaf/tickets/2014/attachment/logs.tgz) 
(84.1 kB; application/x-compressed-tar)


In 20% of the cases a "reboot -f" on  controller2 is not detected and acted on. 
What is in the mds.log is .....

Sep  7  6:44:23.918566 osafamfd[41365] ERR  |MDS_SND_RCV: Adest=<0x00000000,1>
Sep  7  6:44:23.918595 osafamfd[41365] ERR  |MDS_SND_RCV: 
Anchor=<0x0002020f,1790>
Sep  7  6:44:34.018662 osafamfd[41365] ERR  |MDS_SND_RCV: Timeout or Error 
occured
Sep  7  6:44:34.018751 osafamfd[41365] ERR  |MDS_SND_RCV: Timeout occured on 
red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep  7  6:44:34.018789 osafamfd[41365] ERR  |MDS_SND_RCV: Adest=<0x00000000,1>
Sep  7  6:44:34.018818 osafamfd[41365] ERR  |MDS_SND_RCV: 
Anchor=<0x0002020f,1790>
Sep  7  6:44:44.118832 osafamfd[41365] ERR  |MDS_SND_RCV: Timeout or Error 
occured
Sep  7  6:44:44.118919 osafamfd[41365] ERR  |MDS_SND_RCV: Timeout occured on 
red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep  7  6:44:44.118955 osafamfd[41365] ERR  |MDS_SND_RCV: Adest=<0x00000000,1>
Sep  7  6:44:44.118984 osafamfd[41365] ERR  |MDS_SND_RCV: 
Anchor=<0x0002020f,1790>
Sep  7  6:44:54.218987 osafamfd[41365] ERR  |MDS_SND_RCV: Timeout or Error 
occured
Sep  7  6:44:54.219085 osafamfd[41365] ERR  |MDS_SND_RCV: Timeout occured on 
red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep  7  6:44:54.219139 osafamfd[41365] ERR  |MDS_SND_RCV: Adest=<0x00000000,1>
Sep  7  6:44:54.219168 osafamfd[41365] ERR  |MDS_SND_RCV: 
Anchor=<0x0002020f,1790>

Still, there is nothing in the syslog indicating that controller2 has left the 
cluster. This is for TCP.
When the node comes back on line (without opensaf being started) controller 1 
notice finally and fail over apps. 

When the reboot is not detected the tcp keep alives stops and goes into 
retransmits instead. I have attached 2 tshark sessions captured from 
controller1, capturing traffic between controller1 and controller2. The failed 
reboot detect is captured in "ctrl2_failed_detection.trc" and for a working 
detection there is a file "ctrl2_working.trc" I have also attached all logs in 
/var/log/opensaf and the syslog (all from controller one).

It appears to me that we are hitting something similar like 
"http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect";

// Jonas


---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.
------------------------------------------------------------------------------
_______________________________________________
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets

Reply via email to