Re: [tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-14 Thread A V Mahesh
Hi Jonas, OK, I just pushed; please test once on 4.7: branch: opensaf-4.7.x parent: 8043:4a8a00097561 user: A V Mahesh date: Thu Sep 15 10:50:31 2016 +0530 summary: dtm: TCP Improve node failFast with T

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-14 Thread A V Mahesh (AVM)
- **status**: review --> fixed
- **Milestone**: 5.0.1 --> 4.7.2
- **Comment**: changeset: 8066:afddc603adcb branch: opensaf-4.7.x parent: 8043:4a8a00097561 user: A V Mahesh date: Thu Sep 15 10:50:31 2016 +0530 summary: dtm: TCP Improve node failFast with TCP_USER_TIM

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-14 Thread Jonas Arndt
Mahesh, Can we get this back-ported to 4.7.x as well? Cheers, // Jonas --- ** [tickets:#2014] Rebooted controller not detected in TCP** **Status:** review **Milestone:** 5.0.1 **Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt **Last Updated:** Wed Sep 14, 2016 04:51 AM UTC **Owner:**

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-13 Thread A V Mahesh (AVM)
- **status**: assigned --> review
- **Milestone**: 4.7.2 --> 5.0.1
- **Comment**: Split-brain is a different issue, and we have ticket #2030 to debug the split-brain case, so I published the patch for this ticket.

--- ** [tickets:#2014] Rebooted controller not detected in TCP** **Status:** re

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-13 Thread Jonas Arndt
Anders, it is possible. I am seeing the same entry in my system when I get the split-brain. After I fixed the MAC in OVS the problem went away though. --- ** [tickets:#2014] Rebooted controller not detected in TCP** **Status:** assigned **Milestone:** 4.7.2 **Created:** Thu Sep 08, 2016 06:2

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-13 Thread Anders Widell
Maybe your split-brain problems could be related to the ticket [#2030] that I just filed on DTM? --- ** [tickets:#2014] Rebooted controller not detected in TCP** **Status:** assigned **Milestone:** 4.7.2 **Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt **Last Updated:** Tue Sep 13, 20

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-13 Thread Jonas Arndt
I actually need to do more tests. From the patch's point of view I think it is looking good. The split brain seems to be related to OVS bringing up the port with a new MAC address every time. I have run some tests on eth0 (without OVS) and have not been able to reproduce the split brain. Note

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-12 Thread A V Mahesh (AVM)
- Description has changed:

Diff:

--- old
+++ new
@@ -1,3 +1,10 @@
+OS environment:
+
+Debian Jessie (OpenSAF is running on bare metal, no containers or VMs)
+4.4.7 kernel
+Network eth0, bonded, OVS (I have tried all of them and the problem is there in all configurations)
+
+
 I

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-12 Thread A V Mahesh (AVM)
>> Tested the patch and ended up with split brain after 4th reboot. Both controllers think they are active while they can ping each other perfectly fine. I will try to reproduce and collect logs

Can you please describe the test sequence in which you end up with split brain: 1) is a

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-12 Thread Jonas Arndt
Tested the patch and ended up with split brain after 4th reboot. Both controllers think they are active while they can ping each other perfectly fine. I will try to reproduce and collect logs --- ** [tickets:#2014] Rebooted controller not detected in TCP** **Status:** assigned **Milestone:**

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-12 Thread Anders Widell
The constant TCP_USER_TIMEOUT is not part of LSB, so we will anyhow need to add the following in our code:

~~~
#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18
#endif
~~~

We should bump the minimum required Linux version to 2.6.37 after introducing this fix, since the TCP_USER_TIMEOUT feature
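
For readers following the thread, here is a minimal sketch of how the fallback definition could be combined with a per-socket `setsockopt()` call. The helper names `set_tcp_user_timeout` and `get_tcp_user_timeout` are hypothetical and are not taken from the OpenSAF patch; the timeout value is whatever the caller chooses.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Fallback for libc headers that do not define the option (value used by Linux). */
#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18
#endif

/* Hypothetical helper, not from the OpenSAF patch: ask the kernel to abort
 * the connection once transmitted data has stayed unacknowledged for
 * timeout_ms milliseconds. Returns 0 on success, -1 with errno set on error. */
int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  &timeout_ms, sizeof(timeout_ms));
}

/* Hypothetical helper: read the current value back into *timeout_ms.
 * Returns 0 on success, -1 with errno set on error. */
int get_tcp_user_timeout(int fd, unsigned int *timeout_ms)
{
	socklen_t len = sizeof(*timeout_ms);

	return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms, &len);
}
```

On a kernel without the feature, `setsockopt()` is expected to fail with `ENOPROTOOPT`, so a caller would probably want to log that case and continue rather than abort.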

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-12 Thread A V Mahesh (AVM)
- Attachments has changed:

Diff:

--- old
+++ new
@@ -1 +1,2 @@
 logs.tgz (84.1 kB; application/x-compressed-tar)
+tcp_user_timeout_2014.patch (5.5 kB; application/octet-stream)

- **Comment**: Even though I have Linux kernel > 2.6.37 (3.0.13-0.27-default), somehow my system ` or ` does

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-11 Thread A V Mahesh (AVM)
I agree, the below article on `Improving HA Failures with TCP Timeouts` gives more details on the TCP_USER_TIMEOUT socket option `http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html` (please see the TCP socket option TCP_USER_TIMEOUT commit message). --- ** [tickets:#2014] Reb

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-09 Thread Jonas Arndt
Agreed about the global nature of tcp_retries2. These parameters can be set on a socket level as well, right? Once we have the right parameter we should apply it on the sockets and not in /etc/sysctl.conf --- ** [tickets:#2014] Rebooted controller not detected in TCP** **Status:** assigned **

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-09 Thread Anders Widell
tcp_retries2 is a global configuration parameter that affects the whole node. We shouldn't assume that OpenSAF is the only user of TCP on the node, so we should not rely on changing this parameter. TCP_USER_TIMEOUT can be set per socket, so if it works it would be the preferred solution. ---
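
To illustrate why the default tcp_retries2 leads to slow failure detection, the following sketch estimates the timeout implied by a given tcp_retries2 value. It assumes the usual Linux constants (minimum RTO of 200 ms, maximum RTO of 120 s) and a connection whose RTO starts at the minimum; the function name `retries2_timeout_ms` is hypothetical. Real connections measure their own RTO, so treat the result as a lower bound.

```c
/* Assumed Linux defaults: TCP_RTO_MIN = 200 ms, TCP_RTO_MAX = 120 s. */
#define RTO_MIN_MS 200UL
#define RTO_MAX_MS 120000UL

/* Hypothetical helper: sum the exponentially backed-off retransmission
 * timeouts for the given tcp_retries2 value, capping each at RTO_MAX_MS.
 * Returns the estimated time in milliseconds before the kernel gives up. */
unsigned long retries2_timeout_ms(unsigned int retries2)
{
	unsigned long total = 0;
	unsigned long rto = RTO_MIN_MS;
	unsigned int i;

	for (i = 0; i <= retries2; i++) {
		total += rto;
		rto = (rto * 2 > RTO_MAX_MS) ? RTO_MAX_MS : rto * 2;
	}
	return total;
}
```

With the default of 15 this gives 924600 ms (about 15 minutes), while a value of 3 gives 3000 ms. Lowering the sysctl would shorten detection for every TCP user on the node, which is the concern raised above.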

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-09 Thread Anders Widell
Maybe we can use TCP_USER_TIMEOUT, as suggested here: http://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux --- ** [tickets:#2014] Rebooted controller not detected in TCP** **Status:** assigned **Milestone:** 4.7.2 **Created:** Thu Sep 08, 2016 06:20 P

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-08 Thread A V Mahesh (AVM)
>> It appears to me that we are hitting something similar to "http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect"

Have you customized the below configuration in /etc/opensaf/dtmd.conf? The above case disconnection is
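
As background for the keepalive discussion, here is a minimal sketch of how TCP keepalive is typically enabled per socket on Linux. The helper names and the parameter values are hypothetical; they are not the dtmd.conf settings referred to above, which are not shown in this excerpt.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable keepalive on fd. Probing starts after idle_s
 * seconds without traffic, repeats every intvl_s seconds, and the connection
 * is dropped after count unanswered probes. Returns 0 on success, -1 on error. */
int enable_tcp_keepalive(int fd, int idle_s, int intvl_s, int count)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
		return -1;
	return 0;
}

/* Approximate detection time in seconds for an idle connection. */
int keepalive_detect_seconds(int idle_s, int intvl_s, int count)
{
	return idle_s + intvl_s * count;
}
```

The estimate only holds while the connection is idle. Once unacknowledged data is in flight, the retransmission timer governs when the connection is dropped, which is the behaviour described in the linked Stack Overflow question.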

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-08 Thread A V Mahesh (AVM)
- **status**: unassigned --> assigned
- **assigned_to**: A V Mahesh (AVM)
- **Component**: unknown --> dtm
- **Part**: lib --> -
- **Priority**: critical --> major
- **Comment**: Can you please provide details of your cluster environment (OS / VM / container)?

--- ** [tickets:#2014] Rebooted con

[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP

2016-09-08 Thread Jonas Arndt
--- ** [tickets:#2014] Rebooted controller not detected in TCP** **Status:** unassigned **Milestone:** 4.7.2 **Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt **Last Updated:** Thu Sep 08, 2016 06:20 PM UTC **Owner:** nobody **Attachments:** - [logs.tgz](https://sourceforge.net/p/ope