Hi Jonas,
OK, I just pushed; please test once on 4.7:
branch: opensaf-4.7.x
parent: 8043:4a8a00097561
user: A V Mahesh
date: Thu Sep 15 10:50:31 2016 +0530
summary: dtm: TCP Improve node failFast with T
- **status**: review --> fixed
- **Milestone**: 5.0.1 --> 4.7.2
- **Comment**:
changeset: 8066:afddc603adcb
branch: opensaf-4.7.x
parent: 8043:4a8a00097561
user: A V Mahesh
date: Thu Sep 15 10:50:31 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIM
Mahesh,
Can we get this back-ported to 4.7.x as well?
Cheers,
// Jonas
---
** [tickets:#2014] Rebooted controller not detected in TCP**
**Status:** review
**Milestone:** 5.0.1
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Wed Sep 14, 2016 04:51 AM UTC
**Owner:**
- **status**: assigned --> review
- **Milestone**: 4.7.2 --> 5.0.1
- **Comment**:
Split-brain is a different issue, and we have ticket #2030 to debug the
split-brain case,
so I published the patch for this ticket.
---
** [tickets:#2014] Rebooted controller not detected in TCP**
**Status:** re
Anders, it is possible. I am seeing the same entry in my system when I get the
split-brain.
After I fixed the MAC in OVS the problem went away though.
---
** [tickets:#2014] Rebooted controller not detected in TCP**
**Status:** assigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:2
Maybe your split-brain problems could be related to the ticket [#2030] that I
just filed on DTM?
---
** [tickets:#2014] Rebooted controller not detected in TCP**
**Status:** assigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Tue Sep 13, 20
I actually need to do more tests. From the patch's point of view I think it is
looking good. The split brain seems to be related to OVS bringing up
the port with a new MAC address every time. I have run some tests on eth0
(without OVS) and have not been able to reproduce the split brain. Note
- Description has changed:
Diff:
--- old
+++ new
@@ -1,3 +1,10 @@
+OS environment:
+
+Debian Jessie (OpenSAF is running on bare metal, no containers or VMs)
+4.4.7 kernel
+Network eth0, bonded, OVS (I have tried all of them and the problem is
there in all configurations)
+
+
I
>>Tested the patch and ended up with split brain after 4th reboot. Both
>>controllers think they are active while they can ping each other perfectly
>>fine. I will try to reproduce and collect logs
Can you please elaborate on the sequence of tests in which you end up with
split brain:
1) is a
Tested the patch and ended up with split brain after 4th reboot. Both
controllers think they are active while they can ping each other perfectly
fine. I will try to reproduce and collect logs
---
** [tickets:#2014] Rebooted controller not detected in TCP**
**Status:** assigned
**Milestone:**
The constant TCP_USER_TIMEOUT is not part of LSB, so we will anyhow need to add
the following in our code:
~~~
#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18
#endif
~~~
We should bump the minimum required Linux version to 3.18 after introducing
this fix, since the TCP_USER_TIMEOUT feature
- Attachments has changed:
Diff:
--- old
+++ new
@@ -1 +1,2 @@
logs.tgz (84.1 kB; application/x-compressed-tar)
+tcp_user_timeout_2014.patch (5.5 kB; application/octet-stream)
- **Comment**:
Even though I have a Linux kernel > 2.6.37 (3.0.13-0.27-default), somehow my system
` or ` does
I agree. The article `Improving HA Failures with TCP Timeouts` below gives
more details on the TCP_USER_TIMEOUT socket option:
`http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html`
(please see the commit message for the TCP_USER_TIMEOUT socket option).
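Since the constant may have to be defined locally when the system headers lack it, a runtime check can tell whether the running kernel actually accepts the option. The following is only a minimal sketch: the helper name `kernel_has_tcp_user_timeout` is hypothetical and not taken from the OpenSAF patch, and the fallback value 18 is the one quoted elsewhere in this thread.

~~~c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fallback for headers that do not define the constant (Linux value). */
#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18
#endif

/*
 * Hypothetical helper (assumption, not from the patch).
 * Returns 1 if the kernel accepts TCP_USER_TIMEOUT on a TCP socket,
 * 0 if it rejects the option with ENOPROTOOPT, and -1 on any other error.
 */
static int kernel_has_tcp_user_timeout(void)
{
	unsigned int value = 0;
	socklen_t len = sizeof(value);
	int result;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	/* Reading the option is enough to learn whether it is known. */
	if (getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &value, &len) == 0)
		result = 1;
	else
		result = (errno == ENOPROTOOPT) ? 0 : -1;

	close(fd);
	return result;
}
~~~

A result of 0 would mean the code compiled only because of the local fallback define while the kernel itself does not know the option.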
---
** [tickets:#2014] Reb
Agreed about the global nature of tcp_retries2. These parameters can be set on
a socket level as well, right? Once we have the right parameter we should apply
it on the sockets and not in /etc/sysctl.conf
---
** [tickets:#2014] Rebooted controller not detected in TCP**
**Status:** assigned
**
tcp_retries2 is a global configuration parameter that affects the whole node.
We shouldn't assume that OpenSAF is the only user of TCP on the node, so we
should not rely on changing this parameter. TCP_USER_TIMEOUT can be set per
socket, so if it works it would be the preferred solution.
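For illustration, setting the timeout on a single socket could look roughly like the sketch below. It is not the code from the patch: the helper name `set_tcp_user_timeout` and any timeout value passed to it are assumptions. The option takes an unsigned int in milliseconds and only affects the socket it is set on, so other TCP users on the node are untouched.

~~~c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Fallback for headers that do not define the constant (Linux value). */
#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18
#endif

/*
 * Hypothetical helper (assumption, not from the patch).
 * Limits how long transmitted data may stay unacknowledged on this
 * socket before the kernel closes the connection.
 * Returns 0 on success, -1 on failure with errno set by setsockopt().
 */
static int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  &timeout_ms, sizeof(timeout_ms));
}
~~~

A caller would invoke it right after creating or accepting the socket, for example `set_tcp_user_timeout(fd, 4000)` for a 4 second limit (the value is purely an example).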
---
Maybe we can use TCP_USER_TIMEOUT, as suggested here:
http://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux
---
** [tickets:#2014] Rebooted controller not detected in TCP**
**Status:** assigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 P
>> It appears to me that we are hitting something similar to
>> "http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect"
Have you customized the configuration below in /etc/opensaf/dtmd.conf?
The above case disconnection is
- **status**: unassigned --> assigned
- **assigned_to**: A V Mahesh (AVM)
- **Component**: unknown --> dtm
- **Part**: lib --> -
- **Priority**: critical --> major
- **Comment**:
Can you please provide details of your cluster environment (OS / VM / container)?
---
** [tickets:#2014] Rebooted con
---
** [tickets:#2014] Rebooted controller not detected in TCP**
**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Thu Sep 08, 2016 06:20 PM UTC
**Owner:** nobody
**Attachments:**
-
[logs.tgz](https://sourceforge.net/p/ope