Re: Help! HAProxy randomly failing health checks!

2016-03-24 Thread Zachary Punches
Hey guys, just following up. Still running into the issue.



Re: Help! HAProxy randomly failing health checks!

2016-03-20 Thread Zachary Punches
hake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60376 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 68.116.153.225:57824 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60365 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60364 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60362 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:37490 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 108.59.8.48:43566 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:59763 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:59760 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60319 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60299 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60293 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60292 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60284 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60282 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:38664 
[17/Mar/2016:18:37:45.736] shared_incoming/2: SSL handshake failure
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60270 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:33270 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:33273 
[17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL 
handshake
Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60089 
[17/Mar/2016:18:37:06.938] shared_incoming~ shared_incoming/ 
-1/-1/-1/-1/0 400 187 - - CR-- 314/314/0/0/0 0/0 ""
Mar 17 18:37:45 localhost haproxy[28703]: 109.154.74.227:53964 
[17/Mar/2016:18:37:06.938] shared_incoming shared_incoming/ 
-1/-1/-1/-1/0 400 0 - - CR-- 313/313/0/0/0 0/0 ""
Mar 17 18:37:45 localhost haproxy[28703]: 66.87.151.25:3325 
[17/Mar/2016:18:37:06.938] shared_incoming shared_incoming/ 
-1/-1/-1/-1/0 400 0 - - CR-- 312/312/0/0/0 0/0 ""
Mar 17 18:37:45 localhost haproxy[28703]: 108.59.8.48:33611 
[17/Mar/2016:18:36:55.938] shared_incoming provedmedia/provedmedia_http 
279/0/0/-1/279 -1 0 - - CD-- 311/311/91/91/0 0/0 "GET 
/?a=61=22008=7346_0_1=1_0_0_0_0_2102824_0_571_61811_0_0 HTTP/1.1"
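
To see which client addresses account for the handshake errors in an excerpt like the one above, a small tally can help. This is only a sketch: it assumes each log entry sits on a single line (the wrapping above comes from the mail client), and the regular expression and the function name `tally_handshake_errors` are illustrative, not something HAProxy provides.

```python
import re
from collections import Counter

# Matches entries such as:
#   ... haproxy[28703]: 54.218.28.138:60376 [17/Mar/2016:18:37:45.736]
#   shared_incoming/2: Connection error during SSL handshake
# (shown wrapped here; the real entry is expected on one line)
HANDSHAKE_RE = re.compile(
    r"haproxy\[\d+\]: (?P<ip>\d+\.\d+\.\d+\.\d+):\d+ "
    r"\[[^\]]+\] (?P<listener>\S+): (?P<msg>.*SSL handshake.*)$"
)


def tally_handshake_errors(lines):
    """Count SSL handshake error entries per (client IP, message)."""
    counts = Counter()
    for line in lines:
        match = HANDSHAKE_RE.search(line.strip())
        if match:
            counts[(match.group("ip"), match.group("msg"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python tally.py < haproxy.log
    for (ip, msg), n in tally_handshake_errors(sys.stdin).most_common():
        print(f"{n:6d}  {ip:15s}  {msg}")
```

Regular HTTP log entries (the ones carrying timers and a status code) do not match the pattern and are skipped.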


Re: Help! HAProxy randomly failing health checks!

2016-03-19 Thread Zachary Punches
I went ahead and added the performance tuning you recommended (changing the 
maxconn to 1024). Hopefully this adds some stability.

As for the port, we’re using 1027 for our SSL traffic vs 443. We are currently 
getting SSL traffic that isn’t always failing on handshake.

As for what is in front of our HAP:
Our clients have HAP servers that forward requests off to our HAP setup.


Re: Help! HAProxy randomly failing health checks!

2016-03-19 Thread Zachary Punches
Yeah, port 1027 is used for health checks over SSL.

This HAP forwards requests off to our databases. The databases have a string in 
a table that indicates that the HAP instance can move all the way through the 
entire process before it lights up as green.

Our health checks in Route 53 are set up to ping 1027 as the SSL port.

From: Igor Cicimov 
<ig...@encompasscorporation.com<mailto:ig...@encompasscorporation.com>>
Date: Thursday, March 17, 2016 at 4:18 PM
To: Zachary Punches <zpunc...@getcake.com<mailto:zpunc...@getcake.com>>
Cc: Baptiste <bed...@gmail.com<mailto:bed...@gmail.com>>, 
"haproxy@formilux.org<mailto:haproxy@formilux.org>" 
<haproxy@formilux.org<mailto:haproxy@formilux.org>>
Subject: Re: Help! HAProxy randomly failing health checks!

So is port 1027 used for health checks over SSL or not? I don't see any ssl 
settings on that port.
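
One way to check whether a given port actually completes a TLS handshake is to probe it directly. The sketch below uses only Python's standard `ssl` and `socket` modules. The function name `probe_tls` and the placeholder hostname in the usage note are assumptions for illustration. Certificate verification is deliberately switched off, because the question here is whether the handshake completes at all, not whether the certificate is trusted.

```python
import socket
import ssl
import time


def probe_tls(host, port, timeout=5.0):
    """Try a TCP connect plus TLS handshake; report outcome and elapsed time."""
    context = ssl.create_default_context()
    # Only the handshake itself is being tested, so skip certificate checks.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return {
                    "ok": True,
                    "protocol": tls.version(),
                    "elapsed": time.monotonic() - start,
                    "error": None,
                }
    except OSError as exc:  # includes ssl.SSLError, timeouts and resets
        return {
            "ok": False,
            "protocol": None,
            "elapsed": time.monotonic() - start,
            "error": repr(exc),
        }
```

Calling something like `probe_tls("proxy.example.net", 1027)` against a listener that has no `ssl` keyword on its `bind` line would be expected to come back with `ok` set to `False`, since a plain listener does not answer the TLS ClientHello with a handshake.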


Re: Help! HAProxy randomly failing health checks!

2016-03-19 Thread Zachary Punches
I’m not, these guys aren’t sitting behind an ELB. They sit behind Route 53 
routing. If one of the proxy boxes fails 3 checks in 30 seconds (with 4 checks 
done a second), then Route 53 changes its routing from the first proxy box to the 
second.
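
The failover rule as described here (3 failed checks inside a 30-second span) can be sketched as a sliding-window test over failure timestamps. This models only the description in this message, not Route 53's documented behaviour. The function name `should_fail_over` and its parameters are hypothetical.

```python
def should_fail_over(failure_times, threshold=3, window=30.0):
    """Return True if `threshold` failures fall within any `window`-second span.

    failure_times: iterable of failure timestamps in seconds.
    """
    times = sorted(failure_times)
    # Compare each failure with the one `threshold - 1` positions later;
    # if they are close enough, that whole group sits inside one window.
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False
```

With 4 checks a second, a hang of about a second is enough to produce 3 failures well inside one window, which would fit the pattern of brief stalls triggering a switch.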




On 3/15/16, 9:46 PM, "Baptiste"  wrote:

>Maybe you're checking a third party VM :)
>


Re: Help! HAProxy randomly failing health checks!

2016-03-19 Thread Zachary Punches
I wanna say the average is like 4-6 connections a second? Super minimal.

From what I’ve seen in the logs during the SSL errors, the log hangs, then 
outputs a bunch of SSL errors all at once.

Here is the output from sysctl -p:
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296

What would lowering the tune.ssl.default-dh-param to 1024 do?

From: Igor Cicimov 
<ig...@encompasscorporation.com<mailto:ig...@encompasscorporation.com>>
Date: Wednesday, March 16, 2016 at 5:01 PM
To: Zachary Punches <zpunc...@getcake.com<mailto:zpunc...@getcake.com>>
Cc: Baptiste <bed...@gmail.com<mailto:bed...@gmail.com>>, 
"haproxy@formilux.org<mailto:haproxy@formilux.org>" 
<haproxy@formilux.org<mailto:haproxy@formilux.org>>
Subject: Re: Help! HAProxy randomly failing health checks!



On Thu, Mar 17, 2016 at 10:55 AM, Zachary Punches 
<zpunc...@getcake.com<mailto:zpunc...@getcake.com>> wrote:
Thanks for the reply!

Ok so based on what you saw in my config, does it look like we’re misconfigured 
enough to cause this to happen?

If we were misconfigured, one would assume we would go down all the time yeah?

From: Igor Cicimov 
<ig...@encompasscorporation.com<mailto:ig...@encompasscorporation.com>>
Date: Wednesday, March 16, 2016 at 4:50 PM
To: Zachary Punches <zpunc...@getcake.com<mailto:zpunc...@getcake.com>>
Cc: Baptiste <bed...@gmail.com<mailto:bed...@gmail.com>>, 
"haproxy@formilux.org<mailto:haproxy@formilux.org>" 
<haproxy@formilux.org<mailto:haproxy@formilux.org>>
Subject: Re: Help! HAProxy randomly failing health checks!



On Thu, Mar 17, 2016 at 10:47 AM, Igor Cicimov 
<ig...@encompasscorporation.com<mailto:ig...@encompasscorporation.com>> wrote:


On Thu, Mar 17, 2016 at 5:29 AM, Zachary Punches 
<zpunc...@getcake.com<mailto:zpunc...@getcake.com>> wrote:
I’m not, these guys aren’t sitting behind an ELB. They sit behind route53 
routing. If one of the proxy boxes fails 3 checks in 30 seconds (with 4 checks 
done a second) then Route53 changes its routing from the first proxy box to the 
second




On 3/15/16, 9:46 PM, "Baptiste" <bed...@gmail.com<mailto:bed...@gmail.com>> 
wrote:

>Maybe you're checking a third party VM :)
>

AFAIK the Route 53 health checks come from different points around the globe, and 
it is possible that at some time of the day AWS has scheduled some specific 
endpoints to perform the HC. It is also possible that those have different SSL 
settings from the ones performing the HC during your daytime. I would suggest 
you bring up this issue with AWS support, let them know your SSL cipher 
settings in HAP, and ask if they are compatible with ALL their servers 
performing SSL health checks.

I personally haven't seen any issues with failed SSL handshakes coming from AWS 
servers, and I have HAPs running in the AU and UK regions.

Igor

That is, if you are absolutely sure that the failed handshakes are not caused by 
overload or misconfigured (system) settings on HAP.


I was saying this in regard to system (kernel) settings. For example, assuming 
Unix/Linux, is your net.core.somaxconn actually set *higher* than your maxconn, 
which is set to 3 and 15000 in HAP? Any other kernel settings you might 
have changed? (output of the "sysctl -p" command)

What is your peak hour load, how many connections/sessions are you seeing on 
each HAP?

Another suggestion is maybe set tune.ssl.default-dh-param to 1024 and see if 
that helps.
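
The somaxconn comparison suggested above can be sketched in a few lines of Python. This assumes a Linux host where the value is readable at `/proc/sys/net/core/somaxconn`. The helper names and the numbers in the usage note are illustrative and not taken from the poster's machines.

```python
from pathlib import Path

SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")  # Linux only


def read_somaxconn(path=SOMAXCONN_PATH):
    """Return the kernel's listen-backlog cap, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def backlog_warnings(somaxconn, maxconn_values):
    """List the maxconn settings that exceed net.core.somaxconn.

    maxconn_values: mapping of a label (e.g. "frontend") to its maxconn.
    """
    return [
        f"{name}={value} exceeds net.core.somaxconn={somaxconn}"
        for name, value in maxconn_values.items()
        if value > somaxconn
    ]
```

For example, `backlog_warnings(128, {"frontend": 15000})` returns one warning, while a somaxconn of 20000 returns none. Since net.core.somaxconn does not appear in the sysctl -p output quoted in this thread, the kernel default is probably in effect on those boxes, so reading the live value is the safer check.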


Re: Help! HAProxy randomly failing health checks!

2016-03-18 Thread Zachary Punches
Thanks for the reply!

Ok so based on what you saw in my config, does it look like we’re misconfigured 
enough to cause this to happen?

If we were misconfigured, one would assume we would go down all the time yeah?




Re: Help! HAProxy randomly failing health checks!

2016-03-18 Thread Zachary Punches
Ok! Here is a bunch of info that might better assist with the issue:


Each of our clients has an HAProxy install that forwards requests for 80 and 
443 to 1025 and 1026 respectively. These requests are forwarded over TCP using 
proxy protocol to our HAP instances.
Our HAP instances then SSL-terminate the requests and forward them off to our 
backend on port 80.

See attached diagram which should better explain the entire flow.

During an outage caused by the SSL handshakes failing, I was running tcpdump so I 
could look through the capture and find what was causing the failure. I 
discovered that we are receiving connection resets on some SSL connections. We 
then tested all the SSL certs from our client side to our side to verify that 
there is no mismatched cert. This test completed with no issues.

Here is a connection reset packet I found in that tcpdump capture:

29525  158.096217  10.1.4.119  54.239.21.251  TCP  54  38740 → 443 [RST] Seq=3533 Win=0 Len=0

Frame 29523: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Encapsulation type: Ethernet (1)
Arrival Time: Mar 17, 2016 14:58:07.34584 PDT
[Time shift for this packet: 0.0 seconds]
Epoch Time: 1458251887.34584 seconds
[Time delta from previous captured frame: 0.2 seconds]
[Time delta from previous displayed frame: 0.021655000 seconds]
[Time since reference or first frame: 158.096184000 seconds]
Frame Number: 29523
Frame Length: 54 bytes (432 bits)
Capture Length: 54 bytes (432 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ethertype:ip:tcp]
[Coloring Rule Name: TCP RST]
[Coloring Rule String: tcp.flags.reset eq 1]
Ethernet II, Src: 12:8d:18:05:0f:91 (12:8d:18:05:0f:91), Dst: 1e:8f:a6:6b:52:58 
(1e:8f:a6:6b:52:58)
Destination: 1e:8f:a6:6b:52:58 (1e:8f:a6:6b:52:58)
Address: 1e:8f:a6:6b:52:58 (1e:8f:a6:6b:52:58)
 ..1.     = LG bit: Locally administered address 
(this is NOT the factory default)
 ...0     = IG bit: Individual address (unicast)
Source: 12:8d:18:05:0f:91 (12:8d:18:05:0f:91)
Address: 12:8d:18:05:0f:91 (12:8d:18:05:0f:91)
 ..1.     = LG bit: Locally administered address 
(this is NOT the factory default)
 ...0     = IG bit: Individual address (unicast)
Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: $SRC IP Dst: $DST IP
0100  = Version: 4
 0101 = Header Length: 20 bytes
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
 00.. = Differentiated Services Codepoint: Default (0)
 ..00 = Explicit Congestion Notification: Not ECN-Capable Transport 
(0)
Total Length: 40
Identification: 0x5f56 (24406)
Flags: 0x02 (Don't Fragment)
0...  = Reserved bit: Not set
.1..  = Don't fragment: Set
..0.  = More fragments: Not set
Fragment offset: 0
Time to live: 64
Protocol: TCP (6)
Header checksum: 0x8018 [validation disabled]
[Good: False]
[Bad: False]
Source: $SourceIP
Destination: $DestinationIP
[Source GeoIP: Unknown]
[Destination GeoIP: Unknown]
Transmission Control Protocol, Src Port: 38740 (38740), Dst Port: 443 (443), 
Seq: 3533, Len: 0
Source Port: 38740
Destination Port: 443
[Stream index: 2799]
[TCP Segment Len: 0]
Sequence number: 3533    (relative sequence number)
Acknowledgment number: 0
Header Length: 20 bytes
Flags: 0x004 (RST)
000.   = Reserved: Not set
...0   = Nonce: Not set
 0...  = Congestion Window Reduced (CWR): Not set
 .0..  = ECN-Echo: Not set
 ..0.  = Urgent: Not set
 ...0  = Acknowledgment: Not set
  0... = Push: Not set
  .1.. = Reset: Set
[Expert Info (Warn/Sequence): Connection reset (RST)]
[Connection reset (RST)]
[Severity level: Warn]
[Group: Sequence]
  ..0. = Syn: Not set
  ...0 = Fin: Not set
[TCP Flags: *R**]
Window size value: 0
[Calculated window size: 0]
[Window size scaling factor: 128]
Checksum: 0x5c2f [validation disabled]
[Good Checksum: False]
[Bad Checksum: False]
Urgent pointer: 0
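
For reading the `Flags: 0x004 (RST)` field in the dump above, the TCP flag bits can be decoded with a short Python sketch. The bit positions are the standard TCP header flags, and the function name `decode_tcp_flags` is just illustrative.

```python
# Standard TCP flag bits, lowest bit first.
TCP_FLAG_BITS = [
    (0x001, "FIN"),
    (0x002, "SYN"),
    (0x004, "RST"),
    (0x008, "PSH"),
    (0x010, "ACK"),
    (0x020, "URG"),
    (0x040, "ECE"),
    (0x080, "CWR"),
    (0x100, "NS"),
]


def decode_tcp_flags(value):
    """Return the names of the TCP flags set in `value`."""
    return [name for bit, name in TCP_FLAG_BITS if value & bit]


print(decode_tcp_flags(0x004))  # the packet shown above
```

A value of 0x004 has only the RST bit set, with no ACK, which matches the `Acknowledgment: Not set` line in the dissection.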





From: Igor Cicimov 
<ig...@encompasscorporation.com<mailto:ig...@encompasscorporation.com>>
Date: Thursday, March 17, 2016 at 8:40 PM
To: Zachary Punches <zpunc...@getcake.com<mailto:zpunc...@getcake.com>>
Cc: Baptiste <bed...@gmail.com<mailto:bed...@gmail.com>>, 
"haproxy@formilux.org<mailto:haproxy@formilux.org>" 
<haproxy@formilux.org<mailto:haproxy@formilux.org>>
Subject: Re: Help! HAProxy randomly failing health checks!



On Fri, Mar 18, 2016 at 

Help! HAProxy randomly failing health checks!

2016-03-15 Thread Zachary Punches
Hello!

My name is Zack, and I have been in the middle of an ongoing HAProxy issue 
that has me scratching my head.

Here is the setup:

Our setup is hosted by Amazon, with HAProxy (1.6.3) boxes in 3 regions. We have 
2 HAProxy boxes per region, for a total of 6 proxy boxes.

Traffic is routed to these boxes through Route 53. Their entire job is to 
forward data from one of our clients to our database backend. They handle this 
absolutely fine, except between the hours of 7pm PST and 7am PST. During these 
hours, our Route 53 health checks time out, causing the traffic to switch to 
the other HAProxy box inside the same region.

During the other 12 hours of the day, we receive 0 alerts from our health 
checks.

I have noticed that we get a series of SSL handshake failures (though this 
happens throughout the entire day) that causes the server to hang for a second, 
thus causing the health checks to fail. During the day, our SSL failures do not 
cause the server to hang long enough to fail the checks; they only fail at 
night. I have attached my HAProxy config hoping that you guys have an answer 
for me. Lemme know if you need any more info.

I have done a few tcpdump captures during the SSL handshake failures (not at 
night while the checks are failing, but during the day when we still get the SSL 
handshake failures without failing the health check), and it seems there is a 
disconnect and a reconnect during the handshake.

Here is my config. I will be running a tcpdump tonight to capture the packets 
during the failure and will attach it if you guys need more info.

#---------------------------------------------------------------------
# Example configuration for a possible web application.  See the
# full configuration options online.
#
#   http://haproxy.1wt.eu/download/1.4/doc/configuration.txt
#
#---------------------------------------------------------------------

#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    log         127.0.0.1 local2

    pidfile     /var/run/haproxy.pid
    maxconn     3
    user        haproxy
    group       haproxy
    daemon
    ssl-default-bind-options no-sslv3 no-tls-tickets
    tune.ssl.default-dh-param 2048

    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    retries                 3
    timeout http-request    5s
    timeout queue           1m
    timeout connect         31s
    timeout client          31s
    timeout server          31s
    maxconn                 15000

    # Stats
    stats enable
    stats uri   /haproxy?stats
    stats realm Strictly\ Private
    stats auth  $StatsUser:$StatsPass

#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------

frontend shared_incoming
    maxconn 15000
    timeout http-request 5s

    # Bind ports of incoming traffic
    bind *:1025 accept-proxy # http
    bind *:1026 accept-proxy ssl crt /path/to/default/ssl/cert.pem ssl crt /path/to/cert/folder/ # https
    bind *:1027 # Health checking port
    acl gs_texthtml url_reg \/gstext\.html  ## allow gs to do meta tag verification
    acl gs_user_agent hdr_sub(User-Agent) -i globalsign  ## allow gs to do meta tag verification

    # Add headers
    http-request set-header $Proxy-Header-Ip %[src]
    http-request set-header $Proxy-Header-Proto http if !{ ssl_fc }
    http-request set-header $Proxy-Header-Proto https if { ssl_fc }

    # Route traffic based on domain
    use_backend gs_verify if gs_texthtml or gs_user_agent  ## allow gs meta tag verification
    use_backend %[req.hdr(host),lower,map_dom(/path/to/map/file.map,unknown_domain)]

    # Drop unrecognized traffic
    default_backend unknown_domain

#---------------------------------------------------------------------
# Backends
#---------------------------------------------------------------------

backend server0  ## added to allow gs ssl meta tag verification
    reqrep ^GET\ /.*\ (HTTP/.*) GET\ /GlobalSignVerification\ \1
    server server0_http server0.domain.com:80/GlobalSignVerification/

backend server1
    server server1_http server1.domain.net:80

backend server2
    server server2_http server2.domain.net:80

backend server3
    server server3_http server3.domain.net:80

backend server4
    server server4_http server4.domain.net:80

backend server5