Re: Help! HAProxy randomly failing health checks!
Hey guys, just following up. Still running into the issue. From: Zachary Punches mailto:zpunc...@getcake.com>> Date: Friday, March 18, 2016 at 6:07 PM To: Igor Cicimov mailto:ig...@encompasscorporation.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! Ok! Here is a bunch of info that might better assist with the issue: Each of our clients has an HAProxy install that forwards requests for 80 and 443 to 1025 and 1026 respectively. These requests are forwarded over TCP using proxy protocol to our HAP instances. Our HAP instances then SSL term the request and forward them off to our backend on port 80. See attached diagram which should better explain the entire flow. During an outage due to the SSL handshakes failing, I was running TCPDump so I could look through and discover what was causing the failure, I was able to discover that we are receiving connection resets on some SSL connections. We then tested all the SSL certs from our client side to our side to verify that there is not a mismatched cert. This test was completed with no issues. Here is a connection reset packet I found in that TCP Dump 29525 158.09621710.1.4.11954.239.21.251TCP5438740 → 443 [RST] Seq=3533 Win=0 Len=0 Frame 29523: 54 bytes on wire (432 bits), 54 bytes captured (432 bits) Encapsulation type: Ethernet (1) Arrival Time: Mar 17, 2016 14:58:07.34584 PDT [Time shift for this packet: 0.0 seconds] Epoch Time: 1458251887.34584 seconds [Time delta from previous captured frame: 0.2 seconds] [Time delta from previous displayed frame: 0.021655000 seconds] [Time since reference or first frame: 158.096184000 seconds] Frame Number: 29523 Frame Length: 54 bytes (432 bits) Capture Length: 54 bytes (432 bits) [Frame is marked: False] [Frame is ignored: False] [Protocols in frame: eth:ethertype:ip:tcp] [Coloring Rule Name: TCP RST] [Coloring Rule String: tcp.flags.reset eq 1] Ethernet II, Src: 12:8d:18:05:0f:91 (12:8d:18:05:0f:91), Dst: 1e:8f:a6:6b:52:58 (1e:8f:a6:6b:52:58) Destination: 1e:8f:a6:6b:52:58 (1e:8f:a6:6b:52:58) Address: 1e:8f:a6:6b:52:58 (1e:8f:a6:6b:52:58) ..1. = LG bit: Locally administered address (this is NOT the factory default) ...0 = IG bit: Individual address (unicast) Source: 12:8d:18:05:0f:91 (12:8d:18:05:0f:91) Address: 12:8d:18:05:0f:91 (12:8d:18:05:0f:91) ..1. = LG bit: Locally administered address (this is NOT the factory default) ...0 = IG bit: Individual address (unicast) Type: IPv4 (0x0800) Internet Protocol Version 4, Src: $SRC IP Dst: $DST IP 0100 = Version: 4 0101 = Header Length: 20 bytes Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT) 00.. = Differentiated Services Codepoint: Default (0) ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0) Total Length: 40 Identification: 0x5f56 (24406) Flags: 0x02 (Don't Fragment) 0... = Reserved bit: Not set .1.. = Don't fragment: Set ..0. = More fragments: Not set Fragment offset: 0 Time to live: 64 Protocol: TCP (6) Header checksum: 0x8018 [validation disabled] [Good: False] [Bad: False] Source: $SourceIP Destination: $DestinationIP [Source GeoIP: Unknown] [Destination GeoIP: Unknown] Transmission Control Protocol, Src Port: 38740 (38740), Dst Port: 443 (443), Seq: 3533, Len: 0 Source Port: 38740 Destination Port: 443 [Stream index: 2799] [TCP Segment Len: 0] Sequence number: 3533(relative sequence number) Acknowledgment number: 0 Header Length: 20 bytes Flags: 0x004 (RST) 000. = Reserved: Not set ...0 = Nonce: Not set 0... = Congestion Window Reduced (CWR): Not set .0.. = ECN-Echo: Not set ..0. = Urgent: Not set ...0 = Acknowledgment: Not set 0... = Push: Not set .1.. = Reset: Set [Expert Info (Warn/Sequence): Connection reset (RST)] [Connection reset (RST)] [Severity level: Warn] [Group: Sequence] ..0. = Syn: Not set ...0 = Fin: Not set [TCP Flags: *R**] Window size value: 0 [Calculated window size: 0] [Window size scaling factor: 128] Checksum: 0x5c2f [validation disabled] [Good Checksum: False] [Bad Checksum: False] Urgent pointer: 0 From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Thursday, Ma
Re: Help! HAProxy randomly failing health checks!
hake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60376 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 68.116.153.225:57824 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60365 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60364 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60362 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:37490 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 108.59.8.48:43566 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:59763 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:59760 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60319 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60299 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60293 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60292 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60284 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60282 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:38664 [17/Mar/2016:18:37:45.736] shared_incoming/2: SSL handshake failure Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60270 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:33270 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 208.100.26.237:33273 [17/Mar/2016:18:37:45.736] shared_incoming/2: Connection error during SSL handshake Mar 17 18:37:45 localhost haproxy[28703]: 54.218.28.138:60089 [17/Mar/2016:18:37:06.938] shared_incoming~ shared_incoming/ -1/-1/-1/-1/0 400 187 - - CR-- 314/314/0/0/0 0/0 "" Mar 17 18:37:45 localhost haproxy[28703]: 109.154.74.227:53964 [17/Mar/2016:18:37:06.938] shared_incoming shared_incoming/ -1/-1/-1/-1/0 400 0 - - CR-- 313/313/0/0/0 0/0 "" Mar 17 18:37:45 localhost haproxy[28703]: 66.87.151.25:3325 [17/Mar/2016:18:37:06.938] shared_incoming shared_incoming/ -1/-1/-1/-1/0 400 0 - - CR-- 312/312/0/0/0 0/0 "" Mar 17 18:37:45 localhost haproxy[28703]: 108.59.8.48:33611 [17/Mar/2016:18:36:55.938] shared_incoming provedmedia/provedmedia_http 279/0/0/-1/279 -1 0 - - CD-- 311/311/91/91/0 0/0 "GET /?a=61&c=22008&s1=7346_0_1&s2=1_0_0_0_0_2102824_0_571_61811_0_0 HTTP/1.1" From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 5:01 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 10:55 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: Thanks for the reply! Ok so based on what you saw in my config, does it look like we’re misconfigured enough to cause this to happen? If we were misconfigured, one would assume we would go down all the time yeah? From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 4:50 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 10:47 AM, Igor Cicimov mailto:ig...@encompasscorporation.com>> wrote: On Thu, Mar 17, 2016 at 5:29 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: I’m not, these guys aren’t sitting behind an ELB. They sit behind route53 routing. If one
Re: Help! HAProxy randomly failing health checks!
Looking into a TCPdump its looking like some SSL connections are getting reset. The code being (Warn/Sequence): Connection reset (RST) Is that helpful at all? From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 6:51 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 12:46 PM, Igor Cicimov mailto:ig...@encompasscorporation.com>> wrote: On Thu, Mar 17, 2016 at 11:14 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: I wanna say average is like 4-6 connections a second? Super minimal From what I’ve seen in the logs during the SSL errors, the log hangs then outputs a bunch of SSL errors all at once. Here it the output from sysctl –p net.ipv4.ip_forward = 0 net.ipv4.conf.default.rp_filter = 1 net.ipv4.conf.default.accept_source_route = 0 kernel.sysrq = 0 kernel.core_uses_pid = 1 net.ipv4.tcp_syncookies = 1 error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key error: "net.bridge.bridge-nf-call-iptables" is an unknown key error: "net.bridge.bridge-nf-call-arptables" is an unknown key kernel.msgmnb = 65536 kernel.msgmax = 65536 kernel.shmmax = 68719476736 kernel.shmall = 4294967296 What would lowering the tune.ssl.default-dh-param to 1024 do? From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 5:01 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 10:55 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: Thanks for the reply! Ok so based on what you saw in my config, does it look like we’re misconfigured enough to cause this to happen? If we were misconfigured, one would assume we would go down all the time yeah? From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 4:50 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 10:47 AM, Igor Cicimov mailto:ig...@encompasscorporation.com>> wrote: On Thu, Mar 17, 2016 at 5:29 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: I’m not, these guys aren’t sitting behind an ELB. They sit behind route53 routing. If one of the proxy boxes fails 3 checks in 30 seconds (with 4 checks done a second) then Route53 changes its routing from the first proxy box to the second On 3/15/16, 9:46 PM, "Baptiste" mailto:bed...@gmail.com>> wrote: >Maybe you're checking a third party VM :) > AFAIK the Route53 health checks come from different points around the globe and it is possible that at some time of the day AWS has scheduled some specific end points to perform the HC. And it is possible that those ones have different SSL settings from the ones performing the HC during your day time. I would suggest you bring up this issue with AWS support, let them know your SSL cypher settings in HAP and ask if they are compatible with ALL their servers performing SSL health checks. I personally haven't seen any issues with failed SSL handshakes coming from AWS servers and have HAP's running in AU and UK regions. Igor That is if you are absolutely sure that the failed handshakes are not caused by overload or misconfigured (system) settings on HAP I was saying this in regards to system (kernel) settings. For example, assuming Unix/Linux is your net.core.somaxconn actually set *higher* than your maxconn which is set to 3 and 15000 in HAP? Any other kernel settings you might have changed? (output of "sysctl -p" command) What is your pick hour load, how many connections/sessions are you seeing on each HAP? Another suggestion is maybe set tune.ssl.default-dh-param to 1024 and see if that helps. Ok, so on default ubuntu cloud instance this is what we have: # sysctl -a | grep maxconn net.core.somaxconn = 128 which is too low for production server. Check your value and adjust it to your needs. By the way, what is accept-proxy doing there in your setup? Is there anything else in front of HAP using PROXY protocol? Wait a minute: bind *:1027 # Health checking port are you maybe using https health check on a non SSL port?
Re: Help! HAProxy randomly failing health checks!
I went ahead and added the performance tuning you recommended (changing the maxconn to 1024). Hopefully this adds some stability As for the port, we’re using 1027 for our SSL traffic vs 443. We are currently getting SSL traffic that isn’t always failing on handshake. As for what is in front of our HAP: Our clients have HAP servers that are forwarding requests off to our HAP setup From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 6:51 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 12:46 PM, Igor Cicimov mailto:ig...@encompasscorporation.com>> wrote: On Thu, Mar 17, 2016 at 11:14 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: I wanna say average is like 4-6 connections a second? Super minimal From what I’ve seen in the logs during the SSL errors, the log hangs then outputs a bunch of SSL errors all at once. Here it the output from sysctl –p net.ipv4.ip_forward = 0 net.ipv4.conf.default.rp_filter = 1 net.ipv4.conf.default.accept_source_route = 0 kernel.sysrq = 0 kernel.core_uses_pid = 1 net.ipv4.tcp_syncookies = 1 error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key error: "net.bridge.bridge-nf-call-iptables" is an unknown key error: "net.bridge.bridge-nf-call-arptables" is an unknown key kernel.msgmnb = 65536 kernel.msgmax = 65536 kernel.shmmax = 68719476736 kernel.shmall = 4294967296 What would lowering the tune.ssl.default-dh-param to 1024 do? From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 5:01 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 10:55 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: Thanks for the reply! Ok so based on what you saw in my config, does it look like we’re misconfigured enough to cause this to happen? If we were misconfigured, one would assume we would go down all the time yeah? From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 4:50 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 10:47 AM, Igor Cicimov mailto:ig...@encompasscorporation.com>> wrote: On Thu, Mar 17, 2016 at 5:29 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: I’m not, these guys aren’t sitting behind an ELB. They sit behind route53 routing. If one of the proxy boxes fails 3 checks in 30 seconds (with 4 checks done a second) then Route53 changes its routing from the first proxy box to the second On 3/15/16, 9:46 PM, "Baptiste" mailto:bed...@gmail.com>> wrote: >Maybe you're checking a third party VM :) > AFAIK the Route53 health checks come from different points around the globe and it is possible that at some time of the day AWS has scheduled some specific end points to perform the HC. And it is possible that those ones have different SSL settings from the ones performing the HC during your day time. I would suggest you bring up this issue with AWS support, let them know your SSL cypher settings in HAP and ask if they are compatible with ALL their servers performing SSL health checks. I personally haven't seen any issues with failed SSL handshakes coming from AWS servers and have HAP's running in AU and UK regions. Igor That is if you are absolutely sure that the failed handshakes are not caused by overload or misconfigured (system) settings on HAP I was saying this in regards to system (kernel) settings. For example, assuming Unix/Linux is your net.core.somaxconn actually set *higher* than your maxconn which is set to 3 and 15000 in HAP? Any other kernel settings you might have changed? (output of "sysctl -p" command) What is your pick hour load, how many connections/sessions are you seeing on each HAP? Another suggestion is maybe set tune.ssl.default-dh-param to 1024 and see if that helps. Ok, so on default ubuntu cloud instance this is what we have: # sysctl -a | grep maxconn net.core.somaxconn = 128 which is too low for production server. Check your value and adjust it to your needs. By the way, what is accept-proxy doing there in your setup? Is there anything else in front of HAP using PROXY protocol? Wait a minute: bind *:1027 # Health checking port are you maybe using https health check on a non SSL port?
Re: Help! HAProxy randomly failing health checks!
Yeah port 1027 is used for health checks over SSL. This HAP forwards requests off to our databases. The databases have a string in a table that indicates that the HAP instance can move all the way through the entire process before it lights as green. Our health checks in route 53 are setup to ping 1027 as the SSL port From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Thursday, March 17, 2016 at 4:18 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! So is port 1027 used for health checks over SSL or not? I don't see any ssl settings on that port.
Re: Help! HAProxy randomly failing health checks!
I’m not, these guys aren’t sitting behind an ELB. They sit behind route53 routing. If one of the proxy boxes fails 3 checks in 30 seconds (with 4 checks done a second) then Route53 changes its routing from the first proxy box to the second On 3/15/16, 9:46 PM, "Baptiste" wrote: >Maybe you're checking a third party VM :) >
Re: Help! HAProxy randomly failing health checks!
I wanna say average is like 4-6 connections a second? Super minimal From what I’ve seen in the logs during the SSL errors, the log hangs then outputs a bunch of SSL errors all at once. Here it the output from sysctl –p net.ipv4.ip_forward = 0 net.ipv4.conf.default.rp_filter = 1 net.ipv4.conf.default.accept_source_route = 0 kernel.sysrq = 0 kernel.core_uses_pid = 1 net.ipv4.tcp_syncookies = 1 error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key error: "net.bridge.bridge-nf-call-iptables" is an unknown key error: "net.bridge.bridge-nf-call-arptables" is an unknown key kernel.msgmnb = 65536 kernel.msgmax = 65536 kernel.shmmax = 68719476736 kernel.shmall = 4294967296 What would lowering the tune.ssl.default-dh-param to 1024 do? From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 5:01 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 10:55 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: Thanks for the reply! Ok so based on what you saw in my config, does it look like we’re misconfigured enough to cause this to happen? If we were misconfigured, one would assume we would go down all the time yeah? From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 4:50 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 10:47 AM, Igor Cicimov mailto:ig...@encompasscorporation.com>> wrote: On Thu, Mar 17, 2016 at 5:29 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: I’m not, these guys aren’t sitting behind an ELB. They sit behind route53 routing. If one of the proxy boxes fails 3 checks in 30 seconds (with 4 checks done a second) then Route53 changes its routing from the first proxy box to the second On 3/15/16, 9:46 PM, "Baptiste" mailto:bed...@gmail.com>> wrote: >Maybe you're checking a third party VM :) > AFAIK the Route53 health checks come from different points around the globe and it is possible that at some time of the day AWS has scheduled some specific end points to perform the HC. And it is possible that those ones have different SSL settings from the ones performing the HC during your day time. I would suggest you bring up this issue with AWS support, let them know your SSL cypher settings in HAP and ask if they are compatible with ALL their servers performing SSL health checks. I personally haven't seen any issues with failed SSL handshakes coming from AWS servers and have HAP's running in AU and UK regions. Igor That is if you are absolutely sure that the failed handshakes are not caused by overload or misconfigured (system) settings on HAP I was saying this in regards to system (kernel) settings. For example, assuming Unix/Linux is your net.core.somaxconn actually set *higher* than your maxconn which is set to 3 and 15000 in HAP? Any other kernel settings you might have changed? (output of "sysctl -p" command) What is your pick hour load, how many connections/sessions are you seeing on each HAP? Another suggestion is maybe set tune.ssl.default-dh-param to 1024 and see if that helps.
Re: Help! HAProxy randomly failing health checks!
Thanks for the reply! Ok so based on what you saw in my config, does it look like we’re misconfigured enough to cause this to happen? If we were misconfigured, one would assume we would go down all the time yeah? From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Wednesday, March 16, 2016 at 4:50 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Thu, Mar 17, 2016 at 10:47 AM, Igor Cicimov mailto:ig...@encompasscorporation.com>> wrote: On Thu, Mar 17, 2016 at 5:29 AM, Zachary Punches mailto:zpunc...@getcake.com>> wrote: I’m not, these guys aren’t sitting behind an ELB. They sit behind route53 routing. If one of the proxy boxes fails 3 checks in 30 seconds (with 4 checks done a second) then Route53 changes its routing from the first proxy box to the second On 3/15/16, 9:46 PM, "Baptiste" mailto:bed...@gmail.com>> wrote: >Maybe you're checking a third party VM :) > AFAIK the Route53 health checks come from different points around the globe and it is possible that at some time of the day AWS has scheduled some specific end points to perform the HC. And it is possible that those ones have different SSL settings from the ones performing the HC during your day time. I would suggest you bring up this issue with AWS support, let them know your SSL cypher settings in HAP and ask if they are compatible with ALL their servers performing SSL health checks. I personally haven't seen any issues with failed SSL handshakes coming from AWS servers and have HAP's running in AU and UK regions. Igor That is if you are absolutely sure that the failed handshakes are not caused by overload or misconfigured (system) settings on HAP
Re: Help! HAProxy randomly failing health checks!
Ok! Here is a bunch of info that might better assist with the issue: Each of our clients has an HAProxy install that forwards requests for 80 and 443 to 1025 and 1026 respectively. These requests are forwarded over TCP using proxy protocol to our HAP instances. Our HAP instances then SSL term the request and forward them off to our backend on port 80. See attached diagram which should better explain the entire flow. During an outage due to the SSL handshakes failing, I was running TCPDump so I could look through and discover what was causing the failure, I was able to discover that we are receiving connection resets on some SSL connections. We then tested all the SSL certs from our client side to our side to verify that there is not a mismatched cert. This test was completed with no issues. Here is a connection reset packet I found in that TCP Dump 29525 158.09621710.1.4.119 54.239.21.251TCP 5438740 → 443 [RST] Seq=3533 Win=0 Len=0 Frame 29523: 54 bytes on wire (432 bits), 54 bytes captured (432 bits) Encapsulation type: Ethernet (1) Arrival Time: Mar 17, 2016 14:58:07.34584 PDT [Time shift for this packet: 0.0 seconds] Epoch Time: 1458251887.34584 seconds [Time delta from previous captured frame: 0.2 seconds] [Time delta from previous displayed frame: 0.021655000 seconds] [Time since reference or first frame: 158.096184000 seconds] Frame Number: 29523 Frame Length: 54 bytes (432 bits) Capture Length: 54 bytes (432 bits) [Frame is marked: False] [Frame is ignored: False] [Protocols in frame: eth:ethertype:ip:tcp] [Coloring Rule Name: TCP RST] [Coloring Rule String: tcp.flags.reset eq 1] Ethernet II, Src: 12:8d:18:05:0f:91 (12:8d:18:05:0f:91), Dst: 1e:8f:a6:6b:52:58 (1e:8f:a6:6b:52:58) Destination: 1e:8f:a6:6b:52:58 (1e:8f:a6:6b:52:58) Address: 1e:8f:a6:6b:52:58 (1e:8f:a6:6b:52:58) ..1. = LG bit: Locally administered address (this is NOT the factory default) ...0 = IG bit: Individual address (unicast) Source: 12:8d:18:05:0f:91 (12:8d:18:05:0f:91) Address: 12:8d:18:05:0f:91 (12:8d:18:05:0f:91) ..1. = LG bit: Locally administered address (this is NOT the factory default) ...0 = IG bit: Individual address (unicast) Type: IPv4 (0x0800) Internet Protocol Version 4, Src: $SRC IP Dst: $DST IP 0100 = Version: 4 0101 = Header Length: 20 bytes Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT) 00.. = Differentiated Services Codepoint: Default (0) ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0) Total Length: 40 Identification: 0x5f56 (24406) Flags: 0x02 (Don't Fragment) 0... = Reserved bit: Not set .1.. = Don't fragment: Set ..0. = More fragments: Not set Fragment offset: 0 Time to live: 64 Protocol: TCP (6) Header checksum: 0x8018 [validation disabled] [Good: False] [Bad: False] Source: $SourceIP Destination: $DestinationIP [Source GeoIP: Unknown] [Destination GeoIP: Unknown] Transmission Control Protocol, Src Port: 38740 (38740), Dst Port: 443 (443), Seq: 3533, Len: 0 Source Port: 38740 Destination Port: 443 [Stream index: 2799] [TCP Segment Len: 0] Sequence number: 3533(relative sequence number) Acknowledgment number: 0 Header Length: 20 bytes Flags: 0x004 (RST) 000. = Reserved: Not set ...0 = Nonce: Not set 0... = Congestion Window Reduced (CWR): Not set .0.. = ECN-Echo: Not set ..0. = Urgent: Not set ...0 = Acknowledgment: Not set 0... = Push: Not set .1.. = Reset: Set [Expert Info (Warn/Sequence): Connection reset (RST)] [Connection reset (RST)] [Severity level: Warn] [Group: Sequence] ..0. = Syn: Not set ...0 = Fin: Not set [TCP Flags: *R**] Window size value: 0 [Calculated window size: 0] [Window size scaling factor: 128] Checksum: 0x5c2f [validation disabled] [Good Checksum: False] [Bad Checksum: False] Urgent pointer: 0 From: Igor Cicimov mailto:ig...@encompasscorporation.com>> Date: Thursday, March 17, 2016 at 8:40 PM To: Zachary Punches mailto:zpunc...@getcake.com>> Cc: Baptiste mailto:bed...@gmail.com>>, "haproxy@formilux.org<mailto:haproxy@formilux.org>" mailto:haproxy@formilux.org>> Subject: Re: Help! HAProxy randomly failing health checks! On Fri, Mar 18, 2016 at 1:38 PM, Igor Cicimov mailto:ig...@encompasscorporation.com>> wrote: On Fri, Mar 18, 2016 at 12:04 PM, Zachary
Help! HAProxy randomly failing health checks!
Hello! My name is Zack, and I have been in the middle of an on going HAProxy issue that has me scratching my head. Here is the setup: Our setup is hosted by amazon, and our HAProxy (1.6.3) boxes are in each region in 3 regions. We have 2 HAProxy boxes per region for a total of 6 proxy boxes. These boxes are routed information through route 53. Their entire job is to forward data from one of our clients to our database backend. It handles this absolutely fine, except between the hours of 7pm PST and 7am PST. During these hours, our route53 health checks time out thus causing the traffic to switch to the other HAProxy box inside of the same region. During the other 12 hours of the day, we receive 0 alerts from our health checks. I have noticed that we get a series of SSL handshake failures (though this happens throughout the entire day) that causes the server to hang for a second, thus causing the health checks to fail. During the day our SSL failures do not cause the server to hang long enough to go fail the checks, they only fail at night. I have attached my HAProxy config hoping that you guys have an answer for me. Lemme know if you need any more info. I have done a few tcpdump captures during the SSL handshake failures (not at night during it failing, but during the day when it still gets the SSL handshake failures, but doesn’t fail the health check) and it seems there is a d/c and a reconnect during the handshake. Here is my config, I will be running a tcpdump tonight to capture the packets during the failure and will attach it if you guys need more info. #- # Example configuration for a possible web application. See the # full configuration options online. # # http://haproxy.1wt.eu/download/1.4/doc/configuration.txt # #- #- # Global settings #- global log 127.0.0.1 local2 pidfile /var/run/haproxy.pid maxconn 3 userhaproxy group haproxy daemon ssl-default-bind-options no-sslv3 no-tls-tickets tune.ssl.default-dh-param 2048 # turn on stats unix socket #stats socket /var/lib/haproxy/stats` #- # common defaults that all the 'listen' and 'backend' sections will # use if not designated in their block #- defaults modehttp log global option httplog retries 3 timeout http-request5s timeout queue 1m timeout connect 31s timeout client 31s timeout server 31s maxconn 15000 # Stats statsenable stats uri /haproxy?stats stats realm Strictly\ Private stats auth$StatsUser:$StatsPass #- # main frontend which proxys to the backends #- frontend shared_incoming maxconn 15000 timeout http-request 5s #Bind ports of incoming traffic bind *:1025 accept-proxy # http bind *:1026 accept-proxy ssl crt /path/to/default/ssl/cert.pem ssl crt /path/to/cert/folder/ # https bind *:1027 # Health checking port acl gs_texthtml url_reg \/gstext\.html## allow gs to do meta tag verififcation acl gs_user_agent hdr_sub(User-Agent) -i globalsign## allow gs to do meta tag verififcation # Add headers http-request set-header $Proxy-Header-Ip %[src] http-request set-header $Proxy-Header-Proto http if !{ ssl_fc } http-request set-header $Proxy-Header-Proto https if { ssl_fc } # Route traffic based on domain use_backend gs_verify if gs_texthtml or gs_user_agent## allow gs meta tag verification use_backend %[req.hdr(host),lower,map_dom(/path/to/map/file.map,unknown_domain)] # Drop unrecognized traffic default_backend unknown_domain #- # Backends #- backend server0 ## added to allow gs ssl meta tag verification reqrep ^GET\ /.*\ (HTTP/.*)GET\ /GlobalSignVerification\ \1 server server0_http server0.domain.com:80/GlobalSignVerification/ backend server1 server server1_http server1.domain.net:80 backend server2 server server2_http server2.domain.net:80 backend server3 server server3_http server3.domain.net:80 backend server4 server server4_http server4.domain.net:80 backend server5 se