Hi Pawel,

On Thu, May 21, 2015 at 01:04:42PM -0700, Pawel Veselov wrote:
> Willy, Lucas, thank you so much for analyzing my configs and your help.
> 
> We did find out what was wrong.
> 
> A long time ago we added 'option nolinger' to the defaults section. This
> was figured out by trial and error, and that option, on 1.4, served us well
> to the point of us forgetting about it.

It should not be used in the defaults section since it will also affect the
frontend, and will result in truncated responses for distant clients due to
resets being sent while some data remain unacked.
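If the server side really benefits from it, it can be restricted to the
backend sections, where it only applies to the connections to the servers.
A minimal sketch (section and server names are made up):

    defaults
        mode http
        # no "option nolinger" here: in defaults it also applies to
        # frontends and can truncate responses to distant clients

    backend app_servers
        option nolinger        # resets only the server-side connections
        server app1 10.0.0.1:8080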

> When we were trying out 1.5, we started
> experiencing phantom RSTs that, of course, were happening because of
> nolinger. At which point we said "nolinger is evil and should not be used",
> and took it out altogether.

Exactly!

> The problems that we were experiencing after
> the version update didn't remind us of the time when we had to put that
> nolinger in to combat exactly the same symptoms. We rolled the haproxy
> version back, and of course it didn't make things any better until we put
> the nolinger back in. I think that in 1.4, nolinger in the defaults section
> only affected the backend side, and in 1.5 it started applying to both.

I don't recall nolinger ever not affecting the frontend in 1.4, since the
option was initially introduced precisely for the frontend, to work around
a client bug. It is possible, however, that a bug in 1.4 made it ineffective
there. Or maybe the reset was simply emitted *after* the close response from
the client, making it useless.

> I honestly don't quite understand what's wrong with our setup except that:
> - it's not the application per se, because the application processes the
> same volume of requests just fine (when nolinger is turned on)
> - it's either the network framework (J2EE) settings or kernel settings on
> the application side that are at fault

I think you're in a situation where the application doesn't close and lets
the client close first. Over TCP, a client must never ever close first,
because that leaves the TIME_WAIT on the client side and prevents it from
reusing its source ports for 1-2 minutes depending on the OS. That would
explain why you needed "nolinger", since it forces a TCP RST to the server
as well. And given that you were running with "option httpclose", which is
the passive mode, haproxy didn't take part in the connection close.

Thus if you used "option http-server-close" or even "option forceclose",
haproxy would actively close the connection itself, sending an RST to the
server but a clean FIN to the client.
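For example, a minimal sketch of this active-close setup (a made-up
defaults section, to adapt to your configuration):

    defaults
        mode http
        option http-server-close   # haproxy actively closes the
                                   # server-side connection itself
        # option forceclose        # alternative: actively close both
        #                          # directions after each response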

> The failure comes when the session rate reaches some critical volume.

Yes, and I can even give you that session rate: #source ports / TW timeout.
With the default 28000 ports and a 60s timeout, that's a bit less than 500
sessions per second (28000 / 60 = ~466) before all source ports are in
TIME_WAIT and cannot be reused.
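On Linux you can check both numbers with something like this (a quick
sketch, assuming default kernel settings):

    $ sysctl net.ipv4.ip_local_port_range   # e.g. "32768 61000" = ~28000 ports
    $ ss -tn state time-wait | wc -l        # sockets currently in TIME_WAIT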

> At
> that point we see a sharp rise of concurrent sessions (sockets in ESTAB
> state rise the same), then a sharp rise of sockets in SYN-SENT state,

If you see SYN-SENT on the client, it can mean there is a firewall between
haproxy and the server which prevents connections from establishing when
they reuse the same source ports. Normally such bogus firewalls are rarely
found anymore. However, a few of them wrongly randomize source ports and/or
sequence numbers, and this breaks TCP: when the transformed SYN reaches the
server, it doesn't match the previous session, so the server cannot recycle
it. One firewall was known for this, the Cisco FWSM. However, having TCP
timestamps enabled on both haproxy and the server worked around it. Another
solution consisted in fixing the firewall using "no normalization" to
disable the bogus behaviour.
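On Linux, the timestamps setting can be checked with the sysctl below; it
must report 1 on both the haproxy machine and the server for the workaround
to apply (other OSes have their own equivalent):

    $ sysctl net.ipv4.tcp_timestamps        # 1 = TCP timestamps enabled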

> and then the
> whole TCP stack stalls (no impact on existing TCP connections though, like
> any open SSH shells are not affected in the slightest). There aren't really
> any problems with unfinished sockets (high numbers of sockets are only in
> ESTAB, SYN-SENT and TIME-WAIT states), there is a moderate (few thousand)
> amount of sockets in client-side TIME_WAIT, but nowhere close to numbers
> that would cause a problem.

Except if they consume all source ports :-)
Where do you see them, between the client and haproxy (normal and harmless)
or between haproxy and the server?
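For instance, a rough way to tell them apart, assuming haproxy listens on
port 80 and the servers use port 8080 (made-up ports to adapt):

    $ ss -tn state time-wait '( sport = :80 )'    # client side: harmless
    $ ss -tn state time-wait '( dport = :8080 )'  # server side: eats source ports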

> So, the stall really occurs on the network side of the application, in turn
> causing haproxy to stall. I'll see if I can eventually create steps to
> reproduce that stall in a simple environment, at least to figure out if
> it's the haproxy queue that gets stalled, or the whole TCP stack of the
> machine
> where haproxy is running, but I don't understand how that scenario would
> cause the haproxy machine to stop promptly responding to incoming SYNs. I
> don't think the problem is maxing out maxconn, because our connect timeout
> is 3s (no explicit queue timeout), and there isn't even a 503 response, and
> attempting to connect through haproxy during the stall times out way longer
> than 3s.

The fact that you see the SYN leave the machine makes me think that the
TIME_WAITs are not local but are causing trouble on their way to the server.
If you can run tcpdump on the server it will help a lot: if in response to a
SYN you see an ACK instead of a SYN-ACK, it means the SYN was mangled in
the middle by something (changed source port or sequence number). Having
haproxy actively send the RST will work around the issue, but keep in mind
that once in a while an RST will be lost and the faulty component in the
middle will then reject a connection setup. Please double-check the support
for TCP timestamps, as having them is a much more reliable workaround in
this situation (then you're certain that even lost RSTs will not cause
trouble).
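A capture sketch for the server side (interface, address and port are
assumptions to adapt):

    $ tcpdump -pni eth0 'port 8080 and host 10.0.0.2'
    # look for a SYN answered with a pure ACK instead of a SYN-ACK:
    # that's the sign the SYN was mangled somewhere in the middle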

Cheers,
Willy

