I setup hoststated earlier this week to provide load balancing and
fail over for a few Linux web servers.  It went fairly smoothly,
except that one of the Linux machines only passed the 'check http "/"
code 200' test about 50% of the time.  Just using 'check tcp' worked
fine, and I saw the same results from another 4.2 box, and even from a
4.3-beta snapshot using relayd.  I could run 'curl -I http://host/' on
any of the OpenBSD boxes in a loop just fine, but the moment I started
hoststated/relayd, curl would start failing about 50% of the time too.
 The SYN packets were showing up in tcpdump on the Linux machine's
interface, but the kernel would just randomly refuse to respond.

Because curl ran fine right up until hoststated was started, I assumed
it was hoststated's fault for the longest time.  But after giving up
trying to find a bug there, I discovered that on the misbehaving Linux
box, the net.ipv4.tcp_tw_recycle=1 sysctl was enabled.

Apparently, this flag changes how Linux handles sockets in TIME-WAIT
state (and violates the TCP specification, according to the sparse
documentation), which I'm guessing doesn't play nicely with OpenBSD's
sequence number randomization.  It was originally set because one of
the database vendors we spoke with suggested a bunch of sysctl changes
for optimization (some necessary like fixing memory overcommit), but
also bad ones like that.  (Searching google shows a lot of hits for
people mindlessly suggesting to enable tcp_tw_recycle.)

Just thought I'd mention this in case it saves someone else the same
frustrating experience.

Reply via email to