On Mon, Apr 24, 2017 at 08:33:14PM +0800, jaseywang wrote:
> Before 08:24:30 PM, the data is normal, after that I began the benchmark,
> here is data:
> https://www.dropbox.com/s/yu2jsirv104ytwp/1.7.5-tcp-2?dl=0
> https://www.dropbox.com/s/8dap4kus5sj6by7/1.7.5-sess-2?dl=0
These ones also show tons of expired tasks marked "running" and still not
processed. Something is preventing them from leaving the run queue as fast
as possible. Since we process at most 200 tasks per round in the run queue
before going back to the polling loop, I can see only two possible
explanations: either the poll() loop forces you to wait for one second even
when tasks are ready, which could be explained by the recently encountered
integer overflow bug making task_is_expired() fail in the poll loop, or you
have some tasks stuck running in the run queue.

> Below is the data during benchmark:
> maxsock = 1000036; maxconn = 500000; maxpipes = 0
> current conns = 7488; current pipes = 0/0; conn rate = 322/sec
> Running tasks: 7366/7449; idle = 0 %

This one tends to rule out the task_is_expired() bug, because the idle time
is null, so you're running at 100% CPU. But 100% CPU for 322 conn/s seems
almost impossible to me. Even my ALOHA-pocket, powered over USB and running
on a 400 MHz fanless MIPS, does 3 to 5 times more.

So in short, the problem is that the CPU is saturated processing a very
light load, so much so that it even has difficulty actually closing the
sessions at the end of the transfers! I think you'll have to run haproxy
through a profiler like gprof to find where the CPU time is spent; more on
that below.

> > Second, for an unknown reason, the source address reported in "show sess"
> > backends shows the CDN instead of haproxy. I double-checked the code to
> > be sure and it's the correct address that is reported. Are you using some
> > form of transparent proxying that didn't appear in your first config,
> > like "usesrc clientip"?
>
> Sorry, it's my fault, I replaced the confidential IP by mistake, it should
> be the haproxy IP.

OK, no problem.

> > Third, you have countless SYN_RECV on port 443. There can be two
> > explanations:
> >   - too small a socket backlog
> >   - high loss rate between the CDN and your haproxy.
> Even when I set the backlog as large as 40000, it still soon overflowed.

With a saturated CPU, I'm not surprised.

> The latency between CDN and haproxy is around 33 ms, no packet loss after
> sending 1000+ ICMP packets.

OK, at 2k conn/s that gives around 60 connections constantly in SYN_RECV,
which matches what I saw, so no alarm here.

> > Regardless, I'm seeing one strange thing. There are a lot of "state=DIS"
> > in the output. This is a transient state which should not remain for a
> > long time; it's used to report an imminent close. It indicates that the
> > client-facing connection was just terminated, but the close was not yet
> > executed. Strangely, these tasks are present in the run queue but were
> > delayed. Would you happen to build with Clang? We've faced an integer
> > overflow bug on very recent versions.
>
> I installed the 1.7.5 version via this rpm spec:
> https://github.com/DBezemer/rpm-haproxy

Hmmm, OK. Then please rebuild from the same sources with these options
(derived from the package above, plus a few to help debugging):

  $ make CPU="generic" TARGET="linux2628" USE_OPENSSL=1 USE_PCRE=1 \
         USE_ZLIB=1 USE_LINUX_TPROXY=1 DEFINE=-DTCP_USER_TIMEOUT=18 \
         ARCH_FLAGS="-pg" CPU_CFLAGS.generic="-O2 -fwrapv"

And use the resulting executable instead. Be sure to run it from a writable
directory (/tmp, /dev/shm, etc). After your test, kill it softly using
"pkill -USR1 haproxy". It will dump a "gmon.out" file that you can then
inspect using gprof:

  $ gprof haproxy gmon.out

The time spent in functions isn't very accurate for short calls, but it
gives a good overall idea. The number of calls is very useful.

Willy