Hi,

On Fri, Nov 20, 2015 at 04:11:48PM +0800, baiyang wrote:
> Sorry for send it again, mail.formilux.org told me my attachment is too 
> large: /var/spool/mail/archives-haproxy: error writing message: File too large
> So I striped it.

Thanks. I don't remember the file size limit but the list delivers to about
1000 persons so for sure we have to set sane limits.

> Unfortunately, we have been downgraded to 1.5.14, but the issue still there:
> HA-Proxy version 1.5.14 2015/07/02
> Copyright 2000-2015 Willy Tarreau <wi...@haproxy.org>

Ah interesting, at least it makes more sense!

> Build options :
(...)

I'm not seeing anything wrong there.

> # uname -a
> Linux WD-G0-SRP1 3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 
> x86_64 x86_64 x86_64 GNU/Linux

This one seems to have missed 3 years of bugfixes but anyway I don't see
how any kernel bug could make haproxy fail, and if it did we'd have to
find a workaround anyway (or a way to detect it).

> Again, there are some 15 minutes timeout records at the end of the spike:
> Nov 20 11:04:56 WD-G0-SRP1 haproxy[28443]: 125.71.115.147:15351 
> [20/Nov/2015:10:48:43.099] http-in WebSvr/ws0 17791/0/0/3/973281 200 251352 - 
> - CD-- 51/7/1/0/0 0/0 "GET /whitedolphin/wd.dbc.pdf HTTP/1.1"
> ...
> Nov 20 11:05:14 WD-G0-SRP1 haproxy[28443]: 125.71.115.147:15352 
> [20/Nov/2015:10:48:43.106] http-in WebSvr/ws0 17550/0/0/3/991660 200 405432 - 
> - CD-- 50/6/0/0/0 0/0 "GET /whitedolphin/wd.overview.pdf HTTP/1.1"
> 
> But we also find a 15 minutes timeout at the early stage of the spike:
> Nov 20 10:52:31 WD-G0-SRP1 haproxy[28443]: 125.71.115.147:15231 
> [20/Nov/2015:10:36:10.183] https-in~ g0n0/n0 52555/0/0/1/981749 200 138069 - 
> - CD-- 29/24/11/11/0 0/0 "GET /mobile/lib/sencha-touch-all.js?v=1 HTTP/1.1"

Note, it's not a timeout in haproxy, the client closed first ('CD').
I find this very strange here because the request time is high,
indicating a reused keep-alive connection. If the total time is that
long, it probably means a low transfer rate, which doesn't make much
sense. This doesn't match an idle connection because idle connections
wouldn't produce a log. Or we're facing an asynchronous error that is
reported once we wake-up due to the timeout, though I have no idea when
this could happen.

Anyway your logs are very useful because they make me scratch my head.

> Again, We have been reviewed all log records involved in this case, and
> nothing special there, they are all working fine.

OK.

> I suspect if the connection is broken by a link level or physical level
> interruption (e.g.: 4G signal lost, Unplug the network cable, etc.), or other
> cases cause the Four-way handshake or even the RST signal could not to be
> sent.

I suspect it has something to do with the very long timeout. Maybe
we're facing an error while closing on timeout and we're looping on
this. I find this quite strange but given that we had no such report
in more than one year of 1.5 deployed everywhere, I'm trying to find
what could cause this. Network timeouts are fairly common these days
with mobile devices everywhere. Long connections are common as well
in certain environments (websocket, RDP, etc).

When the problem happens, could you please try to dump the output
of "show sess all" sent on the CLI ? Typically :

    $ echo "show sess all" | socat - /var/run/haproxy.stat > all-sess.txt

(or wherever you put it).

Be careful, the list can be huge, no need to send it to the mailing list.
It may also disclose some IP information that you don't necessarily want
to disclose publicly. I'll be interested in matching this output against
the log line indicating the issue. The session's flags will greatly help.

> Then some infinity look (e.g.: spin lock?) may be triggered within the
> haproxy's timeout thread until it is surely timed out?

No, there's no lock, no thread or any such thing here. The fact that it
continues to work indicates a missed event that we attempt to process
and that cannot be processed.

Note, I remember having seen a kernel bug affecting epoll a long time
ago where a deleted even was still reported. It was a problem because
when epoll_wait() reports a non-existing file descriptor, you try to
close it but you can't, as it doesn't exist anymore. So that makes me
think that the output of one second of strace when the trouble appears
will also help :

  $ strace -ttT -o haproxy-trace.txt -p $(pidof haproxy)
  <wait one second>
  [Ctrl-C]
  
You'll have to press Ctrl-C to stop it once you run it. It can be large
as well if the process is spinning. But it will also tell us a lot of
things.

Best regards,
Willy


Reply via email to