Re: dev1.9 2018/06/05 threads cpu 100% spin_lock v.s. thread_sync_barrier

2018-06-14 Thread Willy Tarreau
Hi Pieter,

On Tue, Jun 12, 2018 at 08:09:08PM +0200, PiBa-NL wrote:
> Is there something I can do to find out more info if it happens again? Or
> maybe, before that, build with more specific debug info so that if it happens
> again more info would be retrievable?
> (My usual way of 'debugging' the relatively easy-to-reproduce issues is
> just cramming a lot of printf statements into the code until something makes
> sense, but with an issue that only happens once in a blue moon (so far), that
> doesn't work very well...)

Building with -DDEBUG_THREAD and -DDEBUG_MEMORY can help produce a more
exploitable core file, which will reveal where a given lock was already
taken. Be careful though: it will slightly increase the CPU usage.
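As a rough sketch (assuming a source checkout and `gmake` on FreeBSD; the
exact TARGET and USE_* options should match what your current `haproxy -vv`
reports), the rebuild could look like:

```shell
# Rebuild with extra thread/memory debugging; the flags go into the
# Makefile's DEBUG variable, which is appended to CFLAGS.
gmake TARGET=freebsd USE_OPENSSL=1 USE_LUA=1 \
      DEBUG="-DDEBUG_THREAD -DDEBUG_MEMORY"
```

With -DDEBUG_THREAD, the lock structures carry extra bookkeeping, which is
what makes the resulting core file more exploitable.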

> For the moment it hasn't happened again. I suspend/resume the VM it happened
> on almost daily (it's running on my workstation, which is shut down
> overnight); the VM also has haproxy running on it and some other stuff..
> 
> If it does happen again, would any other gdb information be helpful?
> Inspecting specific variables, or creating a memory dump, or something?
> (Please give a hint about the command/option to call if applicable/possible;
> I'm still a rookie with gdb..) P.S. I'm on FreeBSD, not sure if that matters
> for some of gdb's available options or something..

Creating a dump is often quite useful. You can do it with the gdb command
"generate-core-file". A "bt full" and "info threads" will also be needed,
as you provided last time. I cannot guide you through the commands used to
debug the threads however, because I don't know them; I think I remember
you need to print the contents of the lock itself (eg: "p foo->lock").
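Put together, a capture session after attaching gdb to the stuck process
could look like this sketch (the cast and the `->lock` member name are
assumptions for illustration; the address is the server pointer from your
earlier backtrace):

```gdb
(gdb) info threads                      # list threads and their current frames
(gdb) thread apply all bt full          # full backtrace with locals, all threads
(gdb) generate-core-file haproxy.core   # save a core dump for offline analysis
(gdb) p ((struct server *)0x8024dec00)->lock   # print a suspect lock's contents
```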

> Would the complete configuration be helpful? There is a lot of useless
> stuff in there, because it's my test/development VM, and because it hasn't
> happened again there is no good opportunity to shrink it down to
> specific options/parts..

Well, probably not, though keeping it unmodified, together with the unstripped
executable and the upcoming core, may be useful afterwards.

Thanks!
Willy



Re: dev1.9 2018/06/05 threads cpu 100% spin_lock v.s. thread_sync_barrier

2018-06-12 Thread PiBa-NL

Hi Willy,

On 12-6-2018 at 14:31, Willy Tarreau wrote:

This one is not known yet, to the best of my knowledge, or at least
not reported yet.

Okay :) I guess I'll keep an eye on whether it happens again.

Is there something I can do to find out more info if it happens again? 
Or maybe, before that, build with more specific debug info so that if it 
happens again more info would be retrievable?
(My usual way of 'debugging' the relatively easy-to-reproduce issues is 
just cramming a lot of printf statements into the code until something 
makes sense, but with an issue that only happens once in a blue moon 
(so far), that doesn't work very well...)


For the moment it hasn't happened again. I suspend/resume the VM it 
happened on almost daily (it's running on my workstation, which is shut 
down overnight); the VM also has haproxy running on it and some other 
stuff..


If it does happen again, would any other gdb information be helpful? 
Inspecting specific variables, or creating a memory dump, or something? 
(Please give a hint about the command/option to call if 
applicable/possible; I'm still a rookie with gdb..) P.S. I'm on FreeBSD, 
not sure if that matters for some of gdb's available options or something..


Would the complete configuration be helpful? There is a lot of useless 
stuff in there, because it's my test/development VM, and because it 
hasn't happened again there is no good opportunity to shrink it down 
to specific options/parts..


Thanks for your analysis; I'm not sure what to do with the info at the 
moment, but I do hope you guys can together find a likely culprit from 
the information given ;).


Regards,
PiBa-NL (Pieter)



Re: dev1.9 2018/06/05 threads cpu 100% spin_lock v.s. thread_sync_barrier

2018-06-12 Thread Willy Tarreau
Hi Pieter,

On Mon, Jun 11, 2018 at 10:48:25PM +0200, PiBa-NL wrote:
> Hi List,
> 
> I've got no clue how I got into this state ;) and maybe there is nothing
> wrong.. (well, I did resume a VM that was suspended for half a day..)
> 
> Still, I thought it might be worth reporting, or perhaps it's solved already,
> as there are a few fixes for threads after the 6-6 snapshot that I built with..
> Sometimes all that some people need is half an idea to find a problem... So
> maybe there is something that needs fixing??

This one is not known yet, to the best of my knowledge, or at least
not reported yet.

> (gdb) info threads
>   Id   Target Id Frame
> * 1    LWP 100660 of process 56253 0x005b0202 in thread_sync_barrier
> (barrier=0x8bc690 ) at src/hathreads.c:109
>   2    LWP 101036 of process 56253 0x0050874a in process_chk_conn
> (t=0x8025187c0, context=0x802482610, state=33) at src/checks.c:2112
>   3    LWP 101037 of process 56253 0x0050b58e in
> enqueue_one_email_alert (p=0x80253f400, s=0x8024dec00, q=0x802482600,
>     msg=0x7fffdfdfc770 "Health check for server Test-SNI_ipvANY/srv451-4
> failed, reason: Layer4 connection problem, info: \"General socket error
> (Network is unreachable)\", check duration: 0ms, status: 0/2 DOWN") at
> src/checks.c:3396

This is quite odd. Either some code quit a function without releasing the
server's lock or the email_queue's lock (suspicious but possible),
or both just happen to be the same lock. And at first glance, seeing that
process_chk_conn() is called with a context of 0x802482610, which is
only 16 bytes above enqueue_one_email_alert's queue parameter, this
last possibility suddenly seems quite likely.

So my guess is that we take the server's lock in process_chk_conn(),
that we go down through this:

  chk_report_conn_err()
    -> set_server_check_status()
      -> send_email_alert()
        -> enqueue_email_alert()
          -> enqueue_one_email_alert()

And here we spin on a lock which very likely is the same one, even though I
have no idea why. I suspect the way the queue or lock is retrieved
there is incorrect and explains the situation as a recursive lock,
but I'm afraid I don't really know, since I don't know this part of
the code :-(

Note, it could also be that the queue's lock has never been initialized,
which makes the inner lock block there with the server's lock held, causing
the second thread to spin and the third one to wait at the barrier. I'll
have to ask for some help on this part. I'm pretty confident that 1.8 is
affected as well.

Thanks,
Willy



dev1.9 2018/06/05 threads cpu 100% spin_lock v.s. thread_sync_barrier

2018-06-11 Thread PiBa-NL

Hi List,

I've got no clue how I got into this state ;) and maybe there is nothing 
wrong.. (well, I did resume a VM that was suspended for half a day..)


Still, I thought it might be worth reporting, or perhaps it's solved already, 
as there are a few fixes for threads after the 6-6 snapshot that I built 
with..
Sometimes all that some people need is half an idea to find a problem... 
So maybe there is something that needs fixing??


Haproxy was running with 3 threads at 300% CPU usage.. some Lua, almost no 
traffic, in a VM that had just resumed operation and was still going through 
its passes to initialize its NICs, and some stuff that noticed the clock 
had jumped and that its DHCP lease had expired, or something like that.. 
anyhow, lots of things were going on at that moment..


Below are some of the details I've got about the threads; one is spinning, 
the others seemingly waiting on spin_locks.


Like I wrote, I'm not sure if it's 'something', and I don't know yet if I 
can reproduce it a second time.
If more info is needed, please let me know and I'll try to provide it. 
But at the moment it's a one-time occurrence, I think..

If there is nothing obviously wrong, for now maybe just ignore this mail.
Also, I'll update to the latest snapshot, 2018/06/08. Maybe I won't see it 
ever again :).


haproxy -vv
HA-Proxy version 1.9-dev0-cc0a957 2018/06/05
Copyright 2000-2017 Willy Tarreau 

Build options :
  TARGET  = freebsd
  CPU = generic
  CC  = cc
  CFLAGS  = -pipe -g -fstack-protector -fno-strict-aliasing 
-fno-strict-aliasing -Wdeclaration-after-statement -fwrapv 
-fno-strict-overflow -Wno-address-of-packed-member -Wno-null-dereference 
-Wno-unused-label -DFREEBSD_PORTS
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_CPU_AFFINITY=1 
USE_ACCEPT4=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_STATIC_PCRE=1 
USE_PCRE_JIT=1


Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), 
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")

Built with PCRE version : 8.40 2017-01-11
Running on PCRE version : 8.40 2017-01-11
PCRE library supports JIT : yes
Built with multi-threading support.
Encrypted password support via crypt(3): yes
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY
Built with Lua version : Lua 5.3.4
Built with OpenSSL version : OpenSSL 1.0.2k-freebsd  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2o-freebsd  27 Mar 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2

Available polling systems :
 kqueue : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
    [TRACE] trace
    [COMP] compression
    [SPOE] spoe



(gdb) info threads
  Id   Target Id Frame
* 1    LWP 100660 of process 56253 0x005b0202 in 
thread_sync_barrier (barrier=0x8bc690 ) at 
src/hathreads.c:109
  2    LWP 101036 of process 56253 0x0050874a in 
process_chk_conn (t=0x8025187c0, context=0x802482610, state=33) at 
src/checks.c:2112
  3    LWP 101037 of process 56253 0x0050b58e in 
enqueue_one_email_alert (p=0x80253f400, s=0x8024dec00, q=0x802482600,
    msg=0x7fffdfdfc770 "Health check for server 
Test-SNI_ipvANY/srv451-4 failed, reason: Layer4 connection problem, 
info: \"General socket error (Network is unreachable)\", check duration: 
0ms, status: 0/2 DOWN") at src/checks.c:3396

(gdb) next
110 in src/hathreads.c
(gdb) next
109 in src/hathreads.c
(gdb) next
110 in src/hathreads.c
(gdb) next
109 in src/hathreads.c
(gdb) next
110 in src/hathreads.c
(gdb) next
109 in src/hathreads.c
(gdb) next


Command name abbreviations are allowed if unambiguous.
(gdb) thread 1
[Switching to thread 1 (LWP 100660 of process 56253)]
#0  thread_sync_barrier (barrier=0x8bc690 ) 
at src/hathreads.c:109

109 src/hathreads.c: No such file or directory.
(gdb) bt full
#0  thread_sync_barrier (barrier=0x8bc690 ) 
at src/hathreads.c:109

    old = 7
#1  0x005b0038 in thread_enter_sync () at src/hathreads.c:122
    barrier = 1
#2  0x0051737c in sync_poll_loop () at src/haproxy.c:2380
No locals.
#3  0x005172ed in run_poll_loop () at src/haproxy.c:2432
    next = -357534104
    exp = -357534104
#4  0x00514670 in run_thread_poll_loop (data=0x802491380) at 
src/haproxy.c:2462

    start_lock = 0
    ptif = 0x8af8f8 
    ptdf = 0x7fffec80
#5  0x00511199 in main (argc=10, argv=0x7fffec28) at 
src/haproxy.c:3052

    tids = 0x802491380
    threads = 0x8024831a0
    i = 3
    err = 0
    retry = 200
    limit = {rlim_cur = 6116, rlim_max = 6116}
    errmsg =