Re: cygwin compilation error
I messed up the commit message. One more try. Wed, 8 May 2019 at 11:33, Ilya Shipitsin : > small fix > > Wed, 8 May 2019 at 11:12, Willy Tarreau : > >> On Wed, May 08, 2019 at 11:09:04AM +0500, Ilya Shipitsin wrote: >> > Wed, 8 May 2019 at 11:06, Willy Tarreau : >> > >> > > On Wed, May 08, 2019 at 10:59:20AM +0500, Ilya Shipitsin wrote: >> > > > travis-ci supports windows builds. >> > > >> > > cool! >> > > >> > >> > my current roadmap is >> > >> > 1) a patch that fixes the SSL variants (already sent to the list). without it we are >> NOT >> > building LibreSSL at all (i.e. we use the default openssl-1.0.2 for all >> builds) >> >> Pushed just now. >> >> > 2) BoringSSL >> > >> > 3) update gcc, clang, enable sanitizers >> > >> > 4) cygwin >> >> OK, sounds good. >> >> Thanks, >> Willy >> > From ad9961e92c692430272c9088a49759c889dac6f1 Mon Sep 17 00:00:00 2001 From: Ilya Shipitsin Date: Wed, 8 May 2019 11:32:02 +0500 Subject: [PATCH] BUILD: do not use "RAND_keep_random_devices_open" when building against LibreSSL --- src/haproxy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/haproxy.c b/src/haproxy.c index 4c371254..c8a8aaf0 100644 --- a/src/haproxy.c +++ b/src/haproxy.c @@ -590,7 +590,7 @@ void mworker_reload() ptdf->fct(); if (fdtab) deinit_pollers(); -#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L) +#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L) && !defined(LIBRESSL_VERSION_NUMBER) if (global.ssl_used_frontend || global.ssl_used_backend) /* close random device FDs */ RAND_keep_random_devices_open(0); -- 2.20.1
Re: cygwin compilation error
small fix Wed, 8 May 2019 at 11:12, Willy Tarreau : > On Wed, May 08, 2019 at 11:09:04AM +0500, Ilya Shipitsin wrote: > > Wed, 8 May 2019 at 11:06, Willy Tarreau : > > > > > On Wed, May 08, 2019 at 10:59:20AM +0500, Ilya Shipitsin wrote: > > > > travis-ci supports windows builds. > > > > > > cool! > > > > > > > my current roadmap is > > > > 1) a patch that fixes the SSL variants (already sent to the list). without it we are NOT > > building LibreSSL at all (i.e. we use the default openssl-1.0.2 for all > builds) > > Pushed just now. > > > 2) BoringSSL > > > > 3) update gcc, clang, enable sanitizers > > > > 4) cygwin > > OK, sounds good. > > Thanks, > Willy > From ad9961e92c692430272c9088a49759c889dac6f1 Mon Sep 17 00:00:00 2001 From: Ilya Shipitsin Date: Wed, 8 May 2019 11:32:02 +0500 Subject: [PATCH] BUILD: do not use && !defined LIBRESSL_VERSION_NUMBER) when building against LibreSSL --- src/haproxy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/haproxy.c b/src/haproxy.c index 4c371254..c8a8aaf0 100644 --- a/src/haproxy.c +++ b/src/haproxy.c @@ -590,7 +590,7 @@ void mworker_reload() ptdf->fct(); if (fdtab) deinit_pollers(); -#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L) +#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L) && !defined(LIBRESSL_VERSION_NUMBER) if (global.ssl_used_frontend || global.ssl_used_backend) /* close random device FDs */ RAND_keep_random_devices_open(0); -- 2.20.1
Re: [PATCH 1/1] BUILD: travis-ci bugfixes and improvements
On Tue, May 07, 2019 at 01:42:43AM +0500, chipits...@gmail.com wrote: > From: Ilya Shipitsin > > Call missing scripts/build-ssl.sh (which actually builds SSL variants) > Enable OpenSSL, LibreSSL builds caching, it saves a bunch of time > LibreSSL builds are not allowed to fail anymore > Add openssl to osx builds Merged, thanks! Willy
Re: cygwin compilation error
On Wed, May 08, 2019 at 11:09:04AM +0500, Ilya Shipitsin wrote: > Wed, 8 May 2019 at 11:06, Willy Tarreau : > > > On Wed, May 08, 2019 at 10:59:20AM +0500, Ilya Shipitsin wrote: > > > travis-ci supports windows builds. > > > > cool! > > > > my current roadmap is > > 1) a patch that fixes the SSL variants (already sent to the list). without it we are NOT > building LibreSSL at all (i.e. we use the default openssl-1.0.2 for all builds) Pushed just now. > 2) BoringSSL > > 3) update gcc, clang, enable sanitizers > > 4) cygwin OK, sounds good. Thanks, Willy
Re: cygwin compilation error
Wed, 8 May 2019 at 11:06, Willy Tarreau : > On Wed, May 08, 2019 at 10:59:20AM +0500, Ilya Shipitsin wrote: > > travis-ci supports windows builds. > > cool! > my current roadmap is 1) a patch that fixes the SSL variants (already sent to the list). without it we are NOT building LibreSSL at all (i.e. we use the default openssl-1.0.2 for all builds) 2) BoringSSL 3) update gcc, clang, enable sanitizers 4) cygwin > > > I will add such a build a bit later (after > > we settle with the current travis-ci fixes) > > ...and this cygwin build issue :-) > > Willy >
Re: cygwin compilation error
On Wed, May 08, 2019 at 10:59:20AM +0500, Ilya Shipitsin wrote: > travis-ci supports windows builds. cool! > I will add such a build a bit later (after > we settle with the current travis-ci fixes) ...and this cygwin build issue :-) Willy
Re: haproxy 2.0 docker images
Hi Aleks, On Mon, May 06, 2019 at 08:17:23AM +0200, Aleksandar Lazic wrote: > > The outputs below raise some questions for me. > > > > * Should the OPTIONS output also include EXTRA_OBJS ? That's a good question. I was hesitating but given that the goal is to be able to easily rebuild a similar executable, maybe we should add it indeed. > > * Should PCRE2 be used instead of PCRE ? No opinion :-) > > * Should PRIVATE_CACHE be used in the default build? No, because this one disables inter-process sharing of SSL sessions. > > * Should SLZ be used in the default build? It's just a matter of choice. I personally always build with it for prod servers because it saves a huge amount of memory and some CPU, but it also adds one extra dependency. I'd say that if it doesn't require extra effort it's worth it. If it adds some packaging burden you can simply drop it and fall back to zlib. > > * Does NS make sense in a container image? I don't think so indeed, though it doesn't cost much to keep it, at least so that you use the same build options everywhere. > > * Can DEVICEATLAS, 51DEGREES and WURFL be used together? > > - From a technical point of view From a technical point of view I don't see any obvious incompatibility. However doing automated builds from all 3 of these might not always be trivial as it will require that you can include these respective libraries, some of which may only be downloaded after registering on their site. Please don't ship an executable built with the dummy libs since it will be useless and misleading (it's only useful for full-featured builds). > > - From a license point of view You have to carefully check. I believe at least one of them mentions patents so this can even make the resulting executable look dangerous for some users and make them stay away from your images. Anyway as usual with anything related to licensing, the best advice I could give you is to ask a lawyer :-/ This alone might be a valid reason for not wasting too much time down this road. Cheers, Willy
Re: cygwin compilation error
travis-ci supports windows builds. I will add such a build a bit later (after we settle with the current travis-ci fixes) Wed, 8 May 2019 at 10:52, Willy Tarreau : > Hi, > > On Mon, May 06, 2019 at 12:54:47PM +0300, Gil Bahat wrote: > > Hi, > > > > is cygwin still supported? > > Well, we never know :-) I mean, we're always open to fixes to make it > work as long as they don't impact other platforms. > > > the target seems to be present in the > > Makefiles and I'd love to be able to use it. I'm running into what seems > to > > be a workable linker error: > > > > $ make TARGET=cygwin > > LD haproxy > > src/http_act.o:http_act.c:(.rdata+0x340): multiple definition of > > `.weak.ist_uc.' > > src/ev_poll.o:ev_poll.c:(.rdata+0x20): first defined here > > Aie, that's really bad, it means the linker doesn't support weak symbols :-( > Weak symbols are very handy as they are able to be included and linked in > only once if they are used, and not linked if unused. The info I'm finding > on the net suggests that symbols must be resolved at link time, which is the > case here. So maybe it's just a matter of definition. > > I can suggest a few things to try in include/common/ist.h : > > - replace "__weak__" with "weak" just in case it's different there > (I don't even know why I marked it "__weak__", probably just by > mimicry with "__attribute__" and because it worked) > > - add "#pragma weak ist_lc" and "#pragma weak ist_uc" in ist.h, > before the definitions > > - add "extern const unsigned char ist_lc[256];" and > "extern const unsigned char ist_uc[256];" before the definitions > > In case one of them is enough to work, we can merge them. > > Thanks, > Willy > >
Re: cygwin compilation error
Hi, On Mon, May 06, 2019 at 12:54:47PM +0300, Gil Bahat wrote: > Hi, > > is cygwin still supported? Well, we never know :-) I mean, we're always open to fixes to make it work as long as they don't impact other platforms. > the target seems to be present in the > Makefiles and I'd love to be able to use it. I'm running into what seems to > be a workable linker error: > > $ make TARGET=cygwin > LD haproxy > src/http_act.o:http_act.c:(.rdata+0x340): multiple definition of > `.weak.ist_uc.' > src/ev_poll.o:ev_poll.c:(.rdata+0x20): first defined here Aie, that's really bad, it means the linker doesn't support weak symbols :-( Weak symbols are very handy as they are able to be included and linked in only once if they are used, and not linked if unused. The info I'm finding on the net suggests that symbols must be resolved at link time, which is the case here. So maybe it's just a matter of definition. I can suggest a few things to try in include/common/ist.h : - replace "__weak__" with "weak" just in case it's different there (I don't even know why I marked it "__weak__", probably just by mimicry with "__attribute__" and because it worked) - add "#pragma weak ist_lc" and "#pragma weak ist_uc" in ist.h, before the definitions - add "extern const unsigned char ist_lc[256];" and "extern const unsigned char ist_uc[256];" before the definitions In case one of them is enough to work, we can merge them. Thanks, Willy
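To make the three suggestions above concrete, here is a rough sketch of where each change could sit in include/common/ist.h. It is illustrative only, not the actual haproxy header (the real table contents and surrounding code are omitted), and only one option would be tried at a time:

/* Illustrative sketch only -- not the real include/common/ist.h. */

/* Option 1: spell the attribute "weak" instead of "__weak__" in case the
 * cygwin toolchain treats the two differently. */
const unsigned char ist_lc[256] __attribute__((weak)) = { 0 /* ... 256 entries ... */ };
const unsigned char ist_uc[256] __attribute__((weak)) = { 0 /* ... 256 entries ... */ };

/* Option 2: declare the symbols weak with a pragma placed before the
 * definitions. */
#pragma weak ist_lc
#pragma weak ist_uc

/* Option 3: add extern declarations before the definitions so the linker
 * resolves a single copy at link time. */
extern const unsigned char ist_lc[256];
extern const unsigned char ist_uc[256];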
Re: haproxy-1.9 sanitizers finding
Hi Ilya, On Tue, May 07, 2019 at 11:47:54AM +0500, Ilya Shipitsin wrote: > Hello, > > when running regtests against the 1.9 branch there are findings (not seen in > the master branch) > > *** h10.0 > debug|= > *** h10.0 debug|==16493==ERROR: AddressSanitizer: heap-use-after-free > on address 0x61903c95 at pc 0x006ca207 bp 0x7ffd92124b60 sp > 0x7ffd92124b50 > *** h10.0 debug|WRITE of size 1 at 0x61903c95 thread T0 > *** h10.0 debug|#0 0x6ca206 in update_log_hdr src/log.c:1260 > *** h10.0 debug|#1 0x6ca206 in __send_log src/log.c:1445 > *** h10.0 debug|#2 0x6ca48a in send_log src/log.c:1323 (...) OK, these are the same ones you reported on master, which are fixed there but not backported yet. It should eventually get backported ;-) Thanks, Willy
Re: systemd watchdog support?
Hi guys, On Tue, May 07, 2019 at 10:40:17PM +0200, William Lallemand wrote: > Hi Patrick, > > On Tue, May 07, 2019 at 02:23:15PM -0400, Patrick Hemmer wrote: > > So with the prevalence of the issues lately where haproxy is going > > unresponsive and consuming 100% CPU, I wanted to see what thoughts were > > on implementing systemd watchdog functionality. First, let me tell you I'm also all for a watchdog system. For me, an unresponsive process is the worst thing that can ever happen because it's the hardest one to detect and it takes time to fix it. This is also why I've been working on a lockup detection for the worker processes that is able to produce some context info and possibly an analysable core dump. I expect to have it for 2.0-final, this is important to accelerate finding of such painful bugs and fix them early if any remains. > The master uses a special backend, invisible to the user, which contains 1 > server per worker, it uses the socketpair of the worker for the address. They > are always connected and they can communicate. This architecture allows to > forward commands to the CLI of the worker. > > One of my ideas was to do the equivalent of adding a "check" keyword for each > of these server line. We would have to implement a special check which will > send a CLI command and wait for its response. > > If one of the server does not respond, we could execute the exit-on-failure > procedure. I'd like that we keep a trace of the failed process. Send it a SIGXCPU or SIGABRT, and kill the other ones cleanly. > > The last idea would be to have the watchdog watch the master only, and > > the master watches the workers in turn. If a worker stops responding, > > the master would restart just that one worker. > > > > That's not a good idea to restart only one worker, and that's not possible > with > the current architecture, and too much complicated. In my opinion it's better > to kill everything so systemd can restart properly with Restart=on-failure, > this is what is done when one of the worker segfault, for example. I totally agree. In the past when nbproc was used a lot, we've had many reports of people getting caught by one process dying once in a while, till the point where there were not enough processes left to handle the traffic, making the service barely responsive but still up. This gives a terrible image of a hosted service outside, while a dead process would be detected, failed-over or restarted. Cheers, Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi Maciej, On Tue, May 07, 2019 at 07:08:47PM +0200, Maciej Zdeb wrote: > Hi, > > I've got another bug with 100% CPU on HAProxy process, it is built from > HEAD of 1.9 branch. > > One of processes stuck in infinite loop, admin socket is not responsive so > I've got information only from gdb: > > 0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at src/mux_h2.c:2589 > 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) > (gdb) n (...) CCing Olivier. Olivier, I'm wondering if this is not directly related to what you addressed with this fix merged in 2.0 but not backported : 998410a ("BUG/MEDIUM: h2: Revamp the way send subscriptions works.") >From what I'm seeing there's no error, the stream is in the sending list, there's no blocking flag, well, everything looks OK, but we're looping on SUB_CALL_UNSUBSCRIBE which apprently should not if I understand it right. Do you think we should backport this patch ? Remaining of the trace below for reference. THanks, Willy --- > 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & > H2_CF_MUX_BLOCK_ANY) > (gdb) > 2586list_for_each_entry(h2s, &h2c->send_list, list) { > (gdb) > 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & > H2_CF_MUX_BLOCK_ANY) > (gdb) > 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) > (gdb) > 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & > H2_CF_MUX_BLOCK_ANY) > (gdb) > 2586list_for_each_entry(h2s, &h2c->send_list, list) { > (gdb) > 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & > H2_CF_MUX_BLOCK_ANY) > (gdb) p h2c > $1 = (struct h2c *) 0x2e8ff30 > (gdb) p *h2c > $2 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR, > flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht > = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149, > dfl = 0, > dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size > = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft = > 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384, > timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53, > nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0, > streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p = > 0x3093c18}, > fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n = > 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list > = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle = > 0x0, events = 1}} > (gdb) n > 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) > (gdb) p *h2s > $3 = {cs = 0x297bdb0, sess = 0x819580 , h2c = 0x2e8ff30, h1m > = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0, body_len = 0, next = > 0, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = > {0x2a72250, > 0x2961c30}}, node_p = 0x2a72251, leaf_p = 0x2961c31, bit = 1, pfx > = 49017}, key = 103}, id = 103, flags = 16385, mws = 6291456, errcode = > H2_ERR_NO_ERROR, st = H2_SS_HREM, status = 0, body_len = 0, rxbuf = {size = > 0, > area = 0x0, data = 0, head = 0}, wait_event = {task = 0x2fb3ee0, handle > = 0x0, events = 0}, recv_wait = 0x2b8d700, send_wait = 0x2b8d700, list = {n > = 0x3130108, p = 0x2b02238}, sending_list = {n = 0x3130118, p = 0x2b02248}} > (gdb) n > 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & > H2_CF_MUX_BLOCK_ANY) > (gdb) p *h2c > $4 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR, > flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht > = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 
149, > dfl = 0, > dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size > = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft = > 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384, > timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53, > nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0, > streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p = > 0x3093c18}, > fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n = > 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list > = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle = > 0x0, events = 1}} > (gdb) n > 2586list_for_each_entry(h2s, &h2c->send_list, list) { > (gdb) n > 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & > H2_CF_MUX_BLOCK_ANY) > (gdb) n > 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) > > > HAProxy info: > HA-Proxy version 1.9.7-207ba5a 2019/05/05 - https://haproxy.org/ > Build options : > TARGET = linux2628 > CPU = generic > CC = gcc > CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement > -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter > -W
Re: systemd watchdog support?
Hi Patrick, On Tue, May 07, 2019 at 02:23:15PM -0400, Patrick Hemmer wrote: > So with the prevalence of the issues lately where haproxy is going > unresponsive and consuming 100% CPU, I wanted to see what thoughts were > on implementing systemd watchdog functionality. > > In our case, haproxy going unresponsive is extremely problematic as our > clustering software (pacemaker+systemd) sees the service still running, > and doesn't realize it needs to restart the service or fail over. > We could look into implementing some sort of custom check resource in > pacemaker, but before going down that route I wanted to explore the > systemd watchdog functionality. > > The watchdog is implemented by periodically sending "WATCHDOG=1" on the > systemd notification socket. However there are a few different ways I > can see this being implemented. > > We could put this in the master control process, but this only tells us > if the master is functioning, not the workers, which are what really matter. > > So the next thought would be for all of the workers to listen on a > shared socket. The master would periodically send a request to that > socket, and as long as it gets a response, it pings the watchdog. This > tells us that there is at least one worker able to accept traffic. > > However if a frontend is bound to a specific worker, then that would > frontend would be non-responsive, and the watchdog wouldn't restart the > service. For that the worker would have to send a request to each worker > separately, and require a response from all of them before it pings the > watchdog. This would be better able to detect issues, but for some > people who aren't using any bound-to-process frontends, they would be > able to handle failure of a single worker and potentially schedule a > restart/reload at a less impactful time. > The master uses a special backend, invisible to the user, which contains 1 server per worker, it uses the socketpair of the worker for the address. They are always connected and they can communicate. This architecture allows to forward commands to the CLI of the worker. One of my ideas was to do the equivalent of adding a "check" keyword for each of these server line. We would have to implement a special check which will send a CLI command and wait for its response. If one of the server does not respond, we could execute the exit-on-failure procedure. > > The last idea would be to have the watchdog watch the master only, and > the master watches the workers in turn. If a worker stops responding, > the master would restart just that one worker. > That's not a good idea to restart only one worker, and that's not possible with the current architecture, and too much complicated. In my opinion it's better to kill everything so systemd can restart properly with Restart=on-failure, this is what is done when one of the worker segfault, for example. > > Any thoughts on the matter, or do we not want to do this, and rely on a > custom check in the cluster management software? > > -Patrick > -- William Lallemand
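As a rough illustration of the kind of check William describes (send a CLI command to a worker and require an answer within a timeout), here is a minimal stand-alone sketch. It is not how the master actually reaches its workers (that goes through the internal backend over socketpairs mentioned above); the socket path, command and timeout below are placeholders:

/* Sketch only: ping a CLI socket and treat a missing reply as a failure. */
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* returns 1 if the CLI answered within timeout_ms, 0 otherwise */
static int cli_ping(const char *path, int timeout_ms)
{
    struct sockaddr_un sun;
    char buf[256];
    int ok = 0;

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;

    memset(&sun, 0, sizeof(sun));
    sun.sun_family = AF_UNIX;
    strncpy(sun.sun_path, path, sizeof(sun.sun_path) - 1);

    if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == 0 &&
        write(fd, "show info\n", 10) == 10) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        /* a worker stuck in a busy loop never answers, so this times out */
        if (poll(&pfd, 1, timeout_ms) > 0 && read(fd, buf, sizeof(buf)) > 0)
            ok = 1;
    }
    close(fd);
    return ok;
}

int main(void)
{
    /* placeholder path; a real check would target each worker separately */
    if (!cli_ping("/var/run/haproxy-worker.sock", 2000))
        fprintf(stderr, "worker did not answer, trigger exit-on-failure\n");
    return 0;
}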
Re: [PATCH] wurfl device detection build fixes and dummy library
# Please type your reply above this line # You are registered as a cc on this help desk request and are thus receiving email notifications on all updates to the request. Reply to this email to add a comment to the request. -- Aaron Park, May 7, 15:30 EDT Hi Willy, I just wanted to check in to let you know that our engineers are continuing to build a patch to address the varying issues you are seeing. Because this build will take some time, unless you have any other questions, we will close out this ticket for now and will keep you informed of any updates we have for you. Thanks, Aaron e: supp...@scientiamobile.com ScientiaMobile Customer Support Team -- Willy Tarreau, Apr 19, 10:49 EDT Sorry, with the patches this time. Willy Attachment(s): 0007-WIP-wurfl-pass-fPIC-when-compiling.patch - https://support.scientiamobile.com/attachments/token/fAR7m6yvCQlVHA4dqp23I5sJS/?name=0007-WIP-wurfl-pass-fPIC-when-compiling.patch 0008-WIP-wurfl-fix-broken-symlinks.patch - https://support.scientiamobile.com/attachments/token/2Nrk5JLukGKWfCDj6rXLNFeik/?name=0008-WIP-wurfl-fix-broken-symlinks.patch 0009-WIP-wurfl-address-build-issues-by-doing-a-static-lib.patch - https://support.scientiamobile.com/attachments/token/mJXKrE2ecZHS7RjwXwTwbbQHi/?name=0009-WIP-wurfl-address-build-issues-by-doing-a-static-lib.patch 0010-WIP-wurfl-indicate-in-haproxy-vv-the-wurfl-version-i.patch - https://support.scientiamobile.com/attachments/token/Pc7IJxxno71IOlfHSByian8dp/?name=0010-WIP-wurfl-indicate-in-haproxy-vv-the-wurfl-version-i.patch 0011-WIP-wurfl-move-wurfl.h-into-wurfl-to-maintain-direct.patch - https://support.scientiamobile.com/attachments/token/qkvpDMK65KKVXVJsJyrNSxp7t/?name=0011-WIP-wurfl-move-wurfl.h-into-wurfl-to-maintain-direct.patch 0012-WIP-wurfl-mention-how-to-build-the-dummy-lib-in-the-.patch - https://support.scientiamobile.com/attachments/token/IWptQoEt33SpfGBPLJWNh5ZNL/?name=0012-WIP-wurfl-mention-how-to-build-the-dummy-lib-in-the-.patch 0013-WIP-wurfl-rename-makefile-to-Makefile.patch - https://support.scientiamobile.com/attachments/token/BD9188uXGpWCBfCxcungaYVoJ/?name=0013-WIP-wurfl-rename-makefile-to-Makefile.patch -- Willy Tarreau, Apr 19, 10:46 EDT Hi Paul, On Thu, Apr 18, 2019 at 02:46:17PM +0200, Paul Stephen Borile wrote: > please find attached to this email the 6 patches that cover various areas > of restyling of > the WURFL device detection feature for HAProxy. All patches can be back > ported to 1.9 if necessary. > Last patch is a dummy WURFL library that can be used to build/run haproxy > compiled with the USE_WURFL option to make easier checking for any build > problem in the future. > We'll try to do the same and make sure that the module does not break > builds again as happened in the past. So I gave a look to this patch set and had to perform a few adjustments to make it work but now it looks OK. I'm attaching the changes I made so that you can review them, they're all related to the dummy lib in order to 1) fix its build and 2) ease the testing without having to modify the build environment (since adding non-standard stuff into /usr/include or /usr/lib is a no-go on most development environments). I figured that it was much simpler to build a ".a" from the file so that it can naturally be loaded by the regular build process. I added the ability to report the libwurfl version in "haproxy -vv" and when the dummy lib is detected, it's explicitly mentioned "dummy library" there so that you don't have to deal with false positives when users report issues. 
I also added a little bit of doc explaining to haproxy devs how to build with wurfl. This way I think it could be added by default to any developer's build script so that it never breaks in the future. I'm attaching my changes. I'm fine with retrofitting them into your patches if they look OK to you. Please just let me know if you're OK to go with this (and if you're OK with me backporting this to 1.9 so that we can fix 1.9 once for all). Thanks! Willy PS: I've CCed the contact address in the maintainers file just to verify that there is no typo there, please confirm that it was properly received.' This email is a service from ScientiaMobile. [PM6MEL-7Z9M]
Re: HAProxy 1.9.6 unresponsive
Hi Patrick, On Tue, May 07, 2019 at 02:01:33PM -0400, Patrick Hemmer wrote: > Just in case it's useful, we had the issue recur today. However I gleaned a > little more information from this recurrence. Provided below are several > outputs from a gdb `bt full`. The important bit is that in the captures, the > last frame which doesn't change between each capture is the `si_cs_send` > function. The last stack capture provided has the shortest stack depth of > all the captures, and is inside `h2_snd_buf`. Thank you. At first glance this remains similar. Christopher and I have been studying these issues intensely these days because they have deep roots into some design choices and tradeoffs we've had to make and that we're relying on, and we've come to conclusions about some long term changes to address the causes, and some fixes for 1.9 that now appear valid. We're still carefully reviewing our changes before pushing them. Then I think we'll emit 1.9.8 anyway since it will already fix quite a number of issues addressed since 1.9.7, so for you it will probably be easier to try again. > Otherwise it's still the behavior is the same as last time with `strace` > showing absolutely nothing, so it's still looping. I'm not surprised. We managed to break that loop in a dirty way a first time but it came with impacts (some random errors could be spewed depending on the frame sizes, which is obviously not acceptable). But yes, this loop has no way to give up. That's the second argument convincing me of finishing the watchdog so that at least it dies when this happens! Expect some updates on this this week. Cheers, Willy
systemd watchdog support?
So with the prevalence of the issues lately where haproxy is going unresponsive and consuming 100% CPU, I wanted to see what thoughts were on implementing systemd watchdog functionality. In our case, haproxy going unresponsive is extremely problematic as our clustering software (pacemaker+systemd) sees the service still running, and doesn't realize it needs to restart the service or fail over. We could look into implementing some sort of custom check resource in pacemaker, but before going down that route I wanted to explore the systemd watchdog functionality. The watchdog is implemented by periodically sending "WATCHDOG=1" on the systemd notification socket. However there are a few different ways I can see this being implemented. We could put this in the master control process, but this only tells us if the master is functioning, not the workers, which are what really matter. So the next thought would be for all of the workers to listen on a shared socket. The master would periodically send a request to that socket, and as long as it gets a response, it pings the watchdog. This tells us that there is at least one worker able to accept traffic. However if a frontend is bound to a specific worker, then that frontend would be non-responsive, and the watchdog wouldn't restart the service. For that the master would have to send a request to each worker separately, and require a response from all of them before it pings the watchdog. This would be better at detecting issues, but people who aren't using any bound-to-process frontends would be able to handle the failure of a single worker and potentially schedule a restart/reload at a less impactful time. The last idea would be to have the watchdog watch the master only, and the master watch the workers in turn. If a worker stops responding, the master would restart just that one worker. Any thoughts on the matter, or do we not want to do this, and rely on a custom check in the cluster management software? -Patrick
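For reference, the notification-socket mechanism described above boils down to something like the following generic sketch (not haproxy code): a service started with Type=notify and WatchdogSec= in its unit file must send "WATCHDOG=1" more often than that interval, otherwise systemd considers it hung and applies the configured restart action. It uses libsystemd's sd-daemon API and links with -lsystemd:

#include <stdint.h>
#include <systemd/sd-daemon.h>
#include <unistd.h>

int main(void)
{
    uint64_t usec = 0;
    unsigned int interval = 1;

    /* returns > 0 if the watchdog is armed; usec receives the interval in microseconds */
    int watchdog = sd_watchdog_enabled(0, &usec);
    if (watchdog > 0 && usec >= 2000000)
        interval = (unsigned int)(usec / 2 / 1000000);  /* ping at half the interval */

    sd_notify(0, "READY=1");            /* startup finished */

    for (;;) {
        /* ... one iteration of real work / internal health checks ... */

        if (watchdog > 0)
            sd_notify(0, "WATCHDOG=1"); /* "still alive" ping */

        sleep(interval);
    }
}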
Re: HAProxy 1.9.6 unresponsive
*From:* Willy Tarreau [mailto:w...@1wt.eu] *Sent:* Monday, May 6, 2019, 08:42 EDT *To:* Patrick Hemmer *Cc:* haproxy@formilux.org *Subject:* HAProxy 1.9.6 unresponsive On Sun, May 05, 2019 at 09:40:02AM +0200, Willy Tarreau wrote: With this said, after studying the code a little bit more, I'm seeing a potential case where if we'd have a trailers entry in the HTX buffer but no end of message, we could loop forever there not consuming this block. I have no idea if this is possible in an HTX message, I'll ask Christopher tomorrow. In any case we need to address this one way or another, possibly reporting an error instead if required. Thus I'm postponing 1.9.8 for tomorrow. So the case is indeed possible and at the moment all we can do is try to minimize the probability to produce it :-( The issue is caused by the moment we've received the end of trailsers but not the end of the mesage. From the H2 protocol perspective if we've sent the END_STREAM flag, the stream is closed, and a closed stream gets detached and cannot receive new traffic, so at best we'll occasionally close too early and report client failures at the upper layers while everything went OK. We cannot send trailers without the END_STREAM flag since no frame may follow. Abusing CONTINUATION is out of question here as this would require to completely freeze the whole connection (including control frames) for the time it takes to get this final EOM block. I thought about simply reporting an error when we're in this situation between trailers and EOM but it will mean that occasionally some chunked responses of sizes close to 16N kB with trailers may err out, which is not acceptable either. For 2.0 we approximately see what needs to be modified to address this situation, but that will not be trivial and not backportable. For 1.9 I'm still trying to figure what the "best" solution is. I may finally end up marking the stream as closed as soon as we see the trailers pushed down. I'm just unsure right now about all the possible consequences and need to study the edge cases. Also I fear that this will be something hard to unroll later, so I'm still studying. Willy Just in case it's useful, we had the issue recur today. However I gleaned a little more information from this recurrence. Provided below are several outputs from a gdb `bt full`. The important bit is that in the captures, the last frame which doesn't change between each capture is the `si_cs_send` function. The last stack capture provided has the shortest stack depth of all the captures, and is inside `h2_snd_buf`. Otherwise it's still the behavior is the same as last time with `strace` showing absolutely nothing, so it's still looping. 
#0 h1_headers_to_hdr_list (start=0x7f5a4ea6b5fb "grpco\243?", stop=0x7f5a4ea6b5ff "o\243?", hdr=hdr@entry=0x7ffdc58f6400, hdr_num=hdr_num@entry=101, h1m=h1m@entry=0x7ffdc58f63d0, slp=slp@entry=0x0) at src/h1.c:793 ret = state = ptr = end = hdr_count = skip = 0 sol = col = eol = sov = sl = skip_update = restarting = n = v = {ptr = 0x7f5a4eb51453 "LZ\177", len = 140025825685243} #1 0x7f5a4d862539 in h2s_htx_make_trailers (h2s=h2s@entry=0x7f5a4ecc7860, htx=htx@entry=0x7f5a4ea67630) at src/mux_h2.c:4996 list = {{n = {ptr = 0x0, len = 0}, v = {ptr = 0x0, len = 0}} } h2c = 0x7f5a4ec56610 blk = blk_end = 0x0 outbuf = {size = 140025844274259, area = 0x7f5a4d996efb "\205\300~\aHc\320H\001SXH\205\355t\026Lc\310E1\300D\211\351L\211⾃", data = 16472, head = 140025845781936} h1m = {state = H1_MSG_HDR_NAME, flags = 2056, curr_len = 0, body_len = 0, next = 4, err_pos = 0, err_state = 1320431563} type = ret = hdr = 0 idx = 5 start = #2 0x7f5a4d866ef5 in h2_snd_buf (cs=0x7f5a4e9a8980, buf=0x7f5a4e777d78, count=4, flags=) at src/mux_h2.c:5372 h2s = orig_count = total = 16291 ret = htx = 0x7f5a4ea67630 blk = btype = idx = #3 0x7f5a4d8f4be4 in si_cs_send (cs=cs@entry=0x7f5a4e9a8980) at src/stream_interface.c:691 send_flag = conn = 0x7f5a4e86f4c0 si = 0x7f5a4e777f98 oc = 0x7f5a4e777d70 ret = did_send = 0 #4 0x7f5a4d8f6305 in si_cs_io_cb (t=, ctx=0x7f5a4e777f98, state=) at src/stream_interface.c:737 si = 0x7f5a4e777f98 cs = 0x7f5a4e9a8980 ret = 0 #5 0x7f5a4d925f02 in process_runnable_tasks () at src/task.c:437 t = state = ctx = process = t = max_processed = #6 0x7f5a4d89f6ff in run_poll_loop () at src/haproxy.c:2642 next = exp = #7 run_thread_poll_loop (data=data@entry=0x7f5a4e62a9b0) at src/haproxy.c:
[1.9 HEAD] HAProxy using 100% CPU
Hi, I've got another bug with 100% CPU on HAProxy process, it is built from HEAD of 1.9 branch. One of processes stuck in infinite loop, admin socket is not responsive so I've got information only from gdb: 0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at src/mux_h2.c:2589 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) (gdb) n 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY) (gdb) 2586list_for_each_entry(h2s, &h2c->send_list, list) { (gdb) 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY) (gdb) 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) (gdb) 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY) (gdb) 2586list_for_each_entry(h2s, &h2c->send_list, list) { (gdb) 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY) (gdb) p h2c $1 = (struct h2c *) 0x2e8ff30 (gdb) p *h2c $2 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR, flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149, dfl = 0, dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft = 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384, timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53, nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0, streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p = 0x3093c18}, fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n = 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle = 0x0, events = 1}} (gdb) n 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) (gdb) p *h2s $3 = {cs = 0x297bdb0, sess = 0x819580 , h2c = 0x2e8ff30, h1m = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0, body_len = 0, next = 0, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x2a72250, 0x2961c30}}, node_p = 0x2a72251, leaf_p = 0x2961c31, bit = 1, pfx = 49017}, key = 103}, id = 103, flags = 16385, mws = 6291456, errcode = H2_ERR_NO_ERROR, st = H2_SS_HREM, status = 0, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head = 0}, wait_event = {task = 0x2fb3ee0, handle = 0x0, events = 0}, recv_wait = 0x2b8d700, send_wait = 0x2b8d700, list = {n = 0x3130108, p = 0x2b02238}, sending_list = {n = 0x3130118, p = 0x2b02248}} (gdb) n 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY) (gdb) p *h2c $4 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR, flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149, dfl = 0, dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft = 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384, timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53, nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0, streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p = 0x3093c18}, fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n = 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle = 0x0, events = 1}} (gdb) n 2586list_for_each_entry(h2s, &h2c->send_list, list) { (gdb) n 
2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY) (gdb) n 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) HAProxy info: HA-Proxy version 1.9.7-207ba5a 2019/05/05 - https://haproxy.org/ Build options : TARGET = linux2628 CPU = generic CC = gcc CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits -DIP_BIND_ADDRESS_NO_PORT=24 OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_DL=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 Default settings : maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200 Built with OpenSSL version : OpenSSL 1.1.1b 26 Feb 2019 Running on OpenSSL version : OpenSSL 1.1.1b 26 Feb 2019 OpenSSL library supports TLS extensions : yes OpenSSL library supports SNI : yes OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3 Built with Lua version : Lua 5.3.5 Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND Built with zlib version : 1.2.8 Running on zlib version : 1.2.8 Compression algorithms supported
Re: [External] Re: QAT intermittent healthcheck errors
On 5/7/19 3:35 PM, Marcin Deranek wrote: > Hi Emeric, > > On 5/7/19 1:53 PM, Emeric Brun wrote: >> On 5/7/19 1:24 PM, Marcin Deranek wrote: >>> Hi Emeric, >>> >>> On 5/7/19 11:44 AM, Emeric Brun wrote: Hi Marcin,>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for end result). Unfortunately after applying the patch there is no change in behavior: we still leak /dev/usdm_drv descriptors and have "stuck" HAProxy instances after reload.. >>> Regards, >> >> Could you perform a test recompiling the usdm_drv and the engine with this patch, it applies on QAT 1.7 but I've no hardware to test this version here. It should fix the fd leak. >>> >>> It did fix fd leak: >>> >>> # ls -al /proc/2565/fd|fgrep dev >>> lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null >>> lrwx-- 1 root root 64 May 7 13:15 7 -> /dev/usdm_drv >>> >>> # systemctl reload haproxy.service >>> # ls -al /proc/2565/fd|fgrep dev >>> lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null >>> lrwx-- 1 root root 64 May 7 13:15 8 -> /dev/usdm_drv >>> >>> # systemctl reload haproxy.service >>> # ls -al /proc/2565/fd|fgrep dev >>> lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null >>> lrwx-- 1 root root 64 May 7 13:15 9 -> /dev/usdm_drv >>> >>> But there are still stuck processes :-( This is with both patches included: >>> for QAT and HAProxy. >>> Regards, >>> >>> Marcin Deranek >> >> Thank you Marcin! Anyway it's was also a bug. >> >> Could you process a 'show fds' command on a stucked process adding the patch >> in attachement. > > I did apply this patch and all previous patches (QAT + HAProxy > ssl_free_engine). This is what I got after 1st reload: > > show proc > # > 8025 master 0 1 0d 00h03m25s > # workers > 31269 worker 1 0 0d 00h00m39s > 31270 worker 2 0 0d 00h00m39s > 31271 worker 3 0 0d 00h00m39s > 31272 worker 4 0 0d 00h00m39s > # old workers > 9286 worker [was: 1] 1 0d 00h03m25s > 9287 worker [was: 2] 1 0d 00h03m25s > 9288 worker [was: 3] 1 0d 00h03m25s > 9289 worker [was: 4] 1 0d 00h03m25s > > @!9286 show fd > 13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x23eaae0 > iocb=0x4877c0(mworker_accept_wrapper) tmask=0x1 umask=0x0 > 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e1ab0 > iocb=0x4e1ab0(thread_sync_io_handler) tmask=0x umask=0x0 > 20 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1601b840 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 21 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x1f0ec4f0 > iocb=0x4ce6e0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL > mux=PASS mux_ctx=0x22ad8630 > 1412 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bab1f30 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1413 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x247e5bc0 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1414 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x18883650 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1415 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x14476c10 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1416 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11a27850 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1418 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x12008230 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1419 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bb0a570 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1420 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x11c94790 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1421 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1449e050 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1422 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1f00c150 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1423 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x15f40550 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1424 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x124b6340 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1425 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11fe4500 > iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 > 1426 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11c70a60 > iocb=0x4f4d50(
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric, On 5/7/19 1:53 PM, Emeric Brun wrote: On 5/7/19 1:24 PM, Marcin Deranek wrote: Hi Emeric, On 5/7/19 11:44 AM, Emeric Brun wrote: Hi Marcin,>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for end result). Unfortunately after applying the patch there is no change in behavior: we still leak /dev/usdm_drv descriptors and have "stuck" HAProxy instances after reload.. Regards, Could you perform a test recompiling the usdm_drv and the engine with this patch, it applies on QAT 1.7 but I've no hardware to test this version here. It should fix the fd leak. It did fix fd leak: # ls -al /proc/2565/fd|fgrep dev lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null lrwx-- 1 root root 64 May 7 13:15 7 -> /dev/usdm_drv # systemctl reload haproxy.service # ls -al /proc/2565/fd|fgrep dev lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null lrwx-- 1 root root 64 May 7 13:15 8 -> /dev/usdm_drv # systemctl reload haproxy.service # ls -al /proc/2565/fd|fgrep dev lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null lrwx-- 1 root root 64 May 7 13:15 9 -> /dev/usdm_drv But there are still stuck processes :-( This is with both patches included: for QAT and HAProxy. Regards, Marcin Deranek Thank you Marcin! Anyway it's was also a bug. Could you process a 'show fds' command on a stucked process adding the patch in attachement. I did apply this patch and all previous patches (QAT + HAProxy ssl_free_engine). This is what I got after 1st reload: show proc # 8025master 0 1 0d 00h03m25s # workers 31269 worker 1 0 0d 00h00m39s 31270 worker 2 0 0d 00h00m39s 31271 worker 3 0 0d 00h00m39s 31272 worker 4 0 0d 00h00m39s # old workers 9286worker [was: 1]1 0d 00h03m25s 9287worker [was: 2]1 0d 00h03m25s 9288worker [was: 3]1 0d 00h03m25s 9289worker [was: 4]1 0d 00h03m25s @!9286 show fd 13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x23eaae0 iocb=0x4877c0(mworker_accept_wrapper) tmask=0x1 umask=0x0 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e1ab0 iocb=0x4e1ab0(thread_sync_io_handler) tmask=0x umask=0x0 20 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1601b840 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 21 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x1f0ec4f0 iocb=0x4ce6e0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x22ad8630 1412 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bab1f30 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1413 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x247e5bc0 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1414 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x18883650 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1415 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x14476c10 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1416 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11a27850 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1418 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x12008230 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1419 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bb0a570 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1420 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11c94790 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1421 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1449e050 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1422 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1f00c150 
iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1423 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x15f40550 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1424 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x124b6340 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1425 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11fe4500 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1426 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11c70a60 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1427 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x12572540 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1428 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1249a420 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0 1430 : st=0x05(R:PrA W:p
Re: leak of handle to /dev/urandom since 1.8?
On Fri, May 03, 2019 at 10:49:54AM +, Robert Allen1 wrote: > For the sake of the list, the patch now looks like: > > +#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L) > + if (global.ssl_used_frontend || global.ssl_used_backend) > + /* close random device FDs */ > + RAND_keep_random_devices_open(0); > +#endif > > and requests a backport to 1.8 and 1.9 where we noticed this issue (and > which > include the re-exec for reload code, if I followed its history > thoroughly). > > Rob > I pushed the patch in master, thanks. -- William Lallemand
Re: [PATCH v2 1/2] MINOR: systemd: Use the variables from /etc/default/haproxy
On Mon, May 06, 2019 at 04:07:47PM +0200, William Lallemand wrote: > On Mon, May 06, 2019 at 02:20:32PM +0200, Vincent Bernat wrote: > > However, many people prefer /etc/default and /etc/sysconfig to systemd > > overrides. And for distribution, it enables a smoother transition. For > > Debian, we would still add the EnvironmentFile directive. You could > > still be compatible with both styles of distribution with: > > > > EnvironmentFile=-/etc/default/haproxy > > EnvironmentFile=-/etc/sysconfig/haproxy > > Oh that's right, I forgot that the - was checking if the file exists, looks > like > a good solution. > Just pushed in master the 2 previous patches + a patch which add /etc/sysconfig/haproxy. Thanks everyone. -- William Lallemand
Re: [External] Re: QAT intermittent healthcheck errors
On 5/7/19 1:24 PM, Marcin Deranek wrote: > Hi Emeric, > > On 5/7/19 11:44 AM, Emeric Brun wrote: >> Hi Marcin,>> As I use HAProxy 1.8 I had to adjust the patch (see >> attachment for end result). Unfortunately after applying the patch there is >> no change in behavior: we still leak /dev/usdm_drv descriptors and have >> "stuck" HAProxy instances after reload.. > Regards, >> >> Could you perform a test recompiling the usdm_drv and the engine with this >> patch, it applies on QAT 1.7 but I've no hardware to test this version here. >> >> It should fix the fd leak. > > It did fix fd leak: > > # ls -al /proc/2565/fd|fgrep dev > lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null > lrwx-- 1 root root 64 May 7 13:15 7 -> /dev/usdm_drv > > # systemctl reload haproxy.service > # ls -al /proc/2565/fd|fgrep dev > lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null > lrwx-- 1 root root 64 May 7 13:15 8 -> /dev/usdm_drv > > # systemctl reload haproxy.service > # ls -al /proc/2565/fd|fgrep dev > lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null > lrwx-- 1 root root 64 May 7 13:15 9 -> /dev/usdm_drv > > But there are still stuck processes :-( This is with both patches included: > for QAT and HAProxy. > Regards, > > Marcin Deranek Thank you Marcin! Anyway it's was also a bug. Could you process a 'show fds' command on a stucked process adding the patch in attachement. R, Emeric >From d0e095c2aa54f020de8fc50db867eff1ef73350e Mon Sep 17 00:00:00 2001 From: Emeric Brun Date: Fri, 19 Apr 2019 17:15:28 +0200 Subject: [PATCH] MINOR: ssl/cli: async fd io-handlers printable on show fd This patch exports the async fd iohandlers and make them printable doing a 'show fd' on cli. --- include/proto/ssl_sock.h | 4 src/cli.c| 9 + src/ssl_sock.c | 4 ++-- 3 files changed, 15 insertions(+), 2 deletions(-) diff --git a/include/proto/ssl_sock.h b/include/proto/ssl_sock.h index 62ebcb87..ce52fb74 100644 --- a/include/proto/ssl_sock.h +++ b/include/proto/ssl_sock.h @@ -85,6 +85,10 @@ SSL_CTX *ssl_sock_get_generated_cert(unsigned int key, struct bind_conf *bind_co int ssl_sock_set_generated_cert(SSL_CTX *ctx, unsigned int key, struct bind_conf *bind_conf); unsigned int ssl_sock_generated_cert_key(const void *data, size_t len); +#if (OPENSSL_VERSION_NUMBER >= 0x101fL) && !defined(OPENSSL_NO_ASYNC) +void ssl_async_fd_handler(int fd); +void ssl_async_fd_free(int fd); +#endif /* ssl shctx macro */ diff --git a/src/cli.c b/src/cli.c index 568ceba2..843c3d04 100644 --- a/src/cli.c +++ b/src/cli.c @@ -69,6 +69,9 @@ #include #include #include +#ifdef USE_OPENSSL +#include +#endif #define PAYLOAD_PATTERN "<<" @@ -998,6 +1001,12 @@ static int cli_io_handler_show_fd(struct appctx *appctx) (fdt.iocb == listener_accept) ? "listener_accept" : (fdt.iocb == poller_pipe_io_handler) ? "poller_pipe_io_handler" : (fdt.iocb == mworker_accept_wrapper) ? "mworker_accept_wrapper" : +#ifdef USE_OPENSSL +#if (OPENSSL_VERSION_NUMBER >= 0x101fL) && !defined(OPENSSL_NO_ASYNC) + (fdt.iocb == ssl_async_fd_free) ? "ssl_async_fd_free" : + (fdt.iocb == ssl_async_fd_handler) ? 
"ssl_async_fd_handler" : +#endif +#endif "unknown"); if (fdt.iocb == conn_fd_handler) { diff --git a/src/ssl_sock.c b/src/ssl_sock.c index 112520c8..58ae8a26 100644 --- a/src/ssl_sock.c +++ b/src/ssl_sock.c @@ -573,7 +573,7 @@ fail_get: /* * openssl async fd handler */ -static void ssl_async_fd_handler(int fd) +void ssl_async_fd_handler(int fd) { struct connection *conn = fdtab[fd].owner; @@ -594,7 +594,7 @@ static void ssl_async_fd_handler(int fd) /* * openssl async delayed SSL_free handler */ -static void ssl_async_fd_free(int fd) +void ssl_async_fd_free(int fd) { SSL *ssl = fdtab[fd].owner; OSSL_ASYNC_FD all_fd[32]; -- 2.17.1
Re: [External] Re: QAT intermittent healthcheck errors
Hi Emeric, On 5/7/19 11:44 AM, Emeric Brun wrote: Hi Marcin,>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for end result). Unfortunately after applying the patch there is no change in behavior: we still leak /dev/usdm_drv descriptors and have "stuck" HAProxy instances after reload.. Regards, Could you perform a test recompiling the usdm_drv and the engine with this patch, it applies on QAT 1.7 but I've no hardware to test this version here. It should fix the fd leak. It did fix fd leak: # ls -al /proc/2565/fd|fgrep dev lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null lrwx-- 1 root root 64 May 7 13:15 7 -> /dev/usdm_drv # systemctl reload haproxy.service # ls -al /proc/2565/fd|fgrep dev lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null lrwx-- 1 root root 64 May 7 13:15 8 -> /dev/usdm_drv # systemctl reload haproxy.service # ls -al /proc/2565/fd|fgrep dev lr-x-- 1 root root 64 May 7 13:15 0 -> /dev/null lrwx-- 1 root root 64 May 7 13:15 9 -> /dev/usdm_drv But there are still stuck processes :-( This is with both patches included: for QAT and HAProxy. Regards, Marcin Deranek
Re: QAT intermittent healthcheck errors
On 5/7/19 11:44 AM, Emeric Brun wrote: Could you perform a test recompiling the usdm_drv and the engine with this patch, it applies on QAT 1.7 but I've no hardware to test this version here. It should fix the fd leak. Will do and report back. Marcin Deranek
Re: QAT intermittent healthcheck errors
Hi Marcin,>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for end result). Unfortunately after applying the patch there is no change in behavior: we still leak /dev/usdm_drv descriptors and have "stuck" HAProxy instances after reload.. >>> Regards, >> >> Could you perform a test recompiling the usdm_drv and the engine with this patch, it applies on QAT 1.7 but I've no hardware to test this version here. It should fix the fd leak. R, Emeric diff -urN quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c --- quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c 2019-05-07 11:35:15.654202291 +0200 +++ quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c 2019-05-07 11:35:44.302292417 +0200 @@ -104,7 +104,7 @@ /* standard page size */ page_size = getpagesize(); -fd = qae_open("/proc/self/pagemap", O_RDONLY); +fd = qae_open("/proc/self/pagemap", O_RDONLY|O_CLOEXEC); if (fd < 0) { return 0; diff -urN quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c --- quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c 2019-03-15 15:23:43.0 +0100 +++ quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c 2019-05-07 11:24:08.755921241 +0200 @@ -745,7 +745,7 @@ if (fd > 0) close(fd); -fd = qae_open(QAE_MEM, O_RDWR); +fd = qae_open(QAE_MEM, O_RDWR|O_CLOEXEC); if (fd < 0) { CMD_ERROR("%s:%d Unable to initialize memory file handle %s \n",
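To illustrate why the one-flag change above matters: a descriptor opened without O_CLOEXEC survives the execve() performed on reload, so every reload of the master leaves one more /dev/usdm_drv handle open in the new image. The stand-alone sketch below (illustration only, using /dev/null in place of the QAT device) shows the difference:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc == 3) {
        /* re-executed image: check which inherited fds survived the exec */
        int leaked = atoi(argv[1]), safe = atoi(argv[2]);
        printf("after exec: fd %d (no O_CLOEXEC) open? %s, fd %d (O_CLOEXEC) open? %s\n",
               leaked, fcntl(leaked, F_GETFD) != -1 ? "yes" : "no",
               safe, fcntl(safe, F_GETFD) != -1 ? "yes" : "no");
        return 0;
    }

    /* kept across exec: no O_CLOEXEC, so the re-executed process inherits it */
    int leaked = open("/dev/null", O_RDWR);
    /* closed by the kernel the moment execve() succeeds */
    int safe = open("/dev/null", O_RDWR | O_CLOEXEC);

    char a1[16], a2[16];
    snprintf(a1, sizeof(a1), "%d", leaked);
    snprintf(a2, sizeof(a2), "%d", safe);
    char *args[] = { argv[0], a1, a2, NULL };

    /* re-exec ourselves once, the way a reload re-executes the master */
    execv("/proc/self/exe", args);
    perror("execv");
    return 1;
}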