Re: Clients occasionally see truncated responses
On Wed, Mar 31, 2021 at 09:55:15AM -0700, Nathan Konopinski wrote:
> Thanks Willy, that is what I'm seeing, capture attached. Clients only send
> GETs, no POSTs. What are possible workarounds? Is there a way to ignore the
> client close and keep the connection open longer?

That's not what happens here: I'm not seeing any request. There's a TLS
exchange and the client closes immediately after the handshake, without
sending a request. There might be something the client doesn't like, such
as a cipher or something like this. What's surprising is that the client
doesn't close after a response but after sending its final handshake.

Out of curiosity, are you sure this is a valid client that produced this
trace? Maybe it's just a random scanner that sent a request to your site?

> I'm wondering if nginx is
> doing something like that since we don't see issues with it.

It's difficult to say for now, especially since this trace doesn't show a
request but a spontaneous close. Have you tried to temporarily disable
your ssl-default-bind-options directive? Maybe the client doesn't like
the no-tls-tickets for example?

Willy
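For reference, the test suggested above amounts to temporarily neutralizing the directive. A hypothetical snippet (the option list shown is only an example; use whatever your configuration actually sets):

```
global
    # Temporarily comment this out (or remove individual options such as
    # no-tls-tickets) to check whether a client dislikes one of them.
    # The options listed here are illustrative, not a recommendation:
    ssl-default-bind-options no-sslv3 no-tls-tickets
```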
Re: Clients occasionally see truncated responses
Hi Nathan,

On Tue, Mar 30, 2021 at 09:21:30AM -0700, Nathan Konopinski wrote:
> Sometimes clients (clients are only http 1.1 and use connection: close) are
> reporting a body length of ~4000 is less than the content length of ~14000.
> The issue does not appear when using nginx as an LB and I've verified
> complete responses are being sent from the backends for the requests
> clients report errors on.
>
> It's not clear why a portion of the clients aren't receiving the entire
> response. I'm unable to replicate the issue with curl. I have a vanilla
> config using https, prometheus metrics, and a h1-case-adjust-bogus-client
> option to adjust a couple headers.
>
> Has anyone come across similar issues? I see an option for request
> buffering but nothing for response buffering. Are there options I can
> adjust that could be related to this type of issue?

No, it's not expected at all and should really never happen. One option
could have caused this, "option nolinger", but you don't have it and your
config is really clean and straightforward. Could you take a capture of
the communications between the clients and haproxy?

The fact that you're using close opens the way for a subtle issue that
affects certain old clients with POST requests. Some of them send POST
requests with a body and, for no particular reason, after half a second
to a second, emit a CRLF that is not part of the current body and that
can even arrive after the response. If haproxy has already sent the
response back (and 14kB perfectly fit in a single buffer, so that sounds
plausible) and closed (since there's the connection: close), and the CRLF
from the client arrives *after* the close, then the TCP stack will reset
the connection and send a TCP RST back. First, this causes pending data
to be dropped. Second, when the client receives the RST, it can also drop
some of its previously received but unread data.

You don't necessarily need to decrypt HTTPS to detect this.
Simply taking a network capture, looking for RSTs, and checking whether
some non-empty TCP segments flow from the client to haproxy just before
the RST would already be an indication. What's nasty if you have to deal
with this is that it's totally timing-dependent, and that possible
workarounds are just that: workarounds.

Regards,
Willy
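The sequence Willy describes can be illustrated outside haproxy with a small Python sketch (this models the generic TCP behavior, not haproxy's code): a late CRLF hitting an already-closed connection draws an RST, and the sender's next write then fails with a reset.

```python
import socket
import time

# Server side: accept one connection and close it right away, the way a
# proxy does after sending a "connection: close" response.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()
conn.close()                 # FIN goes out to the client

time.sleep(0.2)
cli.sendall(b"\r\n")         # the stray CRLF arrives after the close;
time.sleep(0.2)              # the peer's TCP stack answers with an RST

try:
    cli.sendall(b"x")        # writing on a reset connection now fails
    result = "no reset"
except (ConnectionResetError, BrokenPipeError):
    result = "connection reset"
print(result)

cli.close()
srv.close()
```

In a real capture, the same pattern shows up as a non-empty segment from the client followed immediately by an RST coming back from haproxy's side.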
[ANNOUNCE] haproxy-2.2.12
Hi,

HAProxy 2.2.12 was released on 2021/03/31. It added 29 new commits after
version 2.2.11. This makes 2.2.12 catch up with the fixes that went into
2.3.9:

- One issue was a regression of the rate counters causing those spanning
  over a period (like in stick-tables) to increase forever, consecutive to
  a fix in 2.2.11 to prevent them from being randomly reset every second.

- A rare issue causing old processes to abort on reload due to a deadlock
  between the listeners and the file descriptors was also addressed. This
  one was unveiled in 2.2.10 and was not visible before due to another bug!

- In the unlikely event that the watchdog would trigger within Lua code
  (most likely caused by threads waiting on the Lua lock), it was sometimes
  possible to deadlock inside the libc on its own malloc() lock when trying
  to dump the Lua backtrace. This was addressed by using the home-grown
  backtrace function instead, which doesn't require allocations.

- Processes built with DEBUG_UAF could deadlock when doing this under
  thread isolation.

- The fix for too-lax hdr_ip() parsing was integrated (it could incorrectly
  return only the parsable part of an address if the sender sent garbage).

- The H1 shutdown code was made idempotent (as it ought to be). Only a
  single user faces some crashes on this one; it's very strange, it
  indicates that a number of conditions must be met to trigger it.

- The SSL fixes for "add ssl crt-list" making inconsistent use of FS
  accesses at run time vs boot time were integrated.

- The down-going-up server state transition on the stats page was
  mistakenly reported in the same color as up-going-down.

- unix-bind-prefix was incorrectly applied to the master socket.
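As an aside on the hdr_ip() item above: strict parsing means rejecting a value with trailing characters instead of returning its parsable prefix. A language-neutral sketch of the behavior (illustrative only, not haproxy's implementation):

```python
import ipaddress

def parse_header_ip(value):
    """Return an IP address object, or None if the value is not exactly
    an address: trailing garbage is rejected, not silently truncated."""
    try:
        return ipaddress.ip_address(value.strip())
    except ValueError:
        return None

print(parse_header_ip("192.168.1.10"))         # a valid address parses
print(parse_header_ip("192.168.1.10garbage"))  # a lax parser would have
                                               # returned 192.168.1.10
```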
And among the recent ones that were merged into 2.3-maint after 2.3.9:

- the fix for the silent-drop fallback in IPv6 was merged (the TTL is
  IPV6_UNICAST_HOPS in this case)

- the update on the CLI of the default SSL certificate used not to work
  correctly, as the previous one was not removed, resulting in random
  behavior, namely on the SNI.

This time I hope that all the recent mess experienced since 2.2.10 was
properly addressed. Those who faced DNS issues when upgrading from 2.2.9
to 2.2.10, or rate counter issues from 2.2.10 to 2.2.11, and who possibly
rolled back to 2.2.9, are strongly encouraged to try again.

Please find the usual URLs below :
   Site index       : http://www.haproxy.org/
   Discourse        : http://discourse.haproxy.org/
   Slack channel    : https://slack.haproxy.org/
   Issue tracker    : https://github.com/haproxy/haproxy/issues
   Wiki             : https://github.com/haproxy/wiki/wiki
   Sources          : http://www.haproxy.org/download/2.2/src/
   Git repository   : http://git.haproxy.org/git/haproxy-2.2.git/
   Git Web browsing : http://git.haproxy.org/?p=haproxy-2.2.git
   Changelog        : http://www.haproxy.org/download/2.2/src/CHANGELOG
   Cyril's HTML doc : http://cbonte.github.io/haproxy-dconv/

Willy
---
Complete changelog :
Christopher Faulet (7):
      MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua
      BUG/MEDIUM: debug/lua: Don't dump the lua stack if not dumpable
      MINOR: lua: Slightly improve function dumping the lua traceback
      BUG/MEDIUM: debug/lua: Use internal hlua function to dump the lua traceback
      BUG/MEDIUM: lua: Always init the lua stack before referencing the context
      BUG/MEDIUM: thread: Fix a deadlock if an isolated thread is marked as harmless
      BUG/MINOR: payload: Wait for more data if buffer is empty in payload/payload_lv

Eric Salama (1):
      MINOR/BUG: mworker/cli: do not use the unix_bind prefix for the master CLI socket

Florian Apolloner (1):
      BUG/MINOR: stats: Apply proper styles in HTML status page.
Ilya Shipitsin (1):
      BUILD: ssl: guard ecdh functions with SSL_CTX_set_tmp_ecdh macro

Olivier Houchard (1):
      BUG/MEDIUM: fd: Take the fd_mig_lock when closing if no DWCAS is available.

Remi Tricot-Le Breton (4):
      BUG/MINOR: ssl: Prevent disk access when using "add ssl crt-list"
      BUG/MINOR: ssl: Fix update of default certificate
      BUG/MINOR: ssl: Prevent removal of crt-list line if the instance is a default one
      BUG/MINOR: ssl: Add missing free on SSL_CTX in ckch_inst_free

Willy Tarreau (14):
      MINOR: time: also provide a global, monotonic global_now_ms timer
      BUG/MEDIUM: freq_ctr/threads: use the global_now_ms variable
      MINOR: fd: make fd_clr_running() return the remaining running mask
      MINOR: fd: remove the unneeded running bit from fd_insert()
      BUG/MEDIUM: fd: do not wait on FD removal in fd_delete()
      CLEANUP: fd: remove unused fd_set_running_excl()
      MINOR: tools: make url2ipv4 return the exact number of bytes parsed
      BUG/MINOR: http_fetch: make hdr_ip() reject trailing characters
      BUG/MEDIUM: mux-h1: make
Re: [ANNOUNCE] haproxy-2.3.9
On Wed, Mar 31, 2021 at 02:29:40PM +0200, Vincent Bernat wrote:
> ❦ 31 mars 2021 12:46 +02, Willy Tarreau:
>
> > On the kernel Greg solved all this by issuing all versions very
> > frequently: as long as you produce updates faster than users are
> > willing to deploy them, they can choose what to do. It just requires
> > a bandwidth that we don't have :-/ Some weeks several of us work full
> > time on backports and tests! Right now we've reached a point where
> > backports can prevent us from working on mainline, and where this lack
> > of time increases the risk of regressions, and the regressions require
> > more backport time.
>
> Wouldn't this mean there are too many versions in parallel?

It cannot be summed up this easily. Normally, old versions are not
released often, so they don't cost much. But not releasing them often
complicates the backports and their testing, so it's still better to try
to feed them along with the other ones. However, releasing them in
parallel to the other ones makes them more susceptible to getting stupid
issues like the last build failure with libmusl. But not releasing them
wouldn't change much, given that build failures in certain environments
are only detected once the release sends the signal that it's time to
update :-/

With this said, while the adoption of non-LTS versions has added one to
two versions to the series, it has significantly reduced the pain of
certain backports, precisely because it resulted in splitting the
population of users. So at the cost of ~1 more version in the pipe, we
get more detailed reports from users who are more accustomed to enabling
core dumps, firing gdb, applying patches etc., which reduces the time
spent on bugs and increases the confidence in fixes that get backported.
So I'd say that it remains a very good investment. However, I wanted to
make sure we shorten the non-LTS versions' life to limit the in-field
fragmentation.
And this works extremely well (I'm very grateful to our users for this,
and I suspect that the status banner in the executable reminding about
EOL helps). We have probably not seen a single 2.1 report in the issues
over the last 3-4 months. And I expect that 6 months after 2.4 is
released, we won't read about 2.3 anymore. Also, if you dig into the
issue tracker, you'll see a noticeable number of users who accept to run
some tests on 2.3 to verify whether it fixes an issue they face in 2.2.
We're usually not asking for an upgrade, just a test on a very close
version. This flexibility is very important as well.

So the number of parallel versions is one aspect of the problem, but it's
also an important part of the solution. I hope we can continue to
maintain short lives for non-LTS versions, but at the same time it must
remain a win-win: if we get useful reports on one version that are valid
for other ones as well, I'm fine with extending it a little bit as we did
for 1.9; there's no reason the ones making the most efforts should be the
first ones punished.

Overall the real issue remains the number of bugs we introduce in the
code, and that is unavoidable when working on lower layers where good
test coverage is extremely difficult to achieve. Making smaller and more
detailed patches is mandatory. Continuing to add reg-tests definitely
helps a lot. We've added more than one reg-test per week since 2.3,
that's definitely not bad at all, but this effort must continue! The CI
reports few false positives now and the situation has tremendously
improved over the last 2 years. So with better code we can hope for fewer
bugs, fewer fixes, fewer backports, hence fewer risks of regressions.

> > I think that the real problem arrives when a version becomes generally
> > available in distros. And distro users are often the ones with the least
> > autonomy when it comes to rolling back. When you build from sources,
> > you're more at ease.
> > Thus probably that a nice solution would be to
> > add an idle period between a stable release and its appearance in
> > distros so that it really gets some initial deployment before becoming
> > generally available. And I know that some users complain when they do
> > not immediately see their binary package, but that's something we can
> > easily explain and document. We could even indicate a level of confidence
> > in the announce messages. It has the merit of respecting the principle
> > of least surprise for everyone in the chain, including those like you
> > and me involved in the release cycle and who did not necessarily plan
> > to stop all activities to work on yet-another-release because the
> > long-awaited fix-of-the-month broke something and its own fix broke
> > something else.
>
> We can do that. In the future, I may even tackle all the problems at
> once: providing easy access to old versions and have two versions of
> each repository: one with new versions immediately available and one
> with a semi-fixed delay.

Ah, I really like this! Your packages definitely are the most exposed
ones so this could very
[ANNOUNCE] haproxy-1.7.14
Hi,

HAProxy 1.7.14 was released on 2021/03/31. It added 7 new commits after
version 1.7.13, all of which are minor fixes.

The main one addresses a build regression when libmusl is used. The
other ones are:

- a fix for the too-lax hdr_ip() parsing
- a fix for the IPv6 fallback of a failed silent-drop action
- a fix for a parsing issue in SPOE that was fixed by accident in 1.8
  but which could result in a desynchronized stream on framing error.

Unless you think you're affected by any of them, there's no need to
upgrade if you already deployed 1.7.13 successfully.

Please find the usual URLs below :
   Site index       : http://www.haproxy.org/
   Discourse        : http://discourse.haproxy.org/
   Slack channel    : https://slack.haproxy.org/
   Issue tracker    : https://github.com/haproxy/haproxy/issues
   Wiki             : https://github.com/haproxy/wiki/wiki
   Sources          : http://www.haproxy.org/download/1.7/src/
   Git repository   : http://git.haproxy.org/git/haproxy-1.7.git/
   Git Web browsing : http://git.haproxy.org/?p=haproxy-1.7.git
   Changelog        : http://www.haproxy.org/download/1.7/src/CHANGELOG
   Cyril's HTML doc : http://cbonte.github.io/haproxy-dconv/

Willy
---
Complete changelog :
Willy Tarreau (7):
      BUILD: ebtree: fix build on libmusl after recent introduction of eb_memcmp()
      MINOR: tools: make url2ipv4 return the exact number of bytes parsed
      BUG/MINOR: http_fetch: make hdr_ip() reject trailing characters
      BUG/MINOR: http_fetch: make hdr_ip() resistant to empty fields
      BUG/MINOR: tcp: fix silent-drop workaround for IPv6
      BUILD: tcp: use IPPROTO_IPV6 instead of SOL_IPV6 on FreeBSD/MacOS
      BUG/MINOR: spoe: fix handling of truncated frame
---
Re: [ANNOUNCE] haproxy-2.3.9
❦ 31 mars 2021 12:46 +02, Willy Tarreau:

> On the kernel Greg solved all this by issuing all versions very
> frequently: as long as you produce updates faster than users are
> willing to deploy them, they can choose what to do. It just requires
> a bandwidth that we don't have :-/ Some weeks several of us work full
> time on backports and tests! Right now we've reached a point where
> backports can prevent us from working on mainline, and where this lack
> of time increases the risk of regressions, and the regressions require
> more backport time.

Wouldn't this mean there are too many versions in parallel?

> I think that the real problem arrives when a version becomes generally
> available in distros. And distro users are often the ones with the least
> autonomy when it comes to rolling back. When you build from sources,
> you're more at ease. Thus probably that a nice solution would be to
> add an idle period between a stable release and its appearance in
> distros so that it really gets some initial deployment before becoming
> generally available. And I know that some users complain when they do
> not immediately see their binary package, but that's something we can
> easily explain and document. We could even indicate a level of confidence
> in the announce messages. It has the merit of respecting the principle
> of least surprise for everyone in the chain, including those like you
> and me involved in the release cycle and who did not necessarily plan
> to stop all activities to work on yet-another-release because the
> long-awaited fix-of-the-month broke something and its own fix broke
> something else.

We can do that. In the future, I may even tackle all the problems at
once: providing easy access to old versions and have two versions of
each repository: one with new versions immediately available and one
with a semi-fixed delay.
--
April 1

This is the day upon which we are reminded of what we are on the other
three hundred and sixty-four.
		-- Mark Twain, "Pudd'nhead Wilson's Calendar"
Re: [ANNOUNCE] haproxy-2.3.9
Hello,

Just giving my feedback on part of the story:

On 31 Mar 12:46, Willy Tarreau wrote:
> On the kernel Greg solved all this by issuing all versions very
> frequently: as long as you produce updates faster than users are
> willing to deploy them, they can choose what to do. It just requires
> a bandwidth that we don't have :-/ Some weeks several of us work full
> time on backports and tests! Right now we've reached a point where
> backports can prevent us from working on mainline, and where this lack
> of time increases the risk of regressions, and the regressions require
> more backport time.

I just want to say that I greatly appreciate the backport policy of
HAProxy. I often see really small bugs or even small improvements being
backported, where I personally would have been happy with them just
fixed on devel. This is greatly appreciated!
--
 (o-    Julien Pivotto
 //\    Open-Source Consultant
 V_/_   Inuits - https://www.inuits.eu
Re: [2.2.9] 100% CPU usage
I forgot to mention that the backtrace is from 2.2.11 built from
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=601704962bc9d82b3b1cc97d90d2763db0ae4479

On Wed, 31 Mar 2021 at 13:28, Maciej Zdeb wrote:
> Hi,
>
> Well it's a bit better situation than earlier because only one thread is
> looping forever and the rest is working properly. I've tried to verify
> where exactly the thread looped but doing "n" in gdb fixed the problem :(
> After quitting gdb session all threads were idle. Before I started gdb it
> looped about 3h not serving any traffic, because I've put it into
> maintenance as soon as I observed abnormal cpu usage.
>
> [full gdb backtrace quoted below]
> [...]
Re: [2.2.9] 100% CPU usage
Hi,

Well it's a bit better situation than earlier because only one thread is
looping forever and the rest is working properly. I've tried to verify
where exactly the thread looped but doing "n" in gdb fixed the problem :(
After quitting the gdb session all threads were idle. Before I started
gdb it looped about 3h not serving any traffic, because I've put it into
maintenance as soon as I observed abnormal cpu usage.

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7f2cf0df6a47 in epoll_wait (epfd=3, events=0x55d7aaa04920, maxevents=200,
    timeout=timeout@entry=39) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) thread 11
[Switching to thread 11 (Thread 0x7f2c3c53d700 (LWP 20608))]
#0  trace (msg=..., cb=<optimized out>, a4=<optimized out>,
    a3=<optimized out>, a2=<optimized out>, a1=<optimized out>,
    func=<optimized out>, where=..., src=<optimized out>,
    mask=<optimized out>, level=<optimized out>)
    at include/haproxy/trace.h:149
149             if (unlikely(src->state != TRACE_STATE_STOPPED))
(gdb) bt
#0  trace (msg=..., cb=<optimized out>, a4=<optimized out>,
    a3=<optimized out>, a2=<optimized out>, a1=<optimized out>,
    func=<optimized out>, where=..., src=<optimized out>,
    mask=<optimized out>, level=<optimized out>)
    at include/haproxy/trace.h:149
#1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
    head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
#2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at src/mux_h2.c:3329
#3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
#4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3624
#5  0x55d7a4276678 in h2_io_cb (t=<optimized out>, ctx=0x7f2c18dca740,
    status=<optimized out>) at src/mux_h2.c:3583
#6  0x55d7a4381f62 in run_tasks_from_lists (budgets=budgets@entry=0x7f2c3c51a35c)
    at src/task.c:454
#7  0x55d7a438282d in process_runnable_tasks () at src/task.c:679
#8  0x55d7a4339467 in run_poll_loop () at src/haproxy.c:2942
#9  0x55d7a4339819 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3107
#10 0x7f2cf1e606db in start_thread (arg=0x7f2c3c53d700) at pthread_create.c:463
#11 0x7f2cf0df671f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) bt full
#0  trace (msg=..., cb=<optimized out>, a4=<optimized out>,
    a3=<optimized out>, a2=<optimized out>, a1=<optimized out>,
    func=<optimized out>, where=..., src=<optimized out>,
    mask=<optimized out>, level=<optimized out>)
    at include/haproxy/trace.h:149
No locals.
#1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
    head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
        h2s = <optimized out>
        h2s_back = <optimized out>
        __FUNCTION__ = "h2_resume_each_sending_h2s"
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
#2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at src/mux_h2.c:3329
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
#3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
        flags = <optimized out>
        released = <optimized out>
        buf = <optimized out>
        conn = 0x7f2bf658b8d0
        done = 0
        sent = 0
        __FUNCTION__ = "h2_send"
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
---Type <return> to continue, or q <return> to quit---
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
#4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3624
        conn = 0x7f2bf658b8d0
        __FUNCTION__ = "h2_process"
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
#5  0x55d7a4276678 in h2_io_cb (t=<optimized out>, ctx=0x7f2c18dca740,
    status=<optimized out>) at src/mux_h2.c:3583
        conn = 0x7f2bf658b8d0
        tl = <optimized out>
        conn_in_list = 0
        h2c = 0x7f2c18dca740
        ret = <optimized out>
        __FUNCTION__ = "h2_io_cb"
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
#6  0x55d7a4381f62 in run_tasks_from_lists (budgets=budgets@entry=0x7f2c3c51a35c)
    at src/task.c:454
        process = <optimized out>
        tl_queues = <optimized out>
        t = 0x7f2c0d3fa1c0
        budget_mask = 7 '\a'
        done = <optimized out>
        queue = <optimized out>
        state = <optimized out>
---Type <return> to continue, or q <return> to quit---
        ctx = <optimized out>
        __ret = <optimized out>
        __n = <optimized out>
        __p = <optimized out>
#7  0x55d7a438282d in process_runnable_tasks () at src/task.c:679
        tt = 0x55d7a47a6d00
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 0, 141}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = 3
        max_processed = <optimized out>
#8  0x55d7a4339467 in run_poll_loop () at src/haproxy.c:2942
        next = <optimized out>
        wake = <optimized out>
#9  0x55d7a4339819 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3107
        ptaf = <optimized out>
        ptif = <optimized out>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust
Re: [ANNOUNCE] haproxy-2.3.9
Hi Vincent!

On Wed, Mar 31, 2021 at 12:11:32PM +0200, Vincent Bernat wrote:
> It's a bit annoying that fixes reach a LTS version before the non-LTS
> one. The upgrade scenario is one annoyance, but if there is a
> regression, you also impact far more users.

I know, this is also why I'm quite a bit irritated by this.

> You could tag releases in
> git (with -preX if needed) when preparing the releases and then issue
> the release with a few days apart.

In practice the tag serves no purpose, but that leads to the same
principle as leaving some fixes pending in the -next branch.

> Users of older versions will have
> less frequent releases in case regressions are spotted, but I think
> that's the general expectation: if you are running older releases it's
> because you don't have time to upgrade and it's good enough for you.

I definitely agree with this; that's also how I'm using LTS versions of
various software, and why we try to put more care on LTS versions here.

> For example:
> - 2.3, monthly release or when there is a big regression
> - 2.2, 3 days after 2.3
> - 2.0, 3 days after 2.2, skip one out of two releases
> - 1.8, 3 days after 2.0, skip one out of four releases
>
> So, you have a 2.3.9. At the same time, you tag 2.2.12-pre1 (to be
> released in 3 working days if everything is fine) and you skip 2.0
> and 1.8 this time because they were released to match 2.3.8. Next time,
> you'll have a 2.0.22-pre1 but no 1.8.30-pre1 yet.

This will not work. I tried this when I was maintaining kernels, and the
reality is that users who stumble on a bug want their fix. And worse,
their stability expectations when running on older releases make them
even more impatient, because 1) older releases *are* expected to be
reliable, 2) they're deployed on sensitive machines, where the business
is, and 3) it's expected there are very few pending fixes, so for them
there's no justification for delaying the fix they're waiting for.
> If for some reason, there is an important regression in 2.3.9 you want
> to address, you release a 2.3.10 and a 2.2.12-pre2, still no 2.0.22-pre1
> nor 1.8.30-pre1. Hopefully, no more regressions spotted, you tag 2.2.12
> on top of 2.2.12-pre2 and issue a release.

The thing is, the -pre releases will just be tags of no use at all.
Maintenance branches collect fixes all the time, and either you're on a
release or you're following -git. And quite frankly, most stable users
are on a point release because by definition that's what they need.

What I'd like to do is to maintain a small delay between versions, but
there is no need to maintain particularly long delays past the next LTS.
What needs to be particularly protected are the LTS versions as a whole.
There are more users affected by 2.2 breakage than by 2.0 breakage, and
the risk is the same for each of them. So instead we should make sure
that all versions starting from the first LTS past the latest branch are
slightly delayed, but there's no need to further enforce a delay between
them.

What this means is that when issuing a 2.3 release, we can wait a bit
before issuing the 2.2, and then once 2.2 is emitted, most of the
potential damage is already done, so there's no reason for keeping older
ones on hold, as that can only force their users to live with known
bugs. And when the latest branch is an LTS (like in a few months once
2.4 is out), we'd emit 2.4 and 2.3 together, then wait a bit and emit
2.2 and the other ones. This maintains the principle that the LTS before
the latest branch should be very stable.

With this said, there remains the problem of late fixes that I mentioned
and that are discovered during this grace period. The tricky ones can
wait in the -next branch, but the other ones should be integrated,
otherwise the nasty effect is that users think "let's not upgrade to
this one but wait for the next one, so that I don't have to schedule
another update later and I collect all fixes at once".
But if we integrate sensitive fixes in 2.2 that were not yet in a
released 2.3, those upgrading will face some breakage.

On the kernel Greg solved all this by issuing all versions very
frequently: as long as you produce updates faster than users are willing
to deploy them, they can choose what to do. It just requires a bandwidth
that we don't have :-/ Some weeks several of us work full time on
backports and tests! Right now we've reached a point where backports can
prevent us from working on mainline, and where this lack of time
increases the risk of regressions, and the regressions require more
backport time.

I think that the real problem arrives when a version becomes generally
available in distros. And distro users are often the ones with the least
autonomy when it comes to rolling back. When you build from sources,
you're more at ease. Thus probably a nice solution would be to add an
idle period between a stable release and its appearance in distros, so
that it really gets some initial deployment before becoming generally
Re: [ANNOUNCE] haproxy-2.3.9
❦ 31 mars 2021 10:35 +02, Willy Tarreau:

>> Thanks Willy for the quick update. That's a good example to avoid
>> pushing stable versions at the same time, so we have opportunities to
>> find those regressions.
>
> I know and we're trying to separate them but it considerably increases the
> required effort. In addition there is a nasty effect resulting from shifted
> releases, which is that it ultimately results in older releases possibly
> having more recent fixes than recent ones. And it will happen again with
> 2.2.12 which I hope to issue today. It will contain the small fix for the
> silent-drop issue (which is already in 2.3 of course) but was merged after
> 2.3.9. The reporter of the issue is on 2.2, it would not be fair to him to
> release another 2.2 without it (or we'd fall into a bureaucratic process
> that doesn't serve users anymore). So 2.2.12 will contain this fix. But
> if the person finally decides to upgrade to 2.3.9 a week or two later, she
> may face the bug again. It's not a dramatic one so that's acceptable, but
> that shows the difficulties of the process.

It's a bit annoying that fixes reach a LTS version before the non-LTS
one. The upgrade scenario is one annoyance, but if there is a
regression, you also impact far more users. You could tag releases in
git (with -preX if needed) when preparing the releases and then issue
the release a few days apart. Users of older versions will have less
frequent releases in case regressions are spotted, but I think that's
the general expectation: if you are running older releases, it's because
you don't have time to upgrade and it's good enough for you.

For example:
- 2.3, monthly release or when there is a big regression
- 2.2, 3 days after 2.3
- 2.0, 3 days after 2.2, skip one out of two releases
- 1.8, 3 days after 2.0, skip one out of four releases

So, you have a 2.3.9.
At the same time, you tag 2.2.12-pre1 (to be released in 3 working days if
everything is fine) and you skip 2.0 and 1.8 this time because they were
released to match 2.3.8. Next time, you'll have a 2.0.22-pre1 but no
1.8.30-pre1 yet.

If for some reason there is an important regression in 2.3.9 you want to
address, you release a 2.3.10 and a 2.2.12-pre2, still no 2.0.22-pre1 nor
1.8.30-pre1. Hopefully, with no more regressions spotted, you tag 2.2.12 on
top of 2.2.12-pre2 and issue a release.

-- 
He hath eaten me out of house and home.
		-- William Shakespeare, "Henry IV"
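The cadence proposed above can be sketched as a tiny simulation. This is
purely illustrative: the function name and the monthly cycle numbering are
made up here, and it reads "skip one out of two" as 2.0 releasing every other
cycle and 1.8 as releasing one cycle in four, which matches the example where
both are skipped this time and only 2.0 comes back next time.

```python
# Hypothetical sketch of the proposed staggered release cadence.
# Cycles are numbered 1, 2, 3, ... (roughly monthly). 2.3 releases
# every cycle; 2.2 tags -pre1 and follows 3 days later; 2.0 joins
# every other cycle; 1.8 joins one cycle in four.
def branches_released(cycle: int) -> list[str]:
    branches = ["2.3", "2.2"]      # 2.2 releases 3 days after 2.3
    if cycle % 2 == 0:
        branches.append("2.0")     # every other cycle
    if cycle % 4 == 0:
        branches.append("1.8")     # one cycle in four
    return branches

for cycle in range(1, 5):
    print(cycle, branches_released(cycle))
```

Under this reading, a regression spotted during the 3-day -pre window only
ever ships to the newest branch, and the older branches pick up the fix
before their own releases go out.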
Re: [ANNOUNCE] haproxy-2.3.9
On Wed, Mar 31, 2021 at 10:17:35AM +0200, William Dauchy wrote:
> On Tue, Mar 30, 2021 at 6:59 PM Willy Tarreau wrote:
> > HAProxy 2.3.9 was released on 2021/03/30. It added 5 new commits
> > after version 2.3.8.
> >
> > This essentially fixes the rate counters issue that popped up in 2.3.8
> > after the previous fix for the rate counters already.
> >
> > What happened is that the internal time in milliseconds wraps every 49.7
> > days and that the new global counter used to make sure rate counters are
> > now stable across threads starts at zero and is initialized when older
> > than the current thread's current date. It just happens that the wrapping
> > happened a few hours ago at "Mon Mar 29 23:59:46 CEST 2021" exactly and
> > that any process started since this date and for the next 24 days doesn't
> > validate this condition anymore, hence doesn't rotate its rate counters
> > anymore.
>
> Thanks Willy for the quick update. That's a good example to avoid
> pushing stable versions at the same time, so we have opportunities to
> find those regressions.

I know and we're trying to separate them but it considerably increases the
required effort. In addition there is a nasty effect resulting from shifted
releases, which is that it ultimately results in older releases possibly
having more recent fixes than recent ones. And it will happen again with
2.2.12, which I hope to issue today. It will contain the small fix for the
silent-drop issue (which is already in 2.3 of course) but was merged after
2.3.9. The reporter of the issue is on 2.2; it would not be fair to him to
release another 2.2 without it (or we'd fall into a bureaucratic process that
doesn't serve users anymore). So 2.2.12 will contain this fix. But if the
person finally decides to upgrade to 2.3.9 a week or two later, she may face
the bug again. It's not a dramatic one so that's acceptable, but that shows
the difficulties of the process.
In an ideal world, there would be lots of tests in production on stable
versions. The reality is that nobody (me included) is interested in upgrading
prod servers running flawlessly just to confirm there's no nasty surprise
with the forthcoming release, because either there's a bug and you prefer
someone else to spot it first, or there's no problem and you'll upgrade once
the final version is ready.

With this option off the table, it's clear that the only option that remains
is the shifted versions. But here it would not even have helped because the
code worked on Monday and broke on Tuesday!

What I think we can try to do (and we discussed this with the other
co-maintainers) is to push the patches but not immediately emit the releases
(so that the backport work is still factored), and to keep the tricky patches
in the -next branch to prevent them from being backported too far too fast
(it will save us from the risk of missing them if they're not merged).

Overall the most important solution is to release often enough so that in
case of a regression that affects some users, they can stay on the previous
version a little bit longer without having to endure too many bugs. And if we
don't have too many fixes per release, it's easy to emit yet another small
one immediately after to fix a single regression. But over the last week
we've been flooded on multiple channels by many reports, and then it becomes
really hard to focus on a single issue at once for a release :-/

Cheers,
Willy
Re: [ANNOUNCE] haproxy-2.3.9
On Tue, Mar 30, 2021 at 6:59 PM Willy Tarreau wrote:
> HAProxy 2.3.9 was released on 2021/03/30. It added 5 new commits
> after version 2.3.8.
>
> This essentially fixes the rate counters issue that popped up in 2.3.8
> after the previous fix for the rate counters already.
>
> What happened is that the internal time in milliseconds wraps every 49.7
> days and that the new global counter used to make sure rate counters are
> now stable across threads starts at zero and is initialized when older
> than the current thread's current date. It just happens that the wrapping
> happened a few hours ago at "Mon Mar 29 23:59:46 CEST 2021" exactly and
> that any process started since this date and for the next 24 days doesn't
> validate this condition anymore, hence doesn't rotate its rate counters
> anymore.

Thanks Willy for the quick update. That's a good example to avoid
pushing stable versions at the same time, so we have opportunities to
find those regressions.

-- 
William
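For reference, the 49.7-day figure quoted above is just 2^32 milliseconds: a
32-bit millisecond tick wraps roughly every 49.71 days. The sketch below
(illustrative only, not haproxy's actual code; the helper name is made up)
shows why a naive "is newer" comparison on such a tick stops validating once
the counter wraps, which is the class of condition described in the announce.

```python
MS_WRAP = 2 ** 32                      # 32-bit millisecond tick space

# 2^32 ms is roughly 49.7 days, the wrap period quoted above
wrap_days = MS_WRAP / (1000 * 60 * 60 * 24)
print(round(wrap_days, 1))             # -> 49.7

# A naive "is newer" comparison breaks across the wrap:
before = MS_WRAP - 1000                # 1 second before the wrap
after = (before + 5000) % MS_WRAP      # 5 seconds later, tick has wrapped
print(after > before)                  # -> False: the newer tick looks older

# A wrap-safe comparison looks at the signed distance modulo 2^32 instead:
def tick_is_after(a: int, b: int) -> bool:
    """True if tick a is logically after tick b, tolerating wrap."""
    return 0 < ((a - b) & 0xFFFFFFFF) < 2 ** 31

print(tick_is_after(after, before))    # -> True
```

A process whose "older than" test uses the naive comparison would, exactly as
described, keep failing the test for about half the wrap period after the
counter rolls over, then silently start working again.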