Re: Clients occasionally see truncated responses

2021-03-31 Thread Willy Tarreau
On Wed, Mar 31, 2021 at 09:55:15AM -0700, Nathan Konopinski wrote:
> Thanks Willy, that is what I'm seeing, capture attached. Clients only send
> GETs, no POSTs. What are possible workarounds? Is there a way to ignore the
> client close and keep the connection open longer?

That's not what happens here: I'm not seeing any request at all. There's
a TLS exchange and the client closes immediately after the handshake,
without sending a request. There might be something the client doesn't
like, such as a cipher or something like this. What's surprising is that
the client doesn't close after a response but right after sending its
final handshake message. Out of curiosity, are you sure this trace was
produced by a valid client? Maybe it's just a random scanner probing
your site?

> I'm wondering if nginx is
> doing something like that since we don't see issues with it.

It's difficult to say for now, especially since this trace doesn't show
a request but a spontaneous close.

Have you tried to temporarily disable your ssl-default-bind-options
directive? Maybe the client doesn't like no-tls-tickets, for example?
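
For instance, a minimal sketch of such a test (the exact option list is
only illustrative since I don't know your actual settings):

    global
        # current setting, for example:
        #   ssl-default-bind-options no-sslv3 no-tls-tickets
        # temporarily re-enable TLS session tickets to check whether the
        # client then completes the handshake and sends its request:
        ssl-default-bind-options no-sslv3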

Willy



Re: Clients occasionally see truncated responses

2021-03-31 Thread Willy Tarreau
Hi Nathan,

On Tue, Mar 30, 2021 at 09:21:30AM -0700, Nathan Konopinski wrote:
> Sometimes clients (clients are only HTTP/1.1 and use connection: close) are
> reporting that a body length of ~4000 is less than the content length of ~14000.
> The issue does not appear when using nginx as an LB and I've verified
> complete responses are being sent from the backends for the requests
> clients report errors on.
> 
> It's not clear why a portion of the clients aren't receiving the entire
> response. I'm unable to replicate the issue with curl. I have a vanilla
> config using https, prometheus metrics, and a h1-case-adjust-bogus-client
> option to adjust a couple headers.
> 
> Has anyone come across similar issues? I see an option for request
> buffering but nothing for response buffering. Are there options I can
> adjust that could be related to this type of issue?

No, it's not expected at all and should really never happen. One option
could have caused this, "option nolinger", but you don't have it and
your config is really clean and straightforward.
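
(For reference, a setup along those lines would typically look like the
sketch below; this is purely illustrative, reconstructed from the
description rather than taken from the actual configuration, with
made-up names, addresses and paths.)

    global
        # map lower-case H1 header names to the case some clients expect
        h1-case-adjust content-length Content-Length

    frontend fe_https
        mode http
        bind :443 ssl crt /etc/haproxy/site.pem
        # adjust the case of H1 response headers for clients that need it
        option h1-case-adjust-bogus-client
        # built-in Prometheus exporter (needs haproxy built with it)
        http-request use-service prometheus-exporter if { path /metrics }
        default_backend be_app

    backend be_app
        mode http
        server app1 192.0.2.10:8080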

Could you take a capture of the communications between the clients and
haproxy? The fact that you're using close opens the way for a subtle
issue that affects certain old clients with POST requests. Some of them
send POST requests with a body and, for no particular reason, emit an
extra CRLF half a second to a second later. That CRLF is not part of the
current body, so it is never read, and it can even arrive after the
response.

If haproxy has already sent the response back (and 14kB perfectly fits
in a single buffer so that sounds plausible) and closed (since there's
the connection: close), and the CRLF from the client arrives *after* the
close, then the TCP stack will reset the connection and send a TCP RST
back. First, this causes any pending data to be dropped. Second, when
the client's TCP stack receives the RST, it can also drop some of the
data it had already received but which the application had not read yet.

You don't necessarily need to decrypt HTTPS to detect this. Simply taking
a network capture, looking for RSTs and checking if some non-empty TCP
segments flow from the client to haproxy just before the RST would
already be an indication. What's nasty if you have to deal with this
is that it's totally timing-dependent, and that possible workarounds
are just that, workarounds.

Regards,
Willy



[ANNOUNCE] haproxy-2.2.12

2021-03-31 Thread Willy Tarreau
Hi,

HAProxy 2.2.12 was released on 2021/03/31. It added 29 new commits
after version 2.2.11. This makes 2.2.12 catch up with the fixes that went
into 2.3.9:

  - One issue was a regression of the rate counters, causing those
spanning over a period (like in stick-tables) to increase forever; it
was a consequence of the fix in 2.2.11 that prevents them from being
randomly reset every second.

  - A rare issue causing old processes to abort on reload due to a
deadlock between the listeners and the file descriptors was also
addressed. This one was unveiled in 2.2.10 and was not visible
before due to another bug!

  - In the unlikely event that the watchdog triggers within Lua code
(most likely caused by threads waiting on the Lua lock), it was sometimes
possible to deadlock inside the libc on its own malloc() lock when trying
to dump the Lua backtrace. This was addressed by using the home-grown
backtrace function instead, which doesn't require allocations.

  - Processes built with DEBUG_UAF could deadlock when doing this
under thread isolation.

  - The fix for the too lax hdr_ip() parsing was integrated (it could
incorrectly return only the parsable part of an address if the sender
sent garbage); see the small configuration sketch after this list.

  - The H1 mux's shutdown code was made idempotent (as it ought to
be). Only a single user faces crashes on this one, which is very
strange and indicates that a number of conditions must be met to
trigger it.

  - The SSL fixes for "add ssl crt-list" making inconsistent use of FS
accesses at run time vs boot time were integrated.

  - The down-going-up server state transition on the stats page was
mistakenly reported in the same color as up-going-down.

  - The unix-bind prefix was incorrectly applied to the master CLI socket.
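
For reference, a minimal sketch of a typical hdr_ip() usage affected by
this one (the header name and file path are only illustrative):

    frontend fe_main
        mode http
        bind :80
        # take the client address from the last X-Forwarded-For entry when
        # the connection comes from a trusted proxy; with this fix a value
        # such as "1.2.3.4junk" is rejected instead of yielding 1.2.3.4
        http-request set-src req.hdr_ip(X-Forwarded-For,-1) if { src -f /etc/haproxy/trusted.lst }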

And among the recent ones that were merged into 2.3-maint after 2.3.9:
  - the fix for the silent-drop fallback in IPv6 was merged (the TTL is
set via IPV6_UNICAST_HOPS in this case); see the short sketch after this
list

  - updating the default SSL certificate over the CLI did not work
correctly, as the previous one was not removed, resulting in random
behavior, notably regarding the SNI.
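
For those not using it, silent-drop is typically employed as below (a
minimal sketch, the bind line and ACL file being purely illustrative);
the fix only concerns the TTL manipulation in the IPv6 case:

    frontend fe_main
        bind :443 ssl crt /etc/haproxy/site.pem
        # cut the connection without notifying the client, typically for
        # abusive sources; the fallback lowers the TTL (IPV6_UNICAST_HOPS
        # on IPv6) so that the final RST doesn't reach the client
        tcp-request content silent-drop if { src -f /etc/haproxy/abusers.lst }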

This time I hope that all the recent mess experienced since 2.2.10 was
properly addressed. Those who faced DNS issues when upgrading from 2.2.9
to 2.2.10 or rate counter issues from 2.2.10 to 2.2.11, and who possibly
rolled back to 2.2.9 are strongly encouraged to try again.

Please find the usual URLs below :
   Site index   : http://www.haproxy.org/
   Discourse: http://discourse.haproxy.org/
   Slack channel: https://slack.haproxy.org/
   Issue tracker: https://github.com/haproxy/haproxy/issues
   Wiki : https://github.com/haproxy/wiki/wiki
   Sources  : http://www.haproxy.org/download/2.2/src/
   Git repository   : http://git.haproxy.org/git/haproxy-2.2.git/
   Git Web browsing : http://git.haproxy.org/?p=haproxy-2.2.git
   Changelog: http://www.haproxy.org/download/2.2/src/CHANGELOG
   Cyril's HTML doc : http://cbonte.github.io/haproxy-dconv/

Willy
---
Complete changelog :
Christopher Faulet (7):
  MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua
  BUG/MEDIUM: debug/lua: Don't dump the lua stack if not dumpable
  MINOR: lua: Slightly improve function dumping the lua traceback
  BUG/MEDIUM: debug/lua: Use internal hlua function to dump the lua traceback
  BUG/MEDIUM: lua: Always init the lua stack before referencing the context
  BUG/MEDIUM: thread: Fix a deadlock if an isolated thread is marked as harmless
  BUG/MINOR: payload: Wait for more data if buffer is empty in payload/payload_lv

Eric Salama (1):
  MINOR/BUG: mworker/cli: do not use the unix_bind prefix for the master CLI socket

Florian Apolloner (1):
  BUG/MINOR: stats: Apply proper styles in HTML status page.

Ilya Shipitsin (1):
  BUILD: ssl: guard ecdh functions with SSL_CTX_set_tmp_ecdh macro

Olivier Houchard (1):
  BUG/MEDIUM: fd: Take the fd_mig_lock when closing if no DWCAS is available.

Remi Tricot-Le Breton (4):
  BUG/MINOR: ssl: Prevent disk access when using "add ssl crt-list"
  BUG/MINOR: ssl: Fix update of default certificate
  BUG/MINOR: ssl: Prevent removal of crt-list line if the instance is a default one
  BUG/MINOR: ssl: Add missing free on SSL_CTX in ckch_inst_free

Willy Tarreau (14):
  MINOR: time: also provide a global, monotonic global_now_ms timer
  BUG/MEDIUM: freq_ctr/threads: use the global_now_ms variable
  MINOR: fd: make fd_clr_running() return the remaining running mask
  MINOR: fd: remove the unneeded running bit from fd_insert()
  BUG/MEDIUM: fd: do not wait on FD removal in fd_delete()
  CLEANUP: fd: remove unused fd_set_running_excl()
  MINOR: tools: make url2ipv4 return the exact number of bytes parsed
  BUG/MINOR: http_fetch: make hdr_ip() reject trailing characters
  BUG/MEDIUM: mux-h1: make 

Re: [ANNOUNCE] haproxy-2.3.9

2021-03-31 Thread Willy Tarreau
On Wed, Mar 31, 2021 at 02:29:40PM +0200, Vincent Bernat wrote:
>  ❦ 31 March 2021 12:46 +02, Willy Tarreau:
> 
> > On the kernel Greg solved all this by issuing all versions very
> > frequently: as long as you produce updates faster than users are
> > willing to deploy them, they can choose what to do. It just requires
> > a bandwidth that we don't have :-/ Some weeks several of us work full
> > time on backports and tests! Right now we've reached a point where
> > backports can prevent us from working on mainline, and where this lack
> > of time increases the risk of regressions, and the regressions require
> > more backport time.
> 
> Wouldn't this mean there are too many versions in parallel?

It cannot be summed up this easily. Normally, old versions are not
released often so they don't cost much. But not releasing them often
complicates the backports and their testing so it's still better to
try to feed them along with the other ones. However, releasing them
in parallel with the other ones makes them more likely to be hit by
stupid issues like the last build failure with libmusl. But not releasing
them wouldn't change much, given that build failures in certain environments
are only detected once the release sends the signal that it's time to
update :-/

With this said, while the adoption of non-LTS versions has added one
to two versions to the series, it has significantly reduced the pain
of certain backports precisely because it resulted in splitting the
population of users. So at the cost of ~1 more version in the pipe,
we get more detailed reports from users who are more accustomed to
enabling core dumps, firing gdb, applying patches etc, which reduces
the time spent on bugs and increases the confidence in fixes that get
backported. So I'd say that it remains a very good investment. However
I wanted to make sure we shorten the non-LTS versions' life to limit
the in-field fragmentation. And this works extremely well (I'm very
grateful to our users for this, and I suspect that the status banner
in the executable reminding about EOL helps). We have probably not
seen a single 2.1 report in the issues over the last 3-4 months.
And I expect that 6 months after 2.4 is released, we won't read about
2.3 anymore.

Also, if you dig into the issue tracker, you'll see a noticeable number
of users who agree to run some tests on 2.3 to verify whether it fixes an
issue they face in 2.2. We're usually not asking for an upgrade, just
a test on a very close version. This flexibility is very important as
well.

So the number of parallel versions is one aspect of the problem but
it's also an important part of the solution. I hope we can continue to
maintain short lives for non-LTS but at the same time it must remain a
win-win: if we get useful reports on one version that are valid for
other ones as well, I'm fine with extending it a little bit as we did
for 1.9; there's no reason the ones making the most effort should be the
first ones punished.

Overall the real issue remains the number of bugs we introduce in the
code, and that is unavoidable when working on lower layers where good
test coverage is extremely difficult to achieve. Making smaller and more
detailed patches is mandatory. Continuing to add reg-tests definitely
helps a lot. We've added more than one reg-test per week since 2.3,
which is definitely not bad at all, but this effort must continue! The
CI reports few false positives now and the situation has tremendously
improved over the last 2 years. So with better code we can hope for
fewer bugs, fewer fixes and fewer backports, hence fewer risks of
regression.

> > I think that the real problem arrives when a version becomes generally
> > available in distros. And distro users are often the ones with the least
> > autonomy when it comes to rolling back. When you build from sources,
> > you're more at ease. Thus probably that a nice solution would be to
> > add an idle period between a stable release and its appearance in
> > distros so that it really gets some initial deployment before becoming
> > generally available. And I know that some users complain when they do
> > not immediately see their binary package, but that's something we can
> > easily explain and document. We could even indicate a level of confidence
> > in the announce messages. It has the merit of respecting the principle
> > of least surprise for everyone in the chain, including those like you
> > and me involved in the release cycle and who did not necessarily plan
> > to stop all activities to work on yet-another-release because the
> > long-awaited fix-of-the-month broke something and its own fix broke
> > something else.
> 
> We can do that. In the future, I may even tackle all the problems at
> once: providing easy access to old versions and have two versions of
> each repository: one with new versions immediately available and one
> with a semi-fixed delay.

Ah I really like this! Your packages definitely are the most exposed
ones so this could very 

[ANNOUNCE] haproxy-1.7.14

2021-03-31 Thread Willy Tarreau
Hi,

HAProxy 1.7.14 was released on 2021/03/31. It added 7 new commits after
version 1.7.13, all of which are minor fixes. The main one addresses a
build regression when libmusl is used. The other ones are:
  - a fix for the too lax hdr_ip() parsing
  - a fix for the IPv6 fallback of a failed silent-drop action
  - a fix for a parsing issue in SPOE that was fixed by accident in 1.8
but which could result in a desynchronized stream on framing error.

Unless you think you're affected by any of them there's no need to upgrade
if you already deployed 1.7.13 successfully.

Please find the usual URLs below :
   Site index   : http://www.haproxy.org/
   Discourse: http://discourse.haproxy.org/
   Slack channel: https://slack.haproxy.org/
   Issue tracker: https://github.com/haproxy/haproxy/issues
   Wiki : https://github.com/haproxy/wiki/wiki
   Sources  : http://www.haproxy.org/download/1.7/src/
   Git repository   : http://git.haproxy.org/git/haproxy-1.7.git/
   Git Web browsing : http://git.haproxy.org/?p=haproxy-1.7.git
   Changelog: http://www.haproxy.org/download/1.7/src/CHANGELOG
   Cyril's HTML doc : http://cbonte.github.io/haproxy-dconv/

Willy
---
Complete changelog :
Willy Tarreau (7):
  BUILD: ebtree: fix build on libmusl after recent introduction of eb_memcmp()
  MINOR: tools: make url2ipv4 return the exact number of bytes parsed
  BUG/MINOR: http_fetch: make hdr_ip() reject trailing characters
  BUG/MINOR: http_fetch: make hdr_ip() resistant to empty fields
  BUG/MINOR: tcp: fix silent-drop workaround for IPv6
  BUILD: tcp: use IPPROTO_IPV6 instead of SOL_IPV6 on FreeBSD/MacOS
  BUG/MINOR: spoe: fix handling of truncated frame

---



Re: [ANNOUNCE] haproxy-2.3.9

2021-03-31 Thread Vincent Bernat
 ❦ 31 March 2021 12:46 +02, Willy Tarreau:

> On the kernel Greg solved all this by issuing all versions very
> frequently: as long as you produce updates faster than users are
> willing to deploy them, they can choose what to do. It just requires
> a bandwidth that we don't have :-/ Some weeks several of us work full
> time on backports and tests! Right now we've reached a point where
> backports can prevent us from working on mainline, and where this lack
> of time increases the risk of regressions, and the regressions require
> more backport time.

Wouldn't this mean there are too many versions in parallel?

> I think that the real problem arrives when a version becomes generally
> available in distros. And distro users are often the ones with the least
> autonomy when it comes to rolling back. When you build from sources,
> you're more at ease. Thus probably that a nice solution would be to
> add an idle period between a stable release and its appearance in
> distros so that it really gets some initial deployment before becoming
> generally available. And I know that some users complain when they do
> not immediately see their binary package, but that's something we can
> easily explain and document. We could even indicate a level of confidence
> in the announce messages. It has the merit of respecting the principle
> of least surprise for everyone in the chain, including those like you
> and me involved in the release cycle and who did not necessarily plan
> to stop all activities to work on yet-another-release because the
> long-awaited fix-of-the-month broke something and its own fix broke
> something else.

We can do that. In the future, I may even tackle all the problems at
once: providing easy access to old versions and having two versions of
each repository, one with new versions immediately available and one
with a semi-fixed delay.
-- 
April 1

This is the day upon which we are reminded of what we are on the other three
hundred and sixty-four.
-- Mark Twain, "Pudd'nhead Wilson's Calendar"



Re: [ANNOUNCE] haproxy-2.3.9

2021-03-31 Thread Julien Pivotto
Hello,

Just giving my feedback on part of the story:

On 31 Mar 12:46, Willy Tarreau wrote:
> On the kernel Greg solved all this by issuing all versions very
> frequently: as long as you produce updates faster than users are
> willing to deploy them, they can choose what to do. It just requires
> a bandwidth that we don't have :-/ Some weeks several of us work full
> time on backports and tests! Right now we've reached a point where
> backports can prevent us from working on mainline, and where this lack
> of time increases the risk of regressions, and the regressions require
> more backport time.

I just want to say that I greatly appreciate the backport policy of
HAProxy. I often see really small bugs or even small improvements being
backported, where I personally would have been happy with them just
fixed on devel. This is greatly appreciated!

-- 
 (o-Julien Pivotto
 //\Open-Source Consultant
 V_/_   Inuits - https://www.inuits.eu




Re: [2.2.9] 100% CPU usage

2021-03-31 Thread Maciej Zdeb
I forgot to mention that the backtrace is from 2.2.11, built from
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=601704962bc9d82b3b1cc97d90d2763db0ae4479

Wed, 31 Mar 2021 at 13:28, Maciej Zdeb  wrote:

> Hi,
>
> Well it's a bit better situation than earlier because only one thread is
> looping forever and the rest is working properly. I've tried to verify
> where exactly the thread looped but doing "n" in gdb fixed the problem :(
> After quitting gdb session all threads were idle. Before I started gdb it
> looped about 3h not serving any traffic, because I've put it into
> maintenance as soon as I observed abnormal cpu usage.
>
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> 0x7f2cf0df6a47 in epoll_wait (epfd=3, events=0x55d7aaa04920,
> maxevents=200, timeout=timeout@entry=39) at
> ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> 30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
> (gdb) thread 11
> [Switching to thread 11 (Thread 0x7f2c3c53d700 (LWP 20608))]
> #0  trace (msg=..., cb=, a4=, a3= out>, a2=, a1=, func=,
> where=..., src=, mask=,
> level=) at include/haproxy/trace.h:149
> 149 if (unlikely(src->state != TRACE_STATE_STOPPED))
> (gdb) bt
> #0  trace (msg=..., cb=, a4=, a3= out>, a2=, a1=, func=,
> where=..., src=, mask=,
> level=) at include/haproxy/trace.h:149
> #1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
> head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
> #2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
> src/mux_h2.c:3329
> #3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
> #4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
> src/mux_h2.c:3624
> #5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
> status=) at src/mux_h2.c:3583
> #6  0x55d7a4381f62 in run_tasks_from_lists 
> (budgets=budgets@entry=0x7f2c3c51a35c)
> at src/task.c:454
> #7  0x55d7a438282d in process_runnable_tasks () at src/task.c:679
> #8  0x55d7a4339467 in run_poll_loop () at src/haproxy.c:2942
> #9  0x55d7a4339819 in run_thread_poll_loop (data=) at
> src/haproxy.c:3107
> #10 0x7f2cf1e606db in start_thread (arg=0x7f2c3c53d700) at
> pthread_create.c:463
> #11 0x7f2cf0df671f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> (gdb) bt full
> #0  trace (msg=..., cb=, a4=, a3= out>, a2=, a1=, func=,
> where=..., src=, mask=,
> level=) at include/haproxy/trace.h:149
> No locals.
> #1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
> head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
> h2s = 
> h2s_back = 
> __FUNCTION__ = "h2_resume_each_sending_h2s"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
> src/mux_h2.c:3329
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
> flags = 
> released = 
> buf = 
> conn = 0x7f2bf658b8d0
> done = 0
> sent = 0
> __FUNCTION__ = "h2_send"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> ---Type  to continue, or q  to quit---
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
> src/mux_h2.c:3624
> conn = 0x7f2bf658b8d0
> __FUNCTION__ = "h2_process"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
> status=) at src/mux_h2.c:3583
> conn = 0x7f2bf658b8d0
> tl = 
> conn_in_list = 0
> h2c = 0x7f2c18dca740
> ret = 
> __FUNCTION__ = "h2_io_cb"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #6  0x55d7a4381f62 in run_tasks_from_lists 
> (budgets=budgets@entry=0x7f2c3c51a35c)
> at src/task.c:454
> process = 
> tl_queues = 
> t = 0x7f2c0d3fa1c0
> budget_mask = 7 '\a'
> done = 
> queue = 
> state = 
> ---Type  to continue, or q  to quit---
> ctx = 
> __ret = 
> __n = 
> __p = 
> #7  0x55d7a438282d in process_runnable_tasks () at 

Re: [2.2.9] 100% CPU usage

2021-03-31 Thread Maciej Zdeb
Hi,

Well, the situation is a bit better than earlier because only one thread
is looping forever and the rest are working properly. I tried to verify
where exactly the thread was looping, but doing "n" in gdb fixed the
problem :( After quitting the gdb session all threads were idle. Before I
started gdb it had been looping for about 3 hours without serving any
traffic, because I had put the process into maintenance as soon as I
observed the abnormal CPU usage.

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7f2cf0df6a47 in epoll_wait (epfd=3, events=0x55d7aaa04920,
maxevents=200, timeout=timeout@entry=39) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) thread 11
[Switching to thread 11 (Thread 0x7f2c3c53d700 (LWP 20608))]
#0  trace (msg=..., cb=, a4=, a3=, a2=, a1=, func=,
where=..., src=, mask=,
level=) at include/haproxy/trace.h:149
149 if (unlikely(src->state != TRACE_STATE_STOPPED))
(gdb) bt
#0  trace (msg=..., cb=, a4=, a3=, a2=, a1=, func=,
where=..., src=, mask=,
level=) at include/haproxy/trace.h:149
#1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
#2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
src/mux_h2.c:3329
#3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
#4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
src/mux_h2.c:3624
#5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
status=) at src/mux_h2.c:3583
#6  0x55d7a4381f62 in run_tasks_from_lists
(budgets=budgets@entry=0x7f2c3c51a35c)
at src/task.c:454
#7  0x55d7a438282d in process_runnable_tasks () at src/task.c:679
#8  0x55d7a4339467 in run_poll_loop () at src/haproxy.c:2942
#9  0x55d7a4339819 in run_thread_poll_loop (data=) at
src/haproxy.c:3107
#10 0x7f2cf1e606db in start_thread (arg=0x7f2c3c53d700) at
pthread_create.c:463
#11 0x7f2cf0df671f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) bt full
#0  trace (msg=..., cb=, a4=, a3=, a2=, a1=, func=,
where=..., src=, mask=,
level=) at include/haproxy/trace.h:149
No locals.
#1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
h2s = 
h2s_back = 
__FUNCTION__ = "h2_resume_each_sending_h2s"
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
src/mux_h2.c:3329
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
flags = 
released = 
buf = 
conn = 0x7f2bf658b8d0
done = 0
sent = 0
__FUNCTION__ = "h2_send"
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
---Type  to continue, or q  to quit---
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
src/mux_h2.c:3624
conn = 0x7f2bf658b8d0
__FUNCTION__ = "h2_process"
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
status=) at src/mux_h2.c:3583
conn = 0x7f2bf658b8d0
tl = 
conn_in_list = 0
h2c = 0x7f2c18dca740
ret = 
__FUNCTION__ = "h2_io_cb"
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#6  0x55d7a4381f62 in run_tasks_from_lists
(budgets=budgets@entry=0x7f2c3c51a35c)
at src/task.c:454
process = 
tl_queues = 
t = 0x7f2c0d3fa1c0
budget_mask = 7 '\a'
done = 
queue = 
state = 
---Type  to continue, or q  to quit---
ctx = 
__ret = 
__n = 
__p = 
#7  0x55d7a438282d in process_runnable_tasks () at src/task.c:679
tt = 0x55d7a47a6d00 
lrq = 
grq = 
t = 
max = {0, 0, 141}
max_total = 
tmp_list = 
queue = 3
max_processed = 
#8  0x55d7a4339467 in run_poll_loop () at src/haproxy.c:2942
next = 
wake = 
#9  0x55d7a4339819 in run_thread_poll_loop (data=) at
src/haproxy.c:3107
ptaf = 
ptif = 
ptdf = 
ptff = 
init_left = 0
init_mutex = pthread_mutex_t = {Type = Normal, Status = Not
acquired, Robust 

Re: [ANNOUNCE] haproxy-2.3.9

2021-03-31 Thread Willy Tarreau
Hi Vincent!

On Wed, Mar 31, 2021 at 12:11:32PM +0200, Vincent Bernat wrote:
> It's a bit annoying that fixes reach a LTS version before the non-LTS
> one. The upgrade scenario is one annoyance, but if there is a
> regression, you also impact far more users.

I know, this is also why I'm quite a bit irritated by this.

> You could tag releases in
> git (with -preX if needed) when preparing the releases and then issue
> the release with a few days apart.

In practice the tag serves no purpose, but that leads to the same
principle as leaving some fixes pending in the -next branch.

> Users of older versions will have
> less frequent releases in case regressions are spotted, but I think
> that's the general expectation: if you are running older releases it's
> because you don't have time to upgrade and it's good enough for you.

I definitely agree with this and that's also how I'm using LTS versions
of various software and why we try to put more care on LTS versions here.

> For example:
>  - 2.3, monthly release or when there is a big regression
>  - 2.2, 3 days after 2.3
>  - 2.0, 3 days after 2.2, skip one out of two releases
>  - 1.8, 3 days after 2.0, skip one out of four releases
> 
> So, you have a 2.3.9. At the same time, you tag 2.2.12-pre1 (to be
> released in 3 working days if everything is fine) and you skip skip 2.0
> and 1.8 this time because they were releases to match 2.3.8. Next time,
> you'll have a 2.0.22-pre1 but no 1.8.30-pre1 yet.

This will not work. I tried this when I was maintaining kernels, and the
reality is that users who stumble on a bug want their fix. And worse,
their stability expectations when running on older releases make them
even more impatient, because 1) older releases *are* expected to be
reliable, 2) they're deployed on sensitive machines, where the business
is, and 3) it's expected there are very few pending fixes so for them
there's no justification for delaying the fix they're waiting for.

> If for some reason, there is an important regression in 2.3.9 you want
> to address, you release a 2.3.10 and a 2.2.12-pre2, still no 2.0.22-pre1
> nor 1.8.30-pre1. Hopefully, no more regressions spotted, you tag 2.2.12
> on top of 2.2.12-pre2 and issue a release.

The thing is, the -pre releases will just be tags of no use at all.
Maintenance branches collect fixes all the time and either you're on a
release or you're following -git. And quite frankly, most stable users
are on a point release because by definition that's what they need. What
I'd like to do is to maintain a small delay between versions, but there
is no need to maintain particularly long delays past the next LTS.

What needs to be particularly protected are the LTS versions as a whole.
There are more users affected by 2.2 breakage than by 2.0 breakage, and
the risk is the same for each of them. So instead we should make sure that all
versions starting from the first LTS past the latest branch will be
slightly delayed. But there's no need to further enforce a delay between
them.

What this means is that when issuing a 2.3 release, we can wait a bit
before issuing the 2.2, and then once 2.2 is emitted, most of the
potential damage is already done, so there's no reason for keeping older
ones on hold as it can only force their users to live with known bugs.

And when the latest branch is an LTS (like in a few months once 2.4 is
out), we'd emit 2.4 and 2.3 together, then wait a bit and emit 2.2 and
the other ones. This maintains the principle that the LTS before the
latest branch should be very stable.

With this said, there remains the problem of the late fixes I mentioned,
those discovered during this grace period. The tricky ones can wait
in the -next branch, but the other ones should be integrated, otherwise
the nasty effect is that users think "let's not upgrade to this one but
wait for the next one so that I do not have to schedule another update
later and that I collect all fixes at once". But if we integrate
sensitive fixes in 2.2 that were not yet in a released 2.3, those
upgrading will face some breakage.

On the kernel Greg solved all this by issuing all versions very
frequently: as long as you produce updates faster than users are
willing to deploy them, they can choose what to do. It just requires
a bandwidth that we don't have :-/ Some weeks several of us work full
time on backports and tests! Right now we've reached a point where
backports can prevent us from working on mainline, and where this lack
of time increases the risk of regressions, and the regressions require
more backport time.

I think that the real problem arrives when a version becomes generally
available in distros. And distro users are often the ones with the least
autonomy when it comes to rolling back. When you build from sources,
you're more at ease. Thus probably that a nice solution would be to
add an idle period between a stable release and its appearance in
distros so that it really gets some initial deployment before becoming
generally 

Re: [ANNOUNCE] haproxy-2.3.9

2021-03-31 Thread Vincent Bernat
 ❦ 31 March 2021 10:35 +02, Willy Tarreau:

>> Thanks Willy for the quick update. That's a good example to avoid
>> pushing stable versions at the same time, so we have opportunities to
>> find those regressions.
>
> I know and we're trying to separate them but it considerably increases the
> required effort. In addition there is a nasty effect resulting from shifted
> releases, which is that it ultimately results in older releases possibly
> having more recent fixes than recent ones. And it will happen again with
> 2.2.12 which I hope to issue today. It will contain the small fix for the
> silent-drop issue (which is already in 2.3 of course) but was merged after
> 2.3.9. The reporter of the issue is on 2.2, it would not be fair to him to
> release another 2.2 without it (or we'd fall into a bureaucratic process
> that doesn't serve users anymore). So 2.2.12 will contain this fix. But
> if the person finally decides to upgrade to 2.3.9 a week or two later, she
> may face the bug again. It's not a dramatic one so that's acceptable, but
> that shows the difficulties of the process.

It's a bit annoying that fixes reach an LTS version before the non-LTS
one. The upgrade scenario is one annoyance, but if there is a
regression, you also impact far more users. You could tag releases in
git (with -preX if needed) when preparing the releases and then issue
the releases a few days apart. Users of older versions will have
less frequent releases in case regressions are spotted, but I think
that's the general expectation: if you are running older releases it's
because you don't have time to upgrade and it's good enough for you.

For example:
 - 2.3, monthly release or when there is a big regression
 - 2.2, 3 days after 2.3
 - 2.0, 3 days after 2.2, skip one out of two releases
 - 1.8, 3 days after 2.0, skip one out of four releases

So, you have a 2.3.9. At the same time, you tag 2.2.12-pre1 (to be
released in 3 working days if everything is fine) and you skip 2.0
and 1.8 this time because they were released to match 2.3.8. Next time,
you'll have a 2.0.22-pre1 but no 1.8.30-pre1 yet.

If for some reason, there is an important regression in 2.3.9 you want
to address, you release a 2.3.10 and a 2.2.12-pre2, still no 2.0.22-pre1
nor 1.8.30-pre1. Hopefully, no more regressions spotted, you tag 2.2.12
on top of 2.2.12-pre2 and issue a release.
-- 
He hath eaten me out of house and home.
-- William Shakespeare, "Henry IV"



Re: [ANNOUNCE] haproxy-2.3.9

2021-03-31 Thread Willy Tarreau
On Wed, Mar 31, 2021 at 10:17:35AM +0200, William Dauchy wrote:
> On Tue, Mar 30, 2021 at 6:59 PM Willy Tarreau  wrote:
> > HAProxy 2.3.9 was released on 2021/03/30. It added 5 new commits
> > after version 2.3.8.
> >
> > This essentially fixes the rate counters issue that popped up in 2.3.8
> > after the previous fix for the rate counters already.
> >
> > What happened is that the internal time in millisecond wraps every 49.7
> > days and that the new global counter used to make sure rate counters are
> > now stable across threads starts at zero and is initialized when older
> > than the current thread's current date. It just happens that the wrapping
> > happened a few hours ago at "Mon Mar 29 23:59:46 CEST 2021" exactly and
> > that any process started since this date and for the next 24 days doesn't
> > validate this condition anymore, hence doesn't rotate its rate counters
> > anymore.
> 
> Thanks Willy for the quick update. That's a good example to avoid
> pushing stable versions at the same time, so we have opportunities to
> find those regressions.

I know and we're trying to separate them but it considerably increases the
required effort. In addition there is a nasty effect resulting from shifted
releases, which is that it ultimately results in older releases possibly
having more recent fixes than recent ones. And it will happen again with
2.2.12 which I hope to issue today. It will contain the small fix for the
silent-drop issue (which is already in 2.3 of course) but was merged after
2.3.9. The reporter of the issue is on 2.2; it would not be fair to him to
release another 2.2 without it (or we'd fall into a bureaucratic process
that doesn't serve users anymore). So 2.2.12 will contain this fix. But
if the person finally decides to upgrade to 2.3.9 a week or two later, she
may face the bug again. It's not a dramatic one so that's acceptable, but
that shows the difficulties of the process.

In an ideal world, there would be lots of tests in production on stable
versions. The reality is that nobody (me included) is interested in upgrading
prod servers running flawlessly to just confirm there's no nasty surprise
with the forthcoming release, because either there's a bug and you prefer
someone else to spot it first, or there's no problem and you'll upgrade
once the final version is ready.

With this option left off the table, it's clear that the only option that
remains is the shifted versions. But here it would not even have provided
anything because the code worked on Monday and broke on Tuesday!

What I think we can try to do (and we discussed this with the other
co-maintainers) is to push the patches but not immediately emit the releases
(so that the backport work is still factored), and to keep the tricky
patches in the -next branch to prevent them from being backported too far
too fast (it will save us from the risk of missing them if not merged).

Overall the most important solution is that we release often enough so
that in case of a regression that affects some users, they can stay on
the previous version a little bit more without having to endure too many
bugs. And if we don't have too many fixes per release, it's easy to emit
yet another small one immediately after to fix a single regression. But
over the last week we've been flooded on multiple channels by many reports
and then it becomes really hard to focus on a single issue at once for a
release :-/

Cheers,
Willy



Re: [ANNOUNCE] haproxy-2.3.9

2021-03-31 Thread William Dauchy
On Tue, Mar 30, 2021 at 6:59 PM Willy Tarreau  wrote:
> HAProxy 2.3.9 was released on 2021/03/30. It added 5 new commits
> after version 2.3.8.
>
> This essentially fixes the rate counters issue that popped up in 2.3.8
> after the previous fix for the rate counters already.
>
> What happened is that the internal time in millisecond wraps every 49.7
> days and that the new global counter used to make sure rate counters are
> now stable across threads starts at zero and is initialized when older
> than the current thread's current date. It just happens that the wrapping
> happened a few hours ago at "Mon Mar 29 23:59:46 CEST 2021" exactly and
> that any process started since this date and for the next 24 days doesn't
> validate this condition anymore, hence doesn't rotate its rate counters
> anymore.

Thanks Willy for the quick update. That's a good example of why we should
avoid pushing stable versions at the same time, so we have opportunities
to find those regressions.

-- 
William