Hello,

On Wed, 10 Jul 2024 at 16:39, Willy Tarreau <w...@1wt.eu> wrote:
>
> Another change that will need to be backported after some time concerns
> the handling of default FD limits. For a few decades, operating systems
> would advertise safe limits (i.e. those they were able to deal with based
> on their amount of RAM). We've seen a few attempts at bumping the hard
> limit beyond 1 billion FDs many years ago that were reverted due to
> breaking many apps. Now it seems it's coming back, via systemd-256 setting
> the hard-limit from the kernel's nr_open variable (which in itself is not
> necessarily a bad thing -- proof that I'm not always bashing systemd, only
> when needed :-)). But with some machines showing extreme nr_open (I still
> don't know why) we're back to square one where it's possible for haproxy
> to try to start with a limit set to one billion FDs. Not only would this
> eat at least 64GB of RAM just for the fdtab itself, it also takes ages to
> start, and fortunately the watchdog quickly puts an end to this mess...
> We already have an fd-hard-limit global setting that allows setting a hard
> limit on the number of FDs, but not everyone knows about it or uses it.
> What we've found to be the most reasonable is to consider that
> fd-hard-limit now has a default value of 1048576, which matches what was
> almost always the default hard limit, so that when not set, it's like it
> used to be till now. That's sufficient for the vast majority of use cases,
> and trust me, the rare users who need to support more than 500k concurrent
> connections are pretty much aware of all related tunables and already make
> use of them, so it's expected that nobody should observe any change.
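
For reference, anyone who actually needs a specific ceiling can already pin
it explicitly in the global section instead of inheriting whatever hard
RLIMIT_NOFILE systemd hands down. A minimal sketch with a purely
illustrative value (at the roughly 64 bytes per fdtab entry implied by the
64GB-for-one-billion-FDs figure above, 1048576 FDs keeps the table around
64MB):

    global
        # Cap the FD space haproxy sizes its tables for, independent of the
        # (possibly huge) hard limit inherited from the environment.
        fd-hard-limit 1048576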

I wholeheartedly hate implicit default limits and I also pretty much
disagree with fd-hard-limit in general, but allow me to quote your own
comment from GitHub issue #2043:
https://github.com/haproxy/haproxy/issues/2043#issuecomment-1433593837

> we used to have a 2k maxconn limit for a very long time and it was causing
> much more harm than such an error: the process used to start well and was
> working perfectly fine until the day there was a big rush on the site and it
> wouldn't accept more connections than the default limit. I'm not that much
> tempted by setting new high default limits. We do have some users running
> with 2+ million concurrent connections, or roughly 5M FDs. That's already way
> above what most users would consider an acceptable default limit, and anything
> below this could mean that such users wouldn't know about the setting and
> could get trapped.

I disagree that we need to heuristically guess these values, as I
believe I have said in the past.

"But containers ..." should not be an argument to forgo the principle
of least surprise.

There are ways to push defaults like this out if really needed: via
default configuration files, like the ones we ship in examples/ and like
distributions provide in their repositories. Users will then find the
default spelled out in the configuration file and can look it up in the
documentation if they want.
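
For instance, a distribution-shipped haproxy.cfg could carry something like
the following hypothetical snippet, making the ceiling visible and pointing
at the documentation instead of baking it in silently:

    global
        # Upper bound on the number of FDs this process will use. Raise it
        # together with maxconn and "ulimit -n" if you really need more;
        # see "fd-hard-limit" in the configuration manual.
        fd-hard-limit 1048576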

At the very least we need a *stern* configuration warning that we now
default to 1M FDs, although I would personally consider this situation
(none of fd-hard-limit, ulimit or a global maxconn being set) falling
back to a heuristic fd-hard-limit to be a critical error.

I also consider backporting this change - even with a configuration
warning - dangerous.


cheers,
lukas
