Hi Calvin,
I'm really glad you were able to get things sorted out, and I apologise
if the thread got testy. I do appreciate your follow-up, which I think
will benefit readers looking for similar answers.
A few inline thoughts:
On 6/15/20 4:04 PM, Calvin Ellison wrote:
> I attempted to reproduce the original breakdown around 3000 CPS
> using the default 212992 byte receive buffer and could not, which
> tells me I broke a cardinal rule of load testing and changed more
> than one thing at a time. Also, don't do load testing when tired.
> I suspect that I had also made a change to the sipp scenario
> recv/sched loops, or I had unknowingly broken something while
> checking out the tuned package.
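(A point of reference: 212992 bytes is the stock
net.core.rmem_default on current Linux kernels, viewable with
"sysctl net.core.rmem_default", so this run was indeed against an
untouched default.)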
In several decades of doing backend systems programming, I've not found
tuning Linux kernel defaults to be generally fruitful for improving
throughput to any non-trivial degree. The defaults are sensible for
almost all use-cases, all the more so given modern hardware and
multi-core processors and the rest.
This is in sharp contrast to the conservative defaults some applications
(e.g. Apache, MySQL) ship with on many distributions. I think the idea
behind such conservative settings is to constrain the application so
that in the event of a DDoS or similar event, it does not take over all
available hardware resources, which would impede response and resolution.
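(To name names, and going from memory: MySQL has long shipped with
max_connections=151, and Apache's prefork MPM caps workers in the
low hundreds by default; both are deliberately modest relative to
what typical hardware could sustain.)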
But on the kernel settings, the only impactful changes I have ever
seen are minor adjustments that slightly improve very niche server
load problems of a rather global nature (e.g. related to I/O
scheduling, NIC issues, storage, etc.). This wasn't that kind of
scenario.
In most respects, it just follows from first principles and Occam's
Razor, IMHO. There's no reason for kernels to ship tuned so
unnecessarily conservatively as to deny average users something on
the order of _several times_ more performance from their hardware,
and any effort to do so would be readily apparent and, it stands to
reason, staunchly opposed. It therefore also stands to reason that
there isn't some silver bullet or magic setting that unlocks
multiplicative performance gains if only one knows the secret sauce
or thinks to tweak it, for the simple reason that if such a tweak
existed, the artificially conservative default would long since have
been rationalised away, absent a clear and persuasive basis for such
a contrived limit to exist. I cannot conceive of what such a basis
would look like, and I'd like to think that's not just a failure of
imagination.
Or, in other words, it accords with the commonsense intuition that
if something seems too good to be true, it is. The fundamentals of
the application, and to a lesser but still very significant extent
the hardware (given its relative homogeneity nowadays), determine
99.9% of the performance characteristics, and matter a thousand
times more than anything one can tweak.
> I deeply appreciate Alex's insistence that I was wrong and to keep
> digging. I am happy to retract my claim regarding "absolutely
> terrible sysctl defaults". Using synchronous/blocking DB queries,
> the 8-core server reached 14,000 CPS, at which point I declared it
> fixed and went to bed. It could probably go higher: there's only
> one DB query with a <10ms response time, Memcache for the query
> response, and some logic to decide how to respond. There's only a
> single non-200 final response, so it's probably as minimalist as
> it gets.
I would agree that with such a minimal call processing loop and a
generous number of CPU cores, you shouldn't be terribly limited.
> If anyone else is trying to tune their setup, I think Alex's
> advice to "not run more than 2 * (CPU threads) [children]" is the
> best place to start. I had inherited this project from someone
> else's work under version 1.11, and they had used 128 children.
> They were using remote DB servers with much higher latency than
> the local DBs we have today, so that might have been the reason.
> Or they were just wrong to begin with.
Aye. Barring a workload consisting of exceptionally latent blocking
service queries, there's really no valid reason to ever have that
many child processes, and even if one does have such a workload,
there are plenty of reasons to attack the fundamental latency
problem itself rather than work around it with more child processes.
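To put rough numbers on that (mine, not Calvin's): with fully
blocking workers, the number of children you need is roughly given
by Little's law,

    children needed ~ call rate * per-call blocking time
                    = 3000 CPS * 0.010 s ~ 30

so Alex's 2 * (CPU threads) rule (e.g. 32 on an 8-core/16-thread
box) covers that comfortably; in 1.x-era opensips.cfg terms, that
would be children=32. The inherited 128 would only start to earn
its keep if per-call blocking approached 40+ ms (128 / 3000 CPS),
at which point the latency itself is the thing to fix.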
With the proviso that I am not an expert in modern-day OpenSIPS
concurrency innards, the common OpenSER heritage prescribes a preforked
worker process pool with SysV shared memory for inter-process
communication (IPC). Like any shared memory space, this requires mutex
locking so that multiple threads (in this case, processes) don't
access/modify the same data structures at the same time* in ways that
step on each other. Because every process holds and waits on these
locks, this model works well when there aren't very many processes,
when their path to execution is mostly clear and not especially
volatile, and when as little data is shared as possible.
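To make that architecture concrete, below is a minimal,
self-contained C sketch of the general pattern (prefork + SysV
shared memory + a process-shared mutex). To be clear, this is my
illustration of the model, not OpenSIPS code; OpenSIPS has its own
shared-memory allocator and locking primitives, and error handling
is elided here for brevity.

    /* Sketch of a preforked worker pool sharing state via SysV
     * shared memory, guarded by a process-shared mutex.
     * Build with: cc -pthread shm_demo.c */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <pthread.h>

    struct shared_state {
        pthread_mutex_t lock;  /* must be PTHREAD_PROCESS_SHARED */
        long counter;          /* stand-in for shared SIP state  */
    };

    int main(void) {
        /* Create and attach a SysV shared memory segment. */
        int shmid = shmget(IPC_PRIVATE, sizeof(struct shared_state),
                           IPC_CREAT | 0600);
        struct shared_state *st = shmat(shmid, NULL, 0);

        /* Initialise a mutex usable across process boundaries. */
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&st->lock, &attr);
        st->counter = 0;

        /* Prefork a small pool of workers, each contending for
         * the lock before touching the shared counter. */
        for (int i = 0; i < 4; i++) {
            if (fork() == 0) {
                for (int j = 0; j < 100000; j++) {
                    pthread_mutex_lock(&st->lock);
                    st->counter++;
                    pthread_mutex_unlock(&st->lock);
                }
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;  /* reap all workers */

        printf("counter = %ld\n", st->counter);  /* 400000 if the
                                                    locking holds */
        shmctl(shmid, IPC_RMID, NULL);  /* release the segment */
        return 0;
    }

The failure mode described above shows up here in miniature: the
more workers contending for st->lock, the more time each spends
parked on the mutex rather than doing work, which is why a modest
pool with a mostly clear path to execution wins.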