Hi Calvin,

I'm really glad you were able to get things sorted out, and I apologise if the thread got testy. I do appreciate your follow-up, which I think will benefit readers looking for similar answers.

A few inline thoughts:

On 6/15/20 4:04 PM, Calvin Ellison wrote:

I attempted to reproduce the original breakdown around 3000 CPS using the default 212992 byte receive buffer and could not, which tells me I broke a cardinal rule of load testing and changed more than one thing at a time. Also, don't do load testing when tired. I suspect that I had also made a change to the sipp scenario recv/sched loops, or I had unknowingly broken something while checking out the tuned package.
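
As an aside, that 212992 figure is just the stock net.core.rmem_default on most recent kernels. If anyone reading along later wants to sanity-check what their own box hands out, a quick sketch along these lines (plain Python, nothing OpenSIPS-specific) shows both the sysctl value and what a freshly created UDP socket actually reports:

    import socket

    # The kernel's default receive buffer size, in bytes.
    with open("/proc/sys/net/core/rmem_default") as f:
        print("net.core.rmem_default:", f.read().strip())

    # What a fresh UDP socket reports for SO_RCVBUF; on an untuned system
    # this will normally match the sysctl above.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    print("SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    s.close()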

In several decades of doing backend systems programming, I've not found tuning Linux kernel defaults to be generally fruitful for improving throughput to any non-trivial degree. The defaults are sensible for almost all use-cases, all the more so given modern hardware and multi-core processors and the rest.

This is in sharp contrast to the conservative defaults some applications (e.g. Apache, MySQL) ship with on many distributions. I think the idea behind such conservative settings is to constrain the application so that in the event of a DDoS or similar event, it does not take over all available hardware resources, which would impede response and resolution.

But on the kernel settings, the only impactful changes I have ever seen are minor adjustments that slightly improve very niche, system-wide server load problems (e.g. related to I/O scheduling, NIC issues, storage). This wasn't that kind of scenario.

In most respects, it just follows from first principles and Occam's Razor, IMHO. There's no reason for kernels to ship tuned so unnecessarily conservatively as to deny average users something on the order of _several times'_ more performance from their hardware; any effort to do so would be readily apparent and, it stands to reason, staunchly opposed. It therefore also stands to reason that there isn't some silver bullet or magic setting that unlocks multiplicative performance gains if only one knows the secret sauce or thinks to tweak it, for the simple reason that if such a tweak existed, the limit it works around would long since have been rationalised away, absent a clear and persuasive basis for such an artificial and contrived limit to exist. I cannot conceive of what such a basis would look like, and I'd like to think that's not just a failure of imagination.

Or, in other words, it goes with the commonsense intuition that if it seems too good to be true, it is. The fundamentals of the application, and to a lesser but still very significant extent the hardware (in terms of its relative homogeneity nowadays), determine 99.9% of the performance characteristics, and matter a thousand times more than anything one can tweak.

I deeply appreciate Alex's insistence that I was wrong and that I should keep digging. I am happy to retract my claim regarding "absolutely terrible sysctl defaults". Using synchronous/blocking DB queries, the 8-core server reached 14,000 CPS, at which point I declared it fixed and went to bed. It could probably go higher: there's only one DB query with a <10ms response time, Memcache for the query response, and some logic to decide how to respond. There's only a single non-200 final response, so it's probably as minimalist as it gets.

I would agree that with such a minimal call processing loop and a generous number of CPU cores, you shouldn't be terribly limited.

If anyone else is trying to tune their setup, I think Alex's advice to "not run more than 2 * (CPU threads) [children]" is the best place to start. I had inherited this project from someone else's work under version 1.11, and they had used 128 children. They were using remote DB servers with much higher latency than the local DBs we have today, so that might have been the reason. Or they were just wrong to begin with.

Aye. Barring a workload consisting of exceptionally high-latency blocking service queries, there's really no valid reason to ever have that many child processes, and even if one does have such a workload, there are plenty of reasons to attack the underlying latency problem rather than work around it with more child processes.
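
For the archives, the arithmetic behind that rule of thumb is a one-liner; something along these lines (purely illustrative, not anything shipped with OpenSIPS) gives a sane starting cap to tune from:

    import os

    # Rule of thumb from this thread: cap child processes at roughly
    # 2x the available hardware threads, then adjust from measurement.
    def suggested_children(multiplier=2):
        return multiplier * (os.cpu_count() or 1)

    print(suggested_children())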

With the proviso that I am not an expert in modern-day OpenSIPS concurrency innards, the common OpenSER heritage prescribes a preforked worker process pool with SysV shared memory for inter-process communication (IPC). Like any shared memory space, this requires mutex locking so that multiple threads (in this case, processes) don't access/modify the same data structures at the same time* in ways that step on the others. Because every process holds and waits on these locks, this model works well when there aren't very many processes and their path to execution is mostly clear and not especially volatile, and when as little data is shared as possible. If you add a lot of processes, then there's a lot of fighting among them for internal locks and for CPU time, even if the execution cycle per se is fairly efficient. If you have 16 cores and 128 child processes, those processes are going to be fighting for those cores if they execute efficiently, while suffering from some amount of internal concurrency gridlock if they are not executing efficiently. Thus, 128 is for almost all cases very far beyond the sweet spot.

By analogy, think of a large multi-lane highway where almost all cars travel at more or less a constant speed, and, vitally, almost always stay in their lane, only very seldom making a lane change. As anyone who has ever been stuck in traffic knows, small speed changes by individual actors or small groups of cars can set off huge compression waves that have impact for miles back, and lane changes also have accordion effects. It's not a perfect analogy by any means, but it kind of conveys some sense of the general problem of contention. You really want to keep the "lanes" clear and eliminate all possible sources of friction, variance, and overlap.

* For exclusion purposes; of course, there's no such thing as truly simultaneous execution.
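
If anyone wants to see that contention dynamic in miniature, here's a deliberately crude sketch in plain Python (nothing to do with OpenSIPS internals, and the numbers are purely illustrative): a fixed amount of work is split across N workers, but every operation has to pass through one shared lock, so piling on workers mostly adds fighting rather than throughput.

    import multiprocessing as mp
    import time

    TOTAL_OPS = 400_000  # fixed overall workload, divided among the workers

    def worker(lock, counter, iterations):
        # Every increment takes the one shared lock, much as preforked
        # workers take locks around shared-memory structures.
        for _ in range(iterations):
            with lock:
                counter.value += 1

    def run(num_workers):
        lock = mp.Lock()
        counter = mp.Value("i", 0, lock=False)  # guarded by our own lock
        per_worker = TOTAL_OPS // num_workers
        procs = [mp.Process(target=worker, args=(lock, counter, per_worker))
                 for _ in range(num_workers)]
        start = time.perf_counter()
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return time.perf_counter() - start

    if __name__ == "__main__":
        for n in (2, 8, 32, 128):
            print(f"{n:>3} workers: {run(n):.2f}s")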

The Description for Asynchronous Statements is extremely tempting and was what started me down that path; it might be missing a qualification that Async can be an improvement for slow blocking operations, but the additional overhead may be a disadvantage for very fast blocking operations.

There is indeed a certain amount of overhead in pushing data around multiple threads, the locking of shared data structures involved in doing so, etc. For slow, blocking operations, there's nevertheless an advantage, but if the operations aren't especially blocking, in many cases all that "async" stuff is just extra overhead.
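
A toy illustration of that trade-off in plain asyncio (nothing OpenSIPS-specific, numbers purely illustrative): handing a blocking call to a thread pool carries a fixed dispatch cost, which is the whole story for a microsecond operation but is noise next to a ~50 ms one, where overlapping the calls is the clear win.

    import asyncio
    import time

    def fast_op():
        return sum(range(1000))   # finishes in microseconds

    def slow_op():
        time.sleep(0.05)          # stands in for a slow blocking query
        return 1

    async def main():
        loop = asyncio.get_running_loop()

        start = time.perf_counter()
        for _ in range(2000):
            fast_op()
        print(f"fast, called directly: {time.perf_counter() - start:.3f}s")

        start = time.perf_counter()
        for _ in range(2000):
            await loop.run_in_executor(None, fast_op)
        print(f"fast, via thread pool: {time.perf_counter() - start:.3f}s")

        start = time.perf_counter()
        for _ in range(20):
            slow_op()
        print(f"slow, called directly: {time.perf_counter() - start:.3f}s")

        start = time.perf_counter()
        await asyncio.gather(*(loop.run_in_executor(None, slow_op)
                               for _ in range(20)))
        print(f"slow, overlapped:      {time.perf_counter() - start:.3f}s")

    asyncio.run(main())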

Asynchronous tricks which deputise notification of further work or I/O readiness to the kernel can be quite efficient, simply because life in kernel space is efficient. But async execution in user space requires user-space contrivances that suffer from all the usual problems of user space in turn, so the economics can be really different. Mileage of course varies greatly with the implementation details.
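
For the curious, the kernel-delegated flavour is what e.g. epoll provides on Linux. A bare-bones readiness loop, sketched here with Python's selectors module purely to show the mechanism (this is not a SIP server), looks like this:

    import selectors
    import socket

    sel = selectors.DefaultSelector()   # epoll-backed on Linux

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))         # arbitrary port, for illustration
    sock.setblocking(False)
    sel.register(sock, selectors.EVENT_READ)
    print("listening on", sock.getsockname())

    while True:
        # The kernel tells us which registered sockets have work waiting;
        # we only touch those, instead of polling or blocking per socket.
        for key, _events in sel.select(timeout=1.0):
            data, addr = key.fileobj.recvfrom(65535)
            print(f"{len(data)} bytes from {addr}")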

-- Alex

--
Alex Balashov | Principal | Evariste Systems LLC

Tel: +1-706-510-6800 / +1-800-250-5920 (toll-free)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/

_______________________________________________
Users mailing list
Users@lists.opensips.org
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
