Hi Calvin,

I'm really glad you were able to get things sorted out, and I apologise if the thread got testy. I do appreciate your follow-up, which I think will benefit readers looking for similar answers.

A few inline thoughts:

On 6/15/20 4:04 PM, Calvin Ellison wrote:

I attempted to reproduce the original breakdown around 3000 CPS using the default 212992 byte receive buffer and could not, which tells me I broke a cardinal rule of load testing and changed more than one thing at a time. Also, don't do load testing when tired. I suspect that I had also made a change to the sipp scenario recv/sched loops, or I had unknowingly broken something while checking out the tuned package.
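
As an aside, that 212992 figure is just the stock net.core.rmem_default on most recent kernels. If anyone reading along later wants to sanity-check what their own box hands out, a quick sketch along these lines (plain Python, nothing OpenSIPS-specific) shows both the sysctl value and what a freshly created UDP socket actually reports:

    import socket

    # The kernel's default receive buffer size, in bytes.
    with open("/proc/sys/net/core/rmem_default") as f:
        print("net.core.rmem_default:", f.read().strip())

    # What a fresh UDP socket reports for SO_RCVBUF; on an untuned system
    # this will normally match the sysctl above.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    print("SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    s.close()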

In several decades of doing backend systems programming, I've not found tuning Linux kernel defaults to be generally fruitful for improving throughput to any non-trivial degree. The defaults are sensible for almost all use-cases, all the more so given modern hardware and multi-core processors and the rest.

This is in sharp contrast to the conservative defaults some applications (e.g. Apache, MySQL) ship with on many distributions. I think the idea behind such conservative settings is to constrain the application so that in the event of a DDoS or similar event, it does not take over all available hardware resources, which would impede response and resolution.

But on the kernel settings, the only impactful changes I have ever seen are minor adjustments that slightly improve very niche, system-wide server load problems (e.g. related to I/O scheduling, NIC issues, storage). This wasn't that kind of scenario.

In most respects, it just follows from first principles and Occam's Razor, IMHO. There's no reason for kernels to ship tuned so unnecessarily conservatively as to deny average users something on the order of _several times'_ more performance from their hardware; any effort to do so would be readily apparent and, it stands to reason, staunchly opposed. It therefore also stands to reason that there isn't some silver bullet or magic setting that unlocks multiplicative performance gains if only one knows the secret sauce or thinks to tweak it, for the simple reason that if such a tweak existed, the limit it works around would long since have been rationalised away, absent a clear and persuasive basis for such an artificial and contrived limit to exist. I cannot conceive of what such a basis would look like, and I'd like to think that's not just a failure of imagination.

Or, in other words, it goes with the commonsense intuition that if it seems too good to be true, it is. The fundamentals of the application, and to a lesser but still very significant extent the hardware (in terms of its relative homogeneity nowadays), determine 99.9% of the performance characteristics, and matter a thousand times more than anything one can tweak.

I deeply appreciate Alex's insistence that I was wrong and that I should keep digging. I am happy to retract my claim regarding "absolutely terrible sysctl defaults". Using synchronous/blocking DB queries, the 8-core server reached 14,000 CPS, at which point I declared it fixed and went to bed. It could probably go higher: there's only one DB query with a <10ms response time, Memcache for the query response, and some logic to decide how to respond. There's only a single non-200 final response, so it's probably as minimalist as it gets.

I would agree that with such a minimal call processing loop and a generous number of CPU cores, you shouldn't be terribly limited.

If anyone else is trying to tune their setup, I think Alex's advice to "not run more than 2 * (CPU threads) [children]" is the best place to start. I had inherited this project from someone else's work under version 1.11, and they had used 128 children. They were using remote DB servers with much higher latency than the local DBs we have today, so that might have been the reason. Or they were just wrong to begin with.

Aye. Barring a workload consisting of exceptionally high-latency blocking service queries, there's really no valid reason to ever have that many child processes, and even if one does have such a workload, there are plenty of reasons to attack the underlying latency problem rather than work around it with more child processes.
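
For the archives, the arithmetic behind that rule of thumb is a one-liner; something along these lines (purely illustrative, not anything shipped with OpenSIPS) gives a sane starting cap to tune from:

    import os

    # Rule of thumb from this thread: cap child processes at roughly
    # 2x the available hardware threads, then adjust from measurement.
    def suggested_children(multiplier=2):
        return multiplier * (os.cpu_count() or 1)

    print(suggested_children())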

With the proviso that I am not an expert in modern-day OpenSIPS concurrency innards, the common OpenSER heritage prescribes a preforked worker process pool with SysV shared memory for inter-process communication (IPC). Like any shared memory space, this requires mutex locking so that multiple threads (in this case, processes) don't access/modify the same data structures at the same time* in ways that step on the others. Because every process holds and waits on these locks, this model works well when there aren't very many processes and their path to execution is mostly clear and not especially volatile, and when as little data is shared as possible. If you add a lot of processes, then there's a lot of fighting among them for internal locks and for CPU time, even if the execution cycle per se is fairly efficient. If you have 16 cores and 128 child processes, those processes are going to be fighting for those cores if they execute efficiently, while suffering from some amount of internal concurrency gridlock if they are not executing efficiently. Thus, 128 is for almost all cases very far beyond the sweet spot.

By analogy, think of a large multi-lane highway where almost all cars travel at more or less a constant speed, and, vitally, almost always stay in their lane, only very seldom making a lane change. As anyone who has ever been stuck in traffic knows, small speed changes by individual actors or small groups of cars can set off huge compression waves that have impact for miles back, and lane changes also have accordion effects. It's not a perfect analogy by any means, but it kind of conveys some sense of the general problem of contention. You really want to keep the "lanes" clear and eliminate all possible sources of friction, variance, and overlap.

* For exclusion purposes; of course, there's no such thing as truly simultaneous execution.
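
If anyone wants to see that contention dynamic in miniature, here's a deliberately crude sketch in plain Python (nothing to do with OpenSIPS internals, and the numbers are purely illustrative): a fixed amount of work is split across N workers, but every operation has to pass through one shared lock, so piling on workers mostly adds fighting rather than throughput.

    import multiprocessing as mp
    import time

    TOTAL_OPS = 400_000  # fixed overall workload, divided among the workers

    def worker(lock, counter, iterations):
        # Every increment takes the one shared lock, much as preforked
        # workers take locks around shared-memory structures.
        for _ in range(iterations):
            with lock:
                counter.value += 1

    def run(num_workers):
        lock = mp.Lock()
        counter = mp.Value("i", 0, lock=False)  # guarded by our own lock
        per_worker = TOTAL_OPS // num_workers
        procs = [mp.Process(target=worker, args=(lock, counter, per_worker))
                 for _ in range(num_workers)]
        start = time.perf_counter()
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return time.perf_counter() - start

    if __name__ == "__main__":
        for n in (2, 8, 32, 128):
            print(f"{n:>3} workers: {run(n):.2f}s")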

The Description for Asynchronous Statements is extremely tempting and was what started me down that path; it might be missing a qualification that Async can be an improvement for slow blocking operations, but the additional overhead may be a disadvantage for very fast blocking operations.

There is indeed a certain amount of overhead in pushing data around multiple threads, the locking of shared data structures involved in doing so, etc. For slow, blocking operations, there's nevertheless an advantage, but if the operations aren't especially blocking, in many cases all that "async" stuff is just extra overhead.
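
A toy illustration of that trade-off in plain asyncio (nothing OpenSIPS-specific, numbers purely illustrative): handing a blocking call to a thread pool carries a fixed dispatch cost, which is the whole story for a microsecond operation but is noise next to a ~50 ms one, where overlapping the calls is the clear win.

    import asyncio
    import time

    def fast_op():
        return sum(range(1000))   # finishes in microseconds

    def slow_op():
        time.sleep(0.05)          # stands in for a slow blocking query
        return 1

    async def main():
        loop = asyncio.get_running_loop()

        start = time.perf_counter()
        for _ in range(2000):
            fast_op()
        print(f"fast, called directly: {time.perf_counter() - start:.3f}s")

        start = time.perf_counter()
        for _ in range(2000):
            await loop.run_in_executor(None, fast_op)
        print(f"fast, via thread pool: {time.perf_counter() - start:.3f}s")

        start = time.perf_counter()
        for _ in range(20):
            slow_op()
        print(f"slow, called directly: {time.perf_counter() - start:.3f}s")

        start = time.perf_counter()
        await asyncio.gather(*(loop.run_in_executor(None, slow_op)
                               for _ in range(20)))
        print(f"slow, overlapped:      {time.perf_counter() - start:.3f}s")

    asyncio.run(main())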

Asynchronous tricks which deputise notification of further work or I/O readiness to the kernel can be quite efficient, simply because life in kernel space is efficient. But async execution in user space requires user-space contrivances that suffer from all the usual problems of user space in turn, so the economics can be really different. Mileage of course varies greatly with the implementation details.
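
For the curious, the kernel-delegated flavour is what e.g. epoll provides on Linux. A bare-bones readiness loop, sketched here with Python's selectors module purely to show the mechanism (this is not a SIP server), looks like this:

    import selectors
    import socket

    sel = selectors.DefaultSelector()   # epoll-backed on Linux

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))         # arbitrary port, for illustration
    sock.setblocking(False)
    sel.register(sock, selectors.EVENT_READ)
    print("listening on", sock.getsockname())

    while True:
        # The kernel tells us which registered sockets have work waiting;
        # we only touch those, instead of polling or blocking per socket.
        for key, _events in sel.select(timeout=1.0):
            data, addr = key.fileobj.recvfrom(65535)
            print(f"{len(data)} bytes from {addr}")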

-- Alex

--
Alex Balashov | Principal | Evariste Systems LLC

Tel: +1-706-510-6800 / +1-800-250-5920 (toll-free)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/

_______________________________________________
Users mailing list
Users@lists.opensips.org
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
