>Number:         4318
>Category:       os-solaris
>Synopsis:       Run queue spikes occur with many instances of Apache (parent procs become synchronised)
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    apache
>State:          open
>Class:          sw-bug
>Submitter-Id:   apache
>Arrival-Date:   Wed Apr 28 00:00:01 PDT 1999
>Last-Modified:
>Originator:     [EMAIL PROTECTED]
>Organization:
apache
>Release:        1.3.4
>Environment:
Solaris 2.5.1 on Ultra-Enterprise server system.
Compiler unknown - not a compiler issue.
>Description:
This problem is occurring at a customer site that is running 1000 separate apache instances on a large Sun server (web hosting service). They are using our ShareII resource management product to provide service guarantees to separate client domains (which is why there are so many servers: 1 per customer).

The parent procs become synchronised because of the kernel implementation of the waitpid(2) call (as used in main/http_main.c:wait_or_timeout()) and other kernel internals. The effect is to produce very large run-queue spikes (400 or more) when the synchronised parents intersect the run-queue sampling code. The run-queue spikes cause other daemons (eg sendmail) to behave strangely.

While this is essentially a kernel implementation problem, it is triggered only by the apache parent implementation.
>How-To-Repeat:
I can supply some test code or put you in contact with our customer if needed. I can also supply sar output, truss output and kernel traces if you think that will help :-)
>Fix:
I have two suggestions - one easy, one a little more difficult.

The more difficult but "correct" approach is to utilise the SIGCHLD signal in the parent to set a "child is dead" flag and interrupt the scoreboard maintenance sleep. The waitpid() call should only be made if a SIGCHLD has been received. This approach will work on just about all variants of Unix and is not specific to Solaris.
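As a rough illustration of the SIGCHLD approach, here is a minimal sketch in C. It is not taken from http_main.c: the names child_died, on_sigchld, install_sigchld_handler and wait_or_timeout_sketch are hypothetical, the interval is passed in as a parameter rather than read from SCOREBOARD_MAINTENANCE_INTERVAL, and plain select() stands in for ap_select(). The handler is installed without SA_RESTART on the assumption that the sleep should return early with EINTR when a child exits.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* "Child is dead" flag. It is written by the signal handler, so it must be
 * a volatile sig_atomic_t. (Hypothetical name.) */
static volatile sig_atomic_t child_died = 0;

static void on_sigchld(int signo)
{
    (void)signo;
    child_died = 1;
}

/* Install the SIGCHLD handler. SA_RESTART is deliberately not set, so a
 * sleeping select() is interrupted when a child exits. Returns 0 on success. */
int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NOCLDSTOP;   /* ignore stopped children, only exits */
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Sleep for up to interval_usec, but call waitpid() only if SIGCHLD has been
 * seen. Returns the pid of a reaped child, or -1 if there was nothing to reap. */
pid_t wait_or_timeout_sketch(long interval_usec, int *status)
{
    struct timeval tv;
    pid_t pid;

    if (!child_died) {
        tv.tv_sec = interval_usec / 1000000;
        tv.tv_usec = interval_usec % 1000000;
        /* Returns early with EINTR if SIGCHLD arrives while sleeping. */
        select(0, NULL, NULL, NULL, &tv);
    }
    if (!child_died)
        return -1;                /* no SIGCHLD seen: skip waitpid() entirely */

    child_died = 0;               /* clear before reaping so no exit is lost */
    pid = waitpid(-1, status, WNOHANG);
    if (pid > 0) {
        /* SIGCHLD is not queued, so several children may have exited for
         * one signal. Leave the flag set so the next call polls again. */
        child_died = 1;
        return pid;
    }
    return -1;
}
```

A signal that arrives between the flag test and the select() call is not lost, but it is only noticed after one full interval. If that delay matters, pselect() with SIGCHLD blocked outside the sleep would close the window.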
The quick-and-dirty approach is to add some random jitter to the timeout period (SCOREBOARD_MAINTENANCE_INTERVAL) in main/http_main.c:wait_or_timeout(). I have tried out the following code, which adds a tunable amount of jitter:

old:

    tv.tv_sec = SCOREBOARD_MAINTENANCE_INTERVAL / 1000000;
    tv.tv_usec = SCOREBOARD_MAINTENANCE_INTERVAL % 1000000;
    ap_select(0, NULL, NULL, NULL, &tv);

new:

    #define JITTER_PERCENT 10 /* Actual delay will be plus or minus this much */
    {
        time_t delaytime = SCOREBOARD_MAINTENANCE_INTERVAL;
        static int seeded;
        static unsigned int seed;

        if (!seeded) {
            ++seeded;
            seed = getpid();
        }
        /* delaytime +/- selected randomness avoiding overflow and unsigned arith */
        delaytime += (((long)(rand_r(&seed) * (delaytime >> 8)) >> 6)
                      - (long)delaytime) / (100 / JITTER_PERCENT);
        tv.tv_sec = delaytime / 1000000;
        tv.tv_usec = delaytime % 1000000;
    }
    ap_select(0, NULL, NULL, NULL, &tv);

Sleeping for a (uniformly distributed) random time should break up the convoys of synchronised apache parents.
>Audit-Trail:
>Unformatted:
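The shift-based scaling in the jitter code above appears to assume a 15-bit rand_r() range (RAND_MAX of 32767, as on Solaris); with a larger RAND_MAX the multiplication could overflow. A sketch of the same plus-or-minus percentage jitter that scales by RAND_MAX instead might look like the following. The function name jittered_interval and its parameters are hypothetical and not part of Apache, and the use of floating point is just one way to avoid the overflow.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdlib.h>

/* Return base_usec adjusted by a roughly uniform random offset in the range
 * [-percent%, +percent%] of base_usec. The caller owns the rand_r() seed,
 * for example seeded once from getpid() as in the report.
 * (Hypothetical helper, not part of http_main.c.) */
long jittered_interval(long base_usec, int percent, unsigned int *seed)
{
    long span = base_usec / 100 * percent;   /* largest offset, in usec */
    double unit;

    if (span <= 0)
        return base_usec;                    /* jitter disabled or too small */

    /* Map rand_r() onto [0, 1) whatever RAND_MAX happens to be. */
    unit = (double)rand_r(seed) / ((double)RAND_MAX + 1.0);

    /* Pick an offset in [0, 2*span] and centre it on base_usec. */
    return base_usec - span + (long)(unit * (2.0 * (double)span + 1.0));
}
```

With a base of 1000000 and 10 percent, every returned delay falls between 900000 and 1100000 microseconds inclusive, and the result can be split into tv_sec and tv_usec exactly as in the code above.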