Ian Holsman wrote:
>
> Hi Greg,
>
> we are about to start into the wild wild world of linux, and I was
> wondering if you have any hints on what patches you would go with for
> a custom kernel to get maximum performance.. stuff like ingo's O(1)
> scheduler and the like..
I'm glad you asked, since I've been looking at scalability issues with
Linux lately.  Sorry for the long post - hit "delete" if you aren't
interested.

We did some Linux benchmarking in a configuration similar to a reverse
proxy, which takes disk file I/O and directory_walk out of the picture.
We started with the 2.0.40 worker MPM on Red Hat 7.2 with a 2.4.9 kernel
on a 2-way SMP.  We tried numerous combinations of ThreadsPerChild and
process limits such that there were always a constant 1000 worker threads
active in the server, maxed out the CPUs, and ran oprofile 0.3.  We got
the best throughput at 200 threads per process.

Then I took the oprofile sample counts for every binary/library that used
over 1% of the CPU, broke those down by function, and scaled the results
to the throughput.  The results should be proportional to CPU cycles per
request by function.  Here are the heavy hitters:

        threads per process
       20      200      500   binary/library   function name
   ------   ------   ------   --------------   --------------------
   230247    38334    66517   kernel           schedule
     1742    31336    75763   libpthread       __pthread_alt_unlock
     7447     6563     7341   libc             memcpy
     6661     6019     7109   libc             _IO_vfscanf
     5614     5388     5893   libc             strlen
    88825     5060    14296   kernel           stext_lock
     1276     4043     6210   kernel           do_select
     3994     3933     4239   kernel           tcp_sendmsg
     2761     3917     4071   libc             chunk_alloc
     4285     3606     3829   libapr           apr_palloc

   disclaimer: the tests weren't rigidly controlled

The 5 and 6 digit numbers above are the most bothersome.

Bill S and Jeff told me about Ingo Molnar's O(1) scheduler patch.
Reviewing the code before and after this patch, I believe it will make a
huge improvement in schedule() cycles.  The older scheduler spends a lot
of time looping through the entire run queue comparing "goodness" values
in order to decide which task (process or thread) is best to dispatch.
That's gone with Ingo's O(1) patch: it waits until a task loses its time
slice to recompute an internal priority value, and it picks the highest
priority ready task using just a few instructions.  It turns out that
Red Hat 7.3 and Red Hat Advanced Server 2.1 already include this patch,
so this one should be easy to solve.  I don't know about other distros.

__pthread_alt_unlock() loops through a list of threads that are blocked
on a mutex to find the thread with the highest priority.  I don't know
which mutex this is; I'm guessing it's the one associated with worker's
condition variable.  The ironic thing is that for httpd all the threads
have the same priority AFAIK, so these cycles don't do us much good.  I'm
not aware of a patch to improve this, so I think our choices for
scalability in the meantime are keeping ThreadsPerChild very low or
switching to prefork.

The stext_lock cycles mean that we're getting contention on some kernel
spinlock.  I don't know which lock yet.  The scheduler uses spinlocks on
SMPs, and the stext_lock cycles above roughly track the scheduler cycles,
so I'm hoping that's it.  There's a tool called lockmeter on SourceForge
that can provide spinlock usage statistics in case those cycles don't go
away with the O(1) scheduler patch.

Since we also had problems getting core dumps reliably with a threaded
MPM, in addition to the pthread scalability issue, we decided to switch
over to prefork.  That gives better SPECweb99 results at the moment too.
Then we started hitting seg faults in pthread initialization code in
child processes during fork() when trying to start 2000 processes.  It
turns out that dmesg had tons of "ldt allocation failure" messages.
linuxthreads uses the ldt on i386 to address its structure representing
the current thread.
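If anyone wants to see the ldt cost for themselves without building
httpd, a toy along these lines ought to show it.  This is just a sketch,
not our benchmark harness; the names are made up, and it assumes your
glibc's linuxthreads gives the initial thread an ldt entry the way ours
appears to:

/* ldt_toy.c: fork a pile of do-nothing children from a libpthread-linked
 * binary, roughly what a prefork httpd does when apr is built with
 * threads.  No threads are ever created; linking against libpthread is
 * the part that matters here.
 *
 *   gcc -o ldt_toy ldt_toy.c -lpthread
 *   ./ldt_toy 2000    (then watch free and dmesg)
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int nkids = (argc > 1) ? atoi(argv[1]) : 2000;
    int i;

    for (i = 0; i < nkids; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            /* if kernel memory for ldts runs out, dmesg will say so */
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* child: sit idle, like a prefork child waiting for work */
            pause();
            _exit(0);
        }
    }
    printf("forked %d children; check free and dmesg, then killall ldt_toy\n",
           nkids);
    pause();
    return 0;
}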
Since apr is linked with threads enabled by default on Linux, each child
is assigned a 64K ldt out of kernel memory (8K entries of 8 bytes each)
with only one entry, for thread 0, actually used.  64K may not seem like
much these days, but on one machine we had 900M of RAM (according to
free) and were trying for 10,000 concurrent connections, which works out
to 90K of RAM per process with prefork.

Configuring apr with --disable-threads makes the ldts a non-issue, but
that raises a concern about the reliability of binaries when 3rd party
modules such as Vignette are threaded.  It's pretty easy to patch the
kernel to reduce the size of the ldts from the i386 architected max of
8K entries each.  That reduces the maximum number of threads per process
(which might not be a bad thing for httpd at the moment), and of course
a lot of users will be unwilling to rebuild the Linux kernel.  With
either --disable-threads in apr or the ldts limited to 256 entries in
the kernel, it's no problem starting 10,000 child processes.  You can
also give the kernel a bigger chunk of RAM, but I decided not to take
memory away from user space on the box with 900M.

I've heard rumors of a patch that makes coredumps more reliable with
threads, but I don't know any details.  If that pans out, maybe the
answer for scalability + reliability + no user custom builds is to go
with worker with a small ThreadsPerChild number.

Greg
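P.S.  For anyone wondering why the __pthread_alt_unlock cycles climb
with ThreadsPerChild: every unlock walks the whole list of blocked
threads looking for the highest-priority waiter.  The sketch below is
just the shape of that pattern with made-up names, not the real
linuxthreads source:

/* waiter_scan.c: the O(n)-per-wakeup pattern, with made-up names.
 * With all httpd workers at the same priority the scan never changes
 * the answer, so the cycles are pure overhead and they grow linearly
 * with the number of threads parked on the mutex.
 *
 *   gcc -O2 -o waiter_scan waiter_scan.c
 *   ./waiter_scan 500
 */
#include <stdio.h>
#include <stdlib.h>

struct waiter {
    struct waiter *next;
    int priority;
};

/* what one "unlock" does: scan every waiter for the best priority */
static struct waiter *pick_highest(struct waiter *head)
{
    struct waiter *best = head, *w;

    for (w = head; w != NULL; w = w->next)
        if (w->priority > best->priority)
            best = w;
    return best;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 500;   /* think ThreadsPerChild */
    struct waiter *head = NULL;
    int i;

    if (n < 1)
        n = 1;
    for (i = 0; i < n; i++) {
        struct waiter *w = malloc(sizeof(*w));
        w->priority = 0;            /* all workers equal priority, like httpd */
        w->next = head;
        head = w;
    }
    printf("one wakeup touched all %d waiters to pick priority %d\n",
           n, pick_highest(head)->priority);
    return 0;
}

Keeping ThreadsPerChild small keeps that list short, which is also why
prefork sidesteps it entirely.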