Ian Holsman wrote:
>
> Hi Greg,
>
> we are about to start into the wild wild world of linux, and I was
> wondering if you have any hints on what patches you would go with for
> a custom kernel to get maximum performance.. stuff like ingo's O(1)
> scheduler and the like..
I'm glad you asked, since I've been looking at scalability issues with
Linux lately.  Sorry for the long post - hit "delete" if you aren't
interested.

We did some Linux benchmarking in a configuration similar to a reverse
proxy, which takes disk file I/O and directory_walk out of the picture.
We started with the 2.0.40 worker MPM on Red Hat 7.2 with a 2.4.9 kernel
on a 2-way SMP.  We tried numerous combinations of ThreadsPerChild and
process limits such that there were always a constant 1000 worker threads
active in the server, maxed out the CPUs, and ran oprofile 0.3.  We got
the best throughput at 200 threads per process.

Then I took the oprofile sample counts for every binary/library that used
over 1% of the CPU, broke those down by function, and scaled the results
to the throughput.  The results should be proportional to CPU cycles per
request by function.  Here are the heavy hitters:

        threads per process
       20      200      500   binary/library   function name
   ------   ------   ------   --------------   --------------------
   230247    38334    66517   kernel           schedule
     1742    31336    75763   libpthread       __pthread_alt_unlock
     7447     6563     7341   libc             memcpy
     6661     6019     7109   libc             _IO_vfscanf
     5614     5388     5893   libc             strlen
    88825     5060    14296   kernel           stext_lock
     1276     4043     6210   kernel           do_select
     3994     3933     4239   kernel           tcp_sendmsg
     2761     3917     4071   libc             chunk_alloc
     4285     3606     3829   libapr           apr_palloc

   disclaimer: the tests weren't rigidly controlled

The 5 and 6 digit numbers above are the most bothersome.

Bill S and Jeff told me about Ingo Molnar's O(1) scheduler patch.
Reviewing the code before and after this patch, I believe it will make a
huge improvement in schedule() cycles.  The older scheduler spends a lot
of time looping through the entire run queue comparing "goodness" values
in order to decide which task (process or thread) is best to dispatch.
That's gone with Ingo's O(1) patch: it waits until a task loses its time
slice to recompute an internal priority value, and it picks the highest
priority ready task using just a few instructions.  It turns out that
Red Hat 7.3 and Red Hat Advanced Server 2.1 already include this patch,
so this one should be easy to solve.  I don't know about other distros.

__pthread_alt_unlock() loops through a list of threads that are blocked
on a mutex to find the thread with the highest priority.  I don't know
which mutex this is; I'm guessing it's the one associated with worker's
condition variable.  The ironic thing is that for httpd all the threads
have the same priority AFAIK, so these cycles don't do us much good.  I'm
not aware of a patch to improve this, so I think our choices for
scalability in the meantime are keeping ThreadsPerChild very low or
switching to prefork.

The stext_lock cycles mean that we're getting contention on some kernel
spinlock.  I don't know which lock yet.  The scheduler uses spinlocks on
SMPs, and the stext_lock cycles above roughly track the scheduler cycles,
so I'm hoping that's it.  There's a tool called lockmeter on SourceForge
that can provide spinlock usage statistics in case those cycles don't go
away with the O(1) scheduler patch.

Since we also had problems getting core dumps reliably with a threaded
MPM, in addition to the pthread scalability issue, we decided to switch
over to prefork.  That gives better SPECweb99 results at the moment too.
Then we started hitting seg faults in pthread initialization code in
child processes during fork() when trying to start 2000 processes.  It
turns out that dmesg had tons of "ldt allocation failure" messages.
linuxthreads uses the ldt on i386 to address its structure representing
the current thread.
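If anyone wants to see the ldt cost for themselves without building
httpd, a toy along these lines ought to show it.  This is just a sketch,
not our benchmark harness; the names are made up, and it assumes your
glibc's linuxthreads gives the initial thread an ldt entry the way ours
appears to:

/* ldt_toy.c: fork a pile of do-nothing children from a libpthread-linked
 * binary, roughly what a prefork httpd does when apr is built with
 * threads.  No threads are ever created; linking against libpthread is
 * the part that matters here.
 *
 *   gcc -o ldt_toy ldt_toy.c -lpthread
 *   ./ldt_toy 2000    (then watch free and dmesg)
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int nkids = (argc > 1) ? atoi(argv[1]) : 2000;
    int i;

    for (i = 0; i < nkids; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            /* if kernel memory for ldts runs out, dmesg will say so */
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* child: sit idle, like a prefork child waiting for work */
            pause();
            _exit(0);
        }
    }
    printf("forked %d children; check free and dmesg, then killall ldt_toy\n",
           nkids);
    pause();
    return 0;
}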
Since apr is linked with threads enabled by default on Linux, each child
is assigned a 64K ldt out of kernel memory (8K entries of 8 bytes each)
with only one entry, for thread 0, actually used.  64K may not seem like
much these days, but on one machine we had 900M of RAM (according to
free) and were trying for 10,000 concurrent connections, which works out
to 90K of RAM per process with prefork.

Configuring apr with --disable-threads makes the ldts a non-issue, but
that raises a concern about the reliability of binaries when 3rd party
modules such as Vignette are threaded.  It's pretty easy to patch the
kernel to reduce the size of the ldts from the i386 architected max of
8K entries each.  That reduces the maximum number of threads per process
(which might not be a bad thing for httpd at the moment), and of course
a lot of users will be unwilling to rebuild the Linux kernel.  With
either --disable-threads in apr or the ldts limited to 256 entries in
the kernel, it's no problem starting 10,000 child processes.  You can
also give the kernel a bigger chunk of RAM, but I decided not to take
memory away from user space on the box with 900M.

I've heard rumors of a patch that makes coredumps more reliable with
threads, but I don't know any details.  If that pans out, maybe the
answer for scalability + reliability + no user custom builds is to go
with worker with a small ThreadsPerChild number.

Greg
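P.S.  For anyone wondering why the __pthread_alt_unlock cycles climb
with ThreadsPerChild: every unlock walks the whole list of blocked
threads looking for the highest-priority waiter.  The sketch below is
just the shape of that pattern with made-up names, not the real
linuxthreads source:

/* waiter_scan.c: the O(n)-per-wakeup pattern, with made-up names.
 * With all httpd workers at the same priority the scan never changes
 * the answer, so the cycles are pure overhead and they grow linearly
 * with the number of threads parked on the mutex.
 *
 *   gcc -O2 -o waiter_scan waiter_scan.c
 *   ./waiter_scan 500
 */
#include <stdio.h>
#include <stdlib.h>

struct waiter {
    struct waiter *next;
    int priority;
};

/* what one "unlock" does: scan every waiter for the best priority */
static struct waiter *pick_highest(struct waiter *head)
{
    struct waiter *best = head, *w;

    for (w = head; w != NULL; w = w->next)
        if (w->priority > best->priority)
            best = w;
    return best;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 500;   /* think ThreadsPerChild */
    struct waiter *head = NULL;
    int i;

    if (n < 1)
        n = 1;
    for (i = 0; i < n; i++) {
        struct waiter *w = malloc(sizeof(*w));
        w->priority = 0;            /* all workers equal priority, like httpd */
        w->next = head;
        head = w;
    }
    printf("one wakeup touched all %d waiters to pick priority %d\n",
           n, pick_highest(head)->priority);
    return 0;
}

Keeping ThreadsPerChild small keeps that list short, which is also why
prefork sidesteps it entirely.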