On 5/18/05, Hubert Chan <[EMAIL PROTECTED]> wrote:
> >>>>> "Michael" == Michael K Edwards <[EMAIL PROTECTED]> writes:

> Commenting out those lines, and compiling multi-threaded, gives
> performance similar to the single-threaded case.  So what does this
> mean?  I doubt that Ryan will want to disable THREAD_LOCAL_ALLOC
> Debian-wide.
It means someone ought to beat on the spin-then-queue locking
implementation enabled by THREAD_LOCAL_ALLOC until it isn't retrograde
for the common single-threaded case.  That's really a job for oprofile,
which I'm starting to get spun up on now; but code inspection, informed
by some knowledge about NPTL, might be enough.

By the way, if you want to use oprofile, you might as well use the 0.8.2
release.  "apt-get source oprofile" will get you 0.8.1; grab the 0.8.2
upstream tarball and unpack it, grab the 0.8.2 release notes and put
them in ./ReleaseNotes, copy over ./AUTHORS and ./debian from the 0.8.1
tree, add a debian/changelog entry, run ./autogen.sh (use autoconf 2.59
and automake 1.7.9), propagate the doc fixes if you want, and run
"dpkg-buildpackage -rfakeroot"; then you're good to go.  The oprofile
module is part of stock 2.6.x kernels; you have to rebuild with
install_vmlinux set in /etc/kernel-pkg.conf if you want kernel
profiling, but for userspace work the stock kernel should be fine.

> I also tried compiling with THREAD_LOCAL_ALLOC, but using
> GC_local_malloc instead of GC_malloc, but performance is similar to
> just using GC_malloc.

From http://www.hpl.hp.com/personal/Hans_Boehm/gc/scale.html :

<quote>
The easiest way to switch an application to thread-local allocation is to

1. Define the macro GC_REDIRECT_TO_LOCAL, and then include the gc.h
   header in each client source file.
2. Invoke GC_thr_init() before any allocation.
3. Allocate using GC_MALLOC, GC_MALLOC_ATOMIC, and/or GC_GCJ_MALLOC.
</quote>

Oddly, -DPARALLEL_MARK may improve the situation for UP thread-local
allocation, because it results in the use of an implementation of
GC_malloc_many (used to refill thread-local free lists) that may be
better tuned for thread-local usage patterns (as well as more
concurrent).  Care to give that a shot?

Cheers,
- Michael

