hey all.
i've committed the new memory allocator to glib HEAD now.
it replaces the old slow and memory bloated memchunk implementation
and should provide significant speedups and memory savings over
using the system's malloc()/free() implementation.
gslice.c contains a large comment which outlines the basic
allocator structure. i'm appending that to this email for the
interested reader, and the really curious ones can go on to read
the two slab allocator papers mentioned in the comment which
served as a basis for the implementation ;)
to sum it up, the new slab allocator uses posix_memalign(3) and free(3)
to optimize allocations of many equally sized chunks, and we now have
per-thread free lists (the magazine layer) to quickly satisfy allocation
requests of already known structure sizes.
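to illustrate the basic idea (this is a minimal sketch, not the actual
gslice.c code; the function name and the 4096 byte slab size are made up
for illustration), a page-aligned slab obtained via posix_memalign(3) can
be carved into equally sized chunks like this:

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdint.h>

#define SLAB_SIZE 4096  /* assumed slab size for this sketch */

/* hypothetical helper: allocate one aligned slab and carve it into
 * equally sized chunks; returns the slab base for later free(3) */
static void *
slab_alloc (size_t chunk_size, void **chunks, size_t *n_chunks)
{
  void *slab = NULL;
  if (posix_memalign (&slab, SLAB_SIZE, SLAB_SIZE) != 0)
    return NULL;
  *n_chunks = SLAB_SIZE / chunk_size;
  for (size_t i = 0; i < *n_chunks; i++)
    chunks[i] = (char*) slab + i * chunk_size;
  return slab;  /* release the whole slab with free(3) once all chunks are unused */
}
```

alignment of the slab base is what lets the allocator find a chunk's slab
header by simple address masking, which is why posix_memalign(3) is used
rather than plain malloc(3).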
this is accompanied by extra caching logic that can keep around freed
memory for approximately 15 seconds before returning it to the system.
memory left over from alignment constraints is used for cache colorization
(random distribution of chunk addresses) to improve processor cache
utilization. to improve SMP scalability, the caching layer can adapt
itself to high lock contention as it can occur with multiple threads on
SMP systems.
there's also a new test/profiling program glib/tests/slice-test now,
which allocates lots of randomly sized structures and releases/reallocates
those in random order.
here are the timing comparisons of system malloc vs. just the new slab
allocator (used internally by gslice.c), and then the slab allocator plus
magazine cache as it's available through g_slice_new() and g_slice_free():
g_malloc/g_free:
Starting 1 threads allocating random blocks <= 512 bytes with seed=123456789
using system malloc
real 0m39.502s
internal slab allocator:
Starting 1 threads allocating random blocks <= 512 bytes with seed=123456789
using slab allocator
real 0m20.064s
g_slice_*() (slab allocator + magazine cache):
Starting 1 threads allocating random blocks <= 512 bytes with seed=123456789
using slab allocator + magazine cache
real 0m13.888s
that's single threaded on a 1.8GHz Athlon with 512KB cache, glibc-2.3.2.
malloc takes almost 3 times as long as gslice.c here.
for greater block sizes, malloc performs a lot worse:
g_malloc/g_free:
Starting 1 threads allocating random blocks <= 1024 bytes with seed=123456789
using system malloc
real 1m40.123s
internal slab allocator:
Starting 1 threads allocating random blocks <= 1024 bytes with seed=123456789
using slab allocator
real 0m26.040s
g_slice_*() (slab allocator + magazine cache):
Starting 1 threads allocating random blocks <= 1024 bytes with seed=123456789
using slab allocator + magazine cache
real 0m16.784s
and here's a test across multiple CPUs (window.gnome.org, QuadXeon, 2.8GHz,
unfortunately tested under load and without support for sched_setaffinity(2)):
Starting 1 threads allocating random blocks <= 512 bytes with seed=123456789
using slab allocator + magazine cache
real 0m9.099s
Starting 2 threads allocating random blocks <= 512 bytes with seed=123456789
using slab allocator + magazine cache
real 0m9.945s
Starting 3 threads allocating random blocks <= 512 bytes with seed=123456789
using slab allocator + magazine cache
real 0m14.043s
Starting 4 threads allocating random blocks <= 512 bytes with seed=123456789
using slab allocator + magazine cache
real 0m18.680s
the seed of the random block allocator was fixed to ensure the same allocation
pattern was used to produce comparable runs.
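the fixed-seed approach can be sketched like this; this is not the actual
slice-test source, just a hypothetical loop using rand_r(3), with plain
malloc/free standing in for the allocator under test:

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* hypothetical benchmark core: with the same seed, every run produces the
 * same sequence of random block sizes, so timings of different allocators
 * are directly comparable. returns the summed block sizes via *total so
 * determinism can be checked. */
static void
run_pattern (unsigned int seed, int n_rounds, size_t max_size, size_t *total)
{
  *total = 0;
  for (int i = 0; i < n_rounds; i++)
    {
      size_t size = rand_r (&seed) % max_size + 1;
      char *block = malloc (size);   /* stand-in for the allocator under test */
      *total += size;
      free (block);
    }
}
```

rand_r(3) keeps the generator state local to the loop, so multiple
benchmark threads don't perturb each other's allocation patterns.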
for large block sizes, g_slice_new() and g_slice_alloc() will automatically
delegate to the system malloc implementation. so for newly written code, as
long as objects are not resized during their lifetime and the object size
used at allocation time is still available when freeing, it is recommended
to use the new g_slice_*() API instead of g_malloc() and friends.
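as a usage sketch of the recommended API: since this example has to compile
without glib, the two g_slice calls are stubbed with malloc/free below;
with glib HEAD you'd just include <glib.h> and drop the stub macros. the
Object type is made up for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* stubs so the example is self-contained; the real g_slice_new() and
 * g_slice_free() live in glib's <glib.h> */
#define g_slice_new(type)        ((type*) malloc (sizeof (type)))
#define g_slice_free(type, mem)  free (mem)

typedef struct {
  int  id;
  char name[32];
} Object;

static int
make_and_check (void)
{
  Object *obj = g_slice_new (Object);   /* instead of g_malloc (sizeof (Object)) */
  obj->id = 42;
  strcpy (obj->name, "demo");
  int id = obj->id;
  g_slice_free (Object, obj);           /* the size is known from the type at free time */
  return id;
}
```

note how the type (and thus the size) is passed again when freeing; that's
the "object size used at allocation time is still available when freeing"
requirement mentioned above.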
the allocator can be considered mostly ready at this point; what's left on
my TODO for it is:
- some additional test/profiling programs that should go into CVS
- look into using __thread instead of g_private_get() to speed up
  the common code path
- do controlled profiling on multi-CPU machines to measure scalability with
increasing numbers of processors.
- do more realistic profiling, preferably with memory allocation intensive
apps that don't require user interaction.
especially in the latter area, help is appreciated, i.e. if people have
specific memory benchmarking scenarios that can be made to work with glib
HEAD, please speak up ;)
from gslice.c:
/* the GSlice allocator is split up into 4 layers, roughly modelled after the slab
 * allocator and magazine extensions as outlined in:
 * + [Bonwick94] Jeff Bonwick, The slab allocator: An object-caching kernel
 *   memory allocator. USENIX 1994, http://citeseer.ist.psu.edu/bonwick94slab.html
 * + [Bonwick01] Bonwick and Jonathan Adams, Magazines and vmem: Extending