> My initial impression is that a page should be 1024 bytes (though I'm also
> looking at 512 bytes to reduce the shotgun effect of 8-byte allocs).
> I've actually found that "q-caches" were detrimental to things like
> consolidation, and the calling depth (overhead) of a worst-case allocation.
> So I'm currently leaving them out.  Instead I'm using a power-of-two based
> dq-cache (for differential q-cache).

Forgot to include two other points on this topic.  First, if q-caches were
used, the slab header would require a whole page (so as to maintain
alignment).  With a q-max of size 4 at most and a slab object multiple of 4,
we have 16 pages as the minimum allocation actually fulfilled by vmem.
However, pages have 1K alignment, which (you'll note) tells vmem that it
should look in its segment hashtable.  Thus a q-cache in this system could
not simply use multiples of a page size.  On the other hand, an alternative
would be to subdivide pages in two tiers instead of one.  If it is assumed
that accesses within the q-cache region will be rare (relegated mostly to
spills into vmem by the dq-cache), then the q-cache could be considered a
second-tier subdivision of a page.  For example, if the page size were raised
to 16K (it must be a power of two for efficient page-alignment calculations),
the dq-caches would still utilize a slab size of 1K and define dq's of sizes
8, 12, 16, 24, ..., 384, but the q-cache would utilize a slab size of 16K
(our new page size) and define q's of sizes 1, 2, 3, 4, 5 (in 1K vpages).
Note that dq's use powers of two interleaved with 3*2^k values to
efficiently cover a broad range, while q's use simple multiples so as to
minimize slack space.  Here, however, our q-cache allocations have an
incredible slack space due to the forced subdivision of 16, as the following
table shows:

page size = 16K (all sizes in 1K vpages; % overhead includes slack)

q-cache-size : num-objects : header-overhead : slack-space : % overhead
           1 :          15 :               1 :           0 : 6.66%
           2 :           7 :               1 :           1 : 14.2%
           3 :           5 :               1 :           0 : 6.66%
           4 :           3 :               1 :           3 : 33%
           5 :           3 :               1 :           0 : 6.66%
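
For the curious, this little program reproduces the 16K table above (modulo
rounding).  The assumptions are as stated: the header consumes a whole 1K
vpage to maintain alignment, the slab is one page, and percent overhead is
measured against the space actually handed out:

#include <stdio.h>

int main(void)
{
    int page_vpages = 16;              /* 16K page in 1K vpages      */
    int usable = page_vpages - 1;      /* one whole vpage for header */

    printf("q-size : objects : slack : %% overhead\n");
    for (int size = 1; size <= 5; size++) {
        int nobj  = usable / size;     /* objects per slab           */
        int used  = nobj * size;       /* vpages handed out          */
        int slack = usable - used;     /* unusable remainder         */
        printf("%6d : %7d : %5d : %.2f%%\n",
               size, nobj, slack, 100.0 * (1 + slack) / used);
    }
    return 0;
}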

There are several trade-offs.  The larger the slab size, the lower the max
overhead for any q-cache (here we have 33% max overhead for optimally
aligned sizes; in the worst case, an allocation of size 3.01K, for example,
would have 77% slack).  However, an allocation of size 6K would still
require a full 16K allocation by vmem, since we want to minimize
fragmentation.  That's an overhead of 166% (wasting 10K of memory).  If we
wanted to reduce both the percentage and the max physically wasted memory,
we could run the q-cache all the way down until size == 1 (at which point
the slab is of little or no use):

page size = 8K (all sizes in 1K vpages; % overhead is of the full page)

q-cache-size : num-objects : header-overhead : slack-space : % overhead
           1 :           7 :               1 :           0 : 12.5%
           2 :           3 :               1 :           1 : 25%
           3 :           2 :               1 :           1 : 25%

We round sizes 4, 5, 6 and 7 up to an allocation of size 8.  Thus the max
overhead is 3.01K -> 8K => 166%, which wastes over 4K of physical memory.
Note also that we incur at least 12.5% overhead.  It's debatable whether 8K
or 16K is better.  Both have worst cases of 166%, but the 16K wastes more
total physical memory.  Compare this to the following table of power-of-two
allocations for dq-caches:

virtual page size = 1K, slab header size approx 28 bytes
(sizes in bytes; third column counts object slots lost to the header)

size : num-objects : lost-to-header : slack-space : % overhead
   8 :         124 :              4 :           4 : 3.2%
  12 :          83 :              2 :           0 : 2.8%
  16 :          62 :              2 :           4 : 3.2%
  24 :          41 :              1 :          12 : 4%
  32 :          31 :              1 :           4 : 3.2%
  48 :          20 :              1 :          36 : 6.6%
  64 :          15 :              1 :          36 : 6.6%
  96 :          10 :              0 :          36 : 6.6%
 128 :           7 :              1 :         100 : 14.2%
 192 :           5 :              0 :          36 : 6.6%
 256 :           3 :              1 :         228 : 33%
 384 :           2 :              0 :         228 : 33%
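
The same exercise for this table (modulo rounding): the size classes are
powers of two interleaved with 3*2^k values, the ~28-byte header lives in
the slab, the third column counts object slots sacrificed to it, and
overhead is again measured against the space handed out.  This is my
reading of the numbers, not the actual implementation:

#include <stdio.h>

int main(void)
{
    int vpage = 1024, header = 28;

    printf("size : objects : lost : slack : %% overhead\n");
    for (int base = 8; base <= 256; base *= 2) {
        int sizes[2] = { base, base + base / 2 };  /* 2^k, 3*2^(k-1) */
        for (int i = 0; i < 2; i++) {
            int size  = sizes[i];
            int nobj  = (vpage - header) / size;   /* fit after header */
            int lost  = vpage / size - nobj;       /* slots lost to it */
            int slack = vpage - header - nobj * size;
            printf("%4d : %7d : %4d : %5d : %.1f%%\n",
                   size, nobj, lost, slack,
                   100.0 * (vpage - nobj * size) / (nobj * size));
        }
    }
    return 0;
}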

BUT, these first-tier dq-caches all need 1K allocations from vmem, which
passes the request off to the q-caches, which incur an additional 12.5%
overhead for 8K pages or 6.66% overhead for 16K pages.
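
To put numbers on the compounding (assuming the two overheads multiply,
since the q-cache inflates every 1K slab the dq-cache sits in): an 8-byte
dq allocation at 3.2% becomes roughly 1.032 * 1.125 - 1 ≈ 16% effective
overhead with 8K pages, or 1.032 * 1.0666 - 1 ≈ 10% with 16K pages.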

Additionally, note that the overhead can be reduced somewhat if we could use
an external header for slabs of object-size >= .5 vpage.  Unfortunately, the
first object of such a slab falls on a page boundary, and thus would be
inappropriately interpreted by vmem_size and vmem_free.  Note, once again,
that explicit object caches do not use vmem directly and are thus shielded
from this page/vpage limitation.  This only applies to the internal dq-cache
and the debated q-cache, not the explicitly allocated object-caches.
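
To illustrate why: on a free, vmem has only the pointer's alignment to go
on.  A minimal sketch of that test (the constant and names are illustrative,
not the actual code):

#include <stdint.h>
#include <stdio.h>

#define VPAGE_SIZE 1024u   /* 1K vpage, power of two */

/* Classify a pointer the way described above: vpage alignment means
 * "segment, consult the segment hashtable"; anything else finds its
 * slab header at the base of its vpage.  An external header would put
 * a slab's first object on a vpage boundary and send it down the
 * wrong branch. */
static const char *classify(const void *p)
{
    return ((uintptr_t)p & (VPAGE_SIZE - 1)) == 0
        ? "segment (hashtable lookup)"
        : "slab object (header at vpage base)";
}

int main(void)
{
    /* hypothetical addresses, purely for illustration */
    printf("%s\n", classify((void *)0x10000));   /* vpage-aligned */
    printf("%s\n", classify((void *)0x10038));   /* interior      */
    return 0;
}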

More important than slack space is the shotgun effect, which is exacerbated
since we now have 7 layers in which a freed memory object may reside:

- 8B dq-cache magazine (1 of 20; assuming max-stack-size = 5, data/prev_data
  are both full, and there are two threads)
- 8B dq-cache depot (1 of 50; assuming max-stacks = 5)
- 8B slab (1 of 124)
- 1024B q-cache magazine (1 of 8; assuming max-stack-size = 2, data/prev_data
  are both full, and there are two threads)
- 1024B q-cache depot (1 of 50)
- 1024B slab (1 of 7)
- vmem segment (1 of 255) (segment size = 255 * 8 * 1K vpage ≈ 2 Meg)
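
Multiplying the six non-vmem layers out: 20 * 50 * 124 * 8 * 50 * 7 =
347,200,000.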

That's 347 million non-vmem hiding places for a freed object.  A reclaim is
currently only slated to free depots, which reduces us to 3.47 million
hiding places.  If we do away with the second tier (and the vpage concept),
we have 124 thousand non-vmem hiding places.  With reclaim, we have just
over 2 thousand (since these are approximate numbers).  Note that this
doesn't take into account the number of allocated slabs for the given depot,
which is dependent on both the number of allocations and the probability of
slack due to the shotgun effect (which makes it a recursive probability).
If magazine-caches were lockable, then for a full free, we'd have 100%
probability of freeing all objects back to vmem and thus to the OS (unless
the user dynamically allocated depots, slabs and caches, which violates our
100%-freeing assumption).  This MIGHT, however, be construed as a Good Thing
(tm), as follows.

While it is important to regularly free memory back to the vmem paging
system so that different data-type sizes can reuse each other's memory
space, each page freed to vmem requires hash lookups and consolidation, and
possibly even an OS call.  One thing I noticed between GNU's malloc and
perl5's default malloc was that GNU's consolidation time was a killer when
millions of small allocs were performed.  In one memory-intensive app I
wrote (for the XML compression contest), a full 50% or more of the run time
(10+ minutes) was dedicated to consolidation.  I know this because when I
performed an immediate exit at the end of execution, instead of letting perl
deallocate everything, it took half as long.  When I replaced GNU's malloc
with that of perl5, there was almost zero post-"return" time.  Perl5
deallocated each and every object via reference-count decrements, but even
this took almost no time.  Consolidation is the only real difference between
the two algorithms.  I believe one element of this consolidation was the
insertion sort used in the free lists, which allows for a faster best-fit
algorithm (at the obvious expense of free times).  In comparison, I utilize
best-fit more memory-efficiently than Sun's proposed method (instant-fit,
which uses the first free segment in a power-of-2 free chain) AND faster
(constant-bounded time: max = 255 link lookups on allocs, with constant time
on deallocs) than GNU's linked-list lookup (O(n) due to unbounded size n).
Thus my algorithm should be nearly as fast as Sun's for allocs/frees (which
is a little simpler), but I've removed many of the restrictions and memory
inefficiencies with only marginal extra performance cost.
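
A sketch of what I mean by constant-bounded best-fit, assuming free
segments are kept on 256 size-indexed free lists (the names are
illustrative; splitting and consolidation are omitted):

#include <stddef.h>

#define NLISTS 256

struct seg { struct seg *next; size_t vpages; };

static struct seg *free_lists[NLISTS];   /* index = size in vpages */

/* Best fit in at most NLISTS-1 link lookups: scan upward from the
 * requested size for the first non-empty exact-size list. */
struct seg *seg_alloc(size_t vpages)
{
    for (size_t i = vpages; i < NLISTS; i++) {
        if (free_lists[i]) {
            struct seg *s = free_lists[i];
            free_lists[i] = s->next;
            return s;                /* caller splits any excess */
        }
    }
    return NULL;                     /* fall back to the OS      */
}

/* Constant-time dealloc: push onto the list for this exact size. */
void seg_free(struct seg *s)
{
    s->next = free_lists[s->vpages];
    free_lists[s->vpages] = s;
}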

At the time of the writing of Sun's paper, they recognized that a direct
user-space implementation of vmem was problematic (a 4-byte unit page size,
no easy access to the cpu-id as within the kernel, and each process must
initialize the memory manager, as opposed to only once per boot within the
kernel).  I believe I've alleviated the 4-byte page problem (by converting
the q-cache to a dq-cache), alleviated the cpu-id problem by making unlocked
magazine-caches per-thread instead of per-CPU (though this adds extra
thread-creation overhead, as well as increasing the number of magazine
locations where freed objects may hide, avoiding heterogeneous reuse), and,
most importantly, found a balance between external tagging management and
memory overhead by avoiding the proliferation of mini-data-structures.  One
problem admitted by Sun was that if your memory is full and you perform a
free, you bomb out when trying to allocate a new magazine to store the freed
object.  Here, except for user object-caches, we avoid that problem.

Lastly, as for process-init overhead (calling vmem_init()), only the
following need occur:

- init the seg-hash (128 trivial cells), the mm-hash (128 trivial cells),
  and the free-lists (256 trivial cells)

- init the linked-list pointers of: available-free-lists, depot_list,
  slab_list

- create depots for each dq-cache (which in turn creates slabs and
  thread-specific-data keys for the thread-specific magazine-caches).  A
  destructor is associated with each depot key so as to remove the TSD
  magazine-cache on thread-exit.

- create magazine-caches for each depot (using global memory), additionally
  storing each in TSD for the associated depot.

For each new thread, the first time it accesses a depot a new TSD-based
magazine-cache is created (and reclaimed on thread-exit), as sketched below.
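
A minimal sketch of that per-depot TSD plumbing, assuming POSIX threads
(the structure names and bookkeeping are illustrative):

#include <pthread.h>
#include <stdlib.h>

struct magazine_cache { int loaded; };   /* stand-in for the real thing */

struct depot {
    pthread_key_t tsd_key;   /* maps each thread to its magazine cache */
};

/* Runs automatically at thread exit: return the thread's cached
 * objects to the depot (elided here) and release the cache itself. */
static void magazine_cache_dtor(void *mc)
{
    free(mc);
}

void depot_init(struct depot *d)
{
    pthread_key_create(&d->tsd_key, magazine_cache_dtor);
}

/* First access from a thread lazily creates its magazine cache. */
struct magazine_cache *depot_get_cache(struct depot *d)
{
    struct magazine_cache *mc = pthread_getspecific(d->tsd_key);
    if (!mc) {
        mc = calloc(1, sizeof *mc);
        pthread_setspecific(d->tsd_key, mc);
    }
    return mc;
}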

-Michael
