> My initial impression is that a page should be 1024 bytes (though I'm also
> looking at 512 bytes to reduce the shotgun effect of 8-byte allocs).
> I've actually found that "q-caches" were detrimental to things like
> consolidation, and the calling depth (overhead) of a worst-case allocation.
> So I'm currently leaving them out. Instead I'm using a power-of-two based
> dq-cache (for differential q-cache).
Forgot to include two other points on this topic.

First, if q-caches were used, the slab header would require a whole page (so as to maintain alignment). With a q-max of size 4 at most and a slab object multiple of 4, we have 16 pages as the minimum allocation actually fulfilled by vmem. However, pages are of 1K alignment, which you'll note tells vmem that it should look in its segment hashtable. Thus a q-cache in this system could not simply use multiples of a page size.

On the other hand, an alternative would be to subdivide pages into two regions instead of one. If it is assumed that accesses within the q-cache region will be rare (relegated mostly to spills into vmem by the dq-cache), then the q-cache could be considered a second-tier subdivision of a page. For example, if the page size were raised to 16K (it must be a power of two for efficient page-alignment calculations), the dq-caches would still utilize a slab size of 1K and define dq's of sizes 8, 16, 24, ..., 384, but the q-cache would utilize a slab size of 16K (our new page size) and define q's of sizes 1, 2, 3, 4, 5. Note that dq's use powers of 2 and 3 to efficiently cover a broad range, while q's use multiples so as to minimize slack space. Here, however, our q-cache allocations have an incredible slack space due to the forced subdivision into 16 units, as the following table shows:

page-size 16K
q-cache-size : num-objects : header-overhead : slack-space : % overhead (including slack)
1 : 15 : 1 : 0 : 6.66%
2 :  7 : 1 : 1 : 14.2%
3 :  5 : 1 : 0 : 6.66%
4 :  3 : 1 : 3 : 33%
5 :  3 : 1 : 0 : 6.66%

There are several trade-offs. The larger the slab size, the lower the max overhead for any q-cache (here we have 33% max overhead for optimal alignments; in the worst case, an allocation of size 3.01K, for example, would have 77% slack). However, an allocation of size 6K would still require a full 16K allocation by vmem, since we want to minimize fragmentation. That's an overhead of 166% (wasting 10K of memory).
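The table entries above can be reproduced mechanically. The following is my own illustrative sketch (not the allocator's code), assuming sizes in 1K units, a slab header occupying one unit, and overhead counted as header plus slack relative to the units actually handed out as objects:

```c
#include <assert.h>

/* How many q-cache objects of `qsize` units fit in a page of `page_units`
 * after a `header_units` header; *slack receives the leftover units. */
static int qcache_objects(int page_units, int header_units, int qsize,
                          int *slack)
{
    int usable  = page_units - header_units;
    int objects = usable / qsize;
    *slack = usable - objects * qsize;
    return objects;
}

/* Percent overhead as quoted in the table: (header + slack) relative to
 * the units actually usable as objects. */
static double qcache_overhead_pct(int page_units, int header_units, int qsize)
{
    int slack;
    int objects = qcache_objects(page_units, header_units, qsize, &slack);
    return 100.0 * (header_units + slack) / (double)(objects * qsize);
}
```

For instance, `qcache_objects(16, 1, 4, &slack)` yields 3 objects with 3 units of slack, matching the 33% row.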
If we wanted to reduce both the percent and the max physically wasted memory, we could run the q-cache all the way until size == 1 (at which point the slab is of little or no use):

page size = 8K
1 : 7 : 1 : 0 : 12.5%
2 : 3 : 1 : 1 : 25%
3 : 2 : 1 : 1 : 25%

We round sizes 4, 5, 6 and 7 up to an allocation of size 8. Thus the max overhead is 3.1K -> 8K => 166%, which wastes over 4K of physical memory. Note also that we incur at least 12.5% overhead. It's debatable whether 8K or 16K is better. Both have worst cases of 166%, but the 16K wastes more total physical memory.

Compare this to the following table of power-of-2 allocations for dq-caches:

virtual page size = 1K, slab header size is approx 28 bytes
8   : 124 : 4 : 4   : 3.2%
12  : 83  : 2 : 0   : 2.811%
16  : 62  : 2 : 4   : 3.2%
24  : 41  : 1 : 12  : 4%
32  : 31  : 1 : 4   : 3.2%
48  : 20  : 1 : 36  : 6.6%
64  : 15  : 1 : 36  : 6.6%
96  : 10  : 0 : 64  : 6.6%
128 : 7   : 1 : 100 : 14.2%
192 : 5   : 0 : 36  : 6.6%
256 : 3   : 1 : 228 : 33%
384 : 2   : 0 : 228 : 33%

BUT, these first-tier dq-caches all need 1K allocations from vmem, which passes the request off to the q-caches, which incur an additional 12.5% overhead for 8K pages or 6.66% overhead for 16K pages.

Additionally, note that the overhead could be reduced somewhat if we could use an external header for slabs of object-size >= .5 vpage. Unfortunately, the first object of such a slab falls on a page boundary, and thus would be inappropriately interpreted by vmem_size and vmem_free. Note, once again, that explicit object caches do not use vmem directly and are thus shielded from this page/vpage limitation. This only applies to the internal dq-cache and the debated q-cache, not the explicitly allocated object-caches.
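The dq size classes in that table (8, 12, 16, 24, 32, 48, ..., 384) are exactly the powers of two plus 3/2 times each power of two. As a hedged sketch, here is one way a request could be rounded up to its class; this is my reconstruction from the listed sizes, not the actual allocator code:

```c
/* Round a request up to the nearest "power of 2/3" dq size class:
 * each class is either 2^k or 1.5 * 2^k, from 8 up to 384. */
static int dq_class_size(int request)
{
    int p = 8;                            /* smallest class */
    while (p <= 384) {
        if (request <= p)
            return p;                     /* power-of-two class */
        if (request <= p + p / 2)
            return p + p / 2;             /* 1.5 * 2^k class */
        p *= 2;
    }
    return -1;                            /* too big: falls through to vmem */
}
```

For example, a 100-byte request rounds up to the 128-byte class, while a 10-byte request lands in the 12-byte class, keeping worst-case slack within a class to under 50%.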
More important than slack space is the shotgun effect, which is exacerbated since we now have 7 layers in which a freed memory object may reside:

8B dq-cache magazine (1 of 20; assuming max-stack-size = 5, data/prev_data are both full, and there are two threads)
8B dq-cache depot (1 of 50; assuming max-stacks = 5)
8B slab (1 of 124)
1024B q-cache magazine (1 of 8; assuming max-stack-size = 2, data/prev_data are both full, and there are two threads)
1024B q-cache depot (1 of 50)
1024B slab (1 of 7)
vmem segment (1 of 255) (segment size = 255 * 8 * 1K vpage = 2 Meg)

That's 347 million non-vmem hiding places for a freed object. A reclaim currently is only slated to free depots, which reduces us to 3.47 million hiding places. If we do away with the second tier (and the vpage concept), we have 124 thousand non-vmem hiding places; with reclaim, just over 2 thousand (since these are approximate numbers). Note that this doesn't take into account the number of allocated slabs for the given depot, which is dependent on both the number of allocations and the probability of slack due to shotgun (which thus means it's a recursive probability).

If magazine-caches were lockable, then for a full free, we'd have 100% probability of freeing all objects back to vmem and thus to the OS (unless the user dynamically allocated depots, slabs and caches, which violates our 100% freeing assumption). This MIGHT, however, be construed as a good thing (tm), as follows. While it is important to regularly free memory back to the vmem paging system so that different data-type sizes can reuse each other's memory space, each page-size freed to vmem requires hash lookups and consolidation, and possibly even an OS call. One thing I noticed between GNU's malloc and perl5's default malloc was that GNU's consolidation time was a killer when millions of small allocs were performed.
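The 347 million figure is just the product of the per-layer capacities for the six non-vmem layers listed above (the vmem segment itself is excluded, since the figure counts non-vmem hiding places). Purely illustrative arithmetic:

```c
#include <stddef.h>

/* Multiply out the per-layer capacities quoted above for the six
 * non-vmem layers: two magazine tiers, two depot tiers, two slab tiers. */
static unsigned long hiding_places(void)
{
    const unsigned long layers[] = { 20, 50, 124, 8, 50, 7 };
    unsigned long total = 1;
    for (size_t i = 0; i < sizeof layers / sizeof layers[0]; i++)
        total *= layers[i];
    return total;
}
```

Dropping the second tier leaves only the first three factors: 20 * 50 * 124 = 124,000, the "124 thousand" quoted above.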
In one memory-intensive app I wrote (for the XML compression contest), a full 50% or more of the run time (10+ minutes) was dedicated to consolidation. I know this because when I performed an immediate exit at the end of execution, instead of letting perl deallocate everything, it took half as long. When I replaced GNU's malloc with that of perl5, there was almost zero post-"return" time. Perl5 deallocated each and every object via reference-count decrementing, but even this took almost no time. Consolidation is the only real difference between the two algorithms. I believe one element of this consolidation cost was the insertion sort used in the free lists, which allows for a faster best-fit algorithm (at the obvious expense of free times).

In comparison, I utilize best-fit more memory-efficiently than Sun's proposed method (instant-fit, which uses the first free element in a power-of-2 free chain) AND faster (constant-bounded time: max = 255 link lookups on allocs, with constant time on deallocs) than GNU's linked-list lookup (O(n) due to unbounded size n). Thus my algorithm should be nearly as fast as Sun's for allocs/frees (theirs is a little simpler), but I've removed many of the restrictions and memory inefficiencies with only marginal extra performance cost.

At the time of the writing of Sun's paper, they recognized that a direct user-space implementation of vmem was problematic (4B unit page size, no easy access to cpu-id as within the kernel, and each process must initialize the memory manager, as opposed to only once per boot within the kernel).
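The "constant-bounded" claim follows from having a fixed array of size-class free lists (the 256 free-list cells mentioned below): an alloc scans upward from the requested class to the first non-empty list, which is at most 255 probes, and a free is a single push. A minimal sketch under that assumption (names and structure are mine, not the actual implementation):

```c
#include <stddef.h>

#define NCLASSES 256

/* One intrusive free list per size class.  Best-fit within the class
 * granularity: the first non-empty list at or above the request is the
 * smallest block that can satisfy it. */
typedef struct free_block {
    struct free_block *next;
} free_block;

static free_block *free_lists[NCLASSES];

/* Constant-time dealloc: push onto the class's list. */
void seg_free(void *p, int size_class)
{
    free_block *b = p;
    b->next = free_lists[size_class];
    free_lists[size_class] = b;
}

/* Bounded alloc: at most NCLASSES - 1 probes up the class array. */
void *seg_alloc(int size_class)
{
    for (int c = size_class; c < NCLASSES; c++) {
        if (free_lists[c]) {
            free_block *b = free_lists[c];
            free_lists[c] = b->next;
            return b;              /* caller may split off the remainder */
        }
    }
    return NULL;                   /* exhausted: fall back to vmem / OS */
}
```

The GNU-style alternative walks one sorted list of all free blocks, so its lookup grows with the number of free blocks rather than being capped by the class count.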
I believe I've alleviated the 4B page problem (by converting q-cache to dq-cache), alleviated the cpu-id problem by making unlocked magazine-caches per-thread instead of per-CPU (though this adds extra thread-creation overhead, as well as increasing the number of magazine locations where freed objects may hide (avoiding heterogeneous reuse)), and, most importantly, found a balance between external tagging management and memory overhead by avoiding the proliferation of mini-data-structures. One problem admitted by Sun was that if your memory is full and you perform a free, you bomb out when trying to allocate a new magazine to store the freed object. Here, except for user object-caches, we avoid the problem.

Lastly, as for process-init overhead (calling vmem_init()), only the following need occur:

init the seg-hash (128 trivial cells), mm-hash (128 trivial cells), and free-lists (256 trivial cells)
init the linked-list pointers of: available-free-lists, depot_list, slab_list
create depots for each dq-cache (which in turn creates slabs and thread-specific-data keys for thread-specific magazine-caches); a destructor is associated with each depot key so as to remove the TSD magazine-cache on thread exit
create magazine-caches for each depot (using global memory) (additionally stored in TSD for the associated depot)

For each new thread, the first time it accesses a depot, a new TSD-based magazine-cache is created (which is reclaimed on thread exit).

-Michael
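The per-thread magazine-cache lookup described above maps naturally onto POSIX thread-specific data. A hedged sketch, with hypothetical names (depot, magazine_cache, depot_get_magcache are illustrative, not the actual API):

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct magazine_cache { int placeholder; } magazine_cache;

typedef struct depot {
    pthread_key_t tsd_key;       /* one TSD key per depot */
} depot;

/* Key destructor: runs automatically at thread exit, reclaiming the
 * thread's magazine cache for this depot. */
static void magcache_destroy(void *p)
{
    free(p);
}

void depot_init(depot *d)
{
    pthread_key_create(&d->tsd_key, magcache_destroy);
}

/* Unlocked per-thread lookup: the first access by a thread lazily
 * creates its magazine cache; later accesses are a plain TSD fetch. */
magazine_cache *depot_get_magcache(depot *d)
{
    magazine_cache *mc = pthread_getspecific(d->tsd_key);
    if (mc == NULL) {
        mc = calloc(1, sizeof *mc);
        pthread_setspecific(d->tsd_key, mc);
    }
    return mc;
}
```

Because each thread reaches its magazine cache through its own TSD slot, no lock is needed on the fast path, which is what substitutes for the kernel's per-CPU cpu-id indexing.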