Re: slab allocator performance issues
On Tue, 20 Dec 2022 at 21:19, John Naylor wrote:
>
> On Tue, Dec 20, 2022 at 10:36 AM David Rowley wrote:
> >
> > I'm planning on pushing the attached v3 patch shortly. I've spent
> > several days reading over this and testing it in detail along with
> > adding additional features to the SlabCheck code to find more
> > inconsistencies.
>
> FWIW, I reran my test from last week and got similar results.

Thanks a lot for testing that stuff last week. I got a bit engrossed in
the perf weirdness and forgot to reply. I found the results made much
more sense after using palloc0 and touching the allocated memory just
before freeing. I think this is a more realistic test.

I've now pushed the patch after making a small adjustment to the version
I sent earlier.

David
Re: slab allocator performance issues
On Tue, Dec 20, 2022 at 10:36 AM David Rowley wrote:
>
> I'm planning on pushing the attached v3 patch shortly. I've spent
> several days reading over this and testing it in detail along with
> adding additional features to the SlabCheck code to find more
> inconsistencies.

FWIW, I reran my test from last week and got similar results.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
On Tue, Dec 13, 2022 at 7:50 AM David Rowley wrote:
>
> Thanks for testing the patch.
>
> On Mon, 12 Dec 2022 at 20:14, John Naylor wrote:
> > While allocation is markedly improved, freeing looks worse here. The
> > proportion is surprising because only about 2% of nodes are freed
> > during the load, but doing that takes up 10-40% of the time compared
> > to allocating.
>
> I've tried to reproduce this with the v13 patches applied and I'm not
> really getting the same as you are. To run the function 100 times I
> used:
>
> select x, a.* from generate_series(1,100) x(x), lateral (select * from
> bench_load_random_int(500 * 1000 * (1+x-x))) a;

Simply running over a longer period of time like this makes the SlabFree
difference much closer to your results, so it doesn't seem out of line
anymore. Here SlabAlloc seems to take maybe 2/3 of the time of current
slab, with a 5% reduction in total time:

500k ints:

v13-0001-0005, average of 30: 217ms

  47.61%  postgres  postgres   [.] rt_set
  20.99%  postgres  postgres   [.] SlabAlloc
  10.00%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   6.87%  postgres  [unknown]  [k] 0xbce011b7
   3.53%  postgres  postgres   [.] MemoryContextAlloc
   2.82%  postgres  postgres   [.] SlabFree

+ slab v4, average of 30: 206ms

  51.13%  postgres  postgres   [.] rt_set
  14.08%  postgres  postgres   [.] SlabAlloc
  11.41%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   7.44%  postgres  [unknown]  [k] 0xbce011b7
   3.89%  postgres  postgres   [.] MemoryContextAlloc
   3.39%  postgres  postgres   [.] SlabFree

It doesn't look mysterious anymore, but I went ahead and took some more
perf measurements, including for cache misses. My naive impression is
that we're spending a bit more time waiting for data, but having to do
less work with it once we get it, which is consistent with your earlier
comments:

perf stat -p $pid sleep 2

v13:

       2,001.55 msec task-clock:u        #    1.000 CPUs utilized
              0      context-switches:u  #    0.000 /sec
              0      cpu-migrations:u    #    0.000 /sec
        311,690      page-faults:u       #  155.724 K/sec
  3,128,740,701      cycles:u            #    1.563 GHz
  4,739,333,861      instructions:u      #    1.51  insn per cycle
    820,014,588      branches:u          #  409.690 M/sec
      7,385,923      branch-misses:u     #    0.90% of all branches

+ slab v4:

       2,001.09 msec task-clock:u        #    1.000 CPUs utilized
              0      context-switches:u  #    0.000 /sec
              0      cpu-migrations:u    #    0.000 /sec
        326,017      page-faults:u       #  162.920 K/sec
  3,016,668,818      cycles:u            #    1.508 GHz
  4,324,863,908      instructions:u      #    1.43  insn per cycle
    761,839,927      branches:u          #  380.712 M/sec
      7,718,366      branch-misses:u     #    1.01% of all branches

perf stat -e LLC-loads,LLC-loads-misses -p $pid sleep 2

min/max of 3 runs:

v13:       LL cache misses: 25.08% - 25.41%
+ slab v4: LL cache misses: 25.74% - 26.01%

--
John Naylor
EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
Thanks for testing the patch.

On Mon, 12 Dec 2022 at 20:14, John Naylor wrote:
> v13-0001 to 0005:
> 2.60% postgres postgres [.] SlabFree
> + v4 slab:
> 4.98% postgres postgres [.] SlabFree
>
> While allocation is markedly improved, freeing looks worse here. The
> proportion is surprising because only about 2% of nodes are freed
> during the load, but doing that takes up 10-40% of the time compared
> to allocating.

I've tried to reproduce this with the v13 patches applied and I'm not
really getting the same as you are. To run the function 100 times I
used:

select x, a.* from generate_series(1,100) x(x), lateral (select * from
bench_load_random_int(500 * 1000 * (1+x-x))) a;

(I had to add the * (1+x-x) to add a lateral dependency to stop the
function just being executed once)

v13-0001 - 0005 gives me:

  37.71%  postgres  [.] rt_set
  19.24%  postgres  [.] SlabAlloc
   8.73%  [kernel]  [k] clear_page_rep
   5.21%  postgres  [.] rt_node_insert_inner.isra.0
   2.63%  [kernel]  [k] asm_exc_page_fault
   2.24%  postgres  [.] SlabFree

and fairly consistently 122 ms runtime per call. Applying the v4 slab
patch I get:

  41.06%  postgres  [.] rt_set
  10.84%  postgres  [.] SlabAlloc
   9.01%  [kernel]  [k] clear_page_rep
   6.49%  postgres  [.] rt_node_insert_inner.isra.0
   2.76%  postgres  [.] SlabFree

and fairly consistently 112 ms per call.

I wonder if you can consistently get the same result on another compiler
or after patching something like master~50 or master~100. Maybe it's
just a code alignment thing.

Looking at the annotation of perf report for SlabFree with the patched
version I see:

       │      /* push this chunk onto the head of the free list */
       │      *(MemoryChunk **) pointer = block->freehead;
  0.09 │        mov     0x10(%r8),%rax
       │      slab = block->slab;
 59.15 │        mov     (%r8),%rbp
       │      *(MemoryChunk **) pointer = block->freehead;
  9.43 │        mov     %rax,(%rdi)
       │      block->freehead = chunk;
       │
       │      block->nfree++;

I think what that's telling me is that dereferencing the block's memory
is slow, likely due to that particular cache line not being cached any
longer.

I tried running the test with 10,000 ints instead of 500,000 so that
there would be less CPU cache pressure. I see:

 29.76 │        mov     (%r8),%rbp
       │      *(MemoryChunk **) pointer = block->freehead;
 12.72 │        mov     %rax,(%rdi)
       │      block->freehead = chunk;
       │
       │      block->nfree++;
       │        mov     0x8(%r8),%eax
       │      block->freehead = chunk;
  4.27 │        mov     %rdx,0x10(%r8)
       │      SlabBlocklistIndex():
       │      index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
       │        mov     $0x1,%edx
       │      SlabFree():
       │      block->nfree++;
       │        lea     0x1(%rax),%edi
       │        mov     %edi,0x8(%r8)
       │      SlabBlocklistIndex():
       │      int32 blocklist_shift = slab->blocklist_shift;
       │        mov     0x70(%rbp),%ecx
       │      index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
  8.46 │        shl     %cl,%edx

Various other instructions in SlabFree are proportionally taking longer
now. For example, the bit shift at the end was insignificant previously.
That indicates to me that this is due to caching effects.

We must fetch the block in SlabFree() in both versions. It's possible
that something is going on in SlabAlloc() that is causing more useful
cachelines to be evicted, but (I think) one of the primary design goals
Andres was going for was to reduce that. For example, not having to
write out the freelist for an entire block when the block is first
allocated means not having to load possibly all cache lines for the
entire block anymore.

I tried looking at perf stat during the run. Without slab changes:

drowley@amd3990x:~$ sudo perf stat --pid=74922 sleep 2

 Performance counter stats for process id '74922':

       2,000.74 msec task-clock               #    1.000 CPUs utilized
              4      context-switches         #    1.999 /sec
              0      cpu-migrations           #    0.000 /sec
        578,139      page-faults              #  288.963 K/sec
  8,614,687,392      cycles                   #    4.306 GHz            (83.21%)
    682,574,688      stalled-cycles-frontend  #    7.92% frontend cycles idle  (83.33%)
  4,822,904,271      stalled-cycles-backend   #   55.98% backend cycles idle   (83.41%)
 11,447,124,105      instructions             #    1.33  insn per cycle
                                              #    0.42  stalled cycles per insn  (83.41%)
  1,947,647,575      branches                 #  973.464 M/sec          (83.41%)
     13,914,897      branch-misses            #    0.71% of all branches
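David's point about not writing freelist links into a newly allocated block can be sketched with a small self-contained example. The field names and layout here are illustrative only, not the actual slab.c structures: a new block records only a count of never-allocated chunks, so initialization touches none of the block's payload cache lines, and freed chunks are preferred on allocation since their lines may still be cached.

```c
#include <assert.h>
#include <stdlib.h>

#define CHUNK_SIZE 64
#define CHUNKS_PER_BLOCK 128

/* Illustrative block header; not the real SlabBlock layout. */
typedef struct Block
{
	void	   *freehead;	/* linked list of previously freed chunks */
	int			unused;		/* count of never-allocated chunks */
	int			nfree;
	char		chunks[CHUNK_SIZE * CHUNKS_PER_BLOCK];
} Block;

static void
block_init(Block *b)
{
	/*
	 * Lazy initialization: no freelist links are written into the chunks,
	 * so none of the block's payload cache lines are touched here.  Eagerly
	 * chaining every chunk into a freelist would fault in and dirty the
	 * whole block up front.
	 */
	b->freehead = NULL;
	b->unused = CHUNKS_PER_BLOCK;
	b->nfree = CHUNKS_PER_BLOCK;
}

static void *
block_alloc(Block *b)
{
	void	   *p;

	if (b->freehead != NULL)
	{
		/* prefer reusing freed chunks: their lines may still be cached */
		p = b->freehead;
		b->freehead = *(void **) p;
	}
	else if (b->unused > 0)
	{
		/* otherwise carve the next never-allocated chunk off the block */
		p = b->chunks + CHUNK_SIZE * (CHUNKS_PER_BLOCK - b->unused);
		b->unused--;
	}
	else
		return NULL;
	b->nfree--;
	return p;
}

static void
block_free(Block *b, void *p)
{
	/* store the old list head in the freed chunk itself */
	*(void **) p = b->freehead;
	b->freehead = p;
	b->nfree++;
}
```

A freed chunk is handed out again before any untouched chunk, which is the "keep new allocations on fuller blocks" behavior discussed elsewhere in the thread.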
Re: slab allocator performance issues
On Sat, Dec 10, 2022 at 11:02 AM David Rowley wrote:
> [v4]

Thanks for working on this! I ran an in-situ benchmark using the v13
radix tree patchset ([1] WIP but should be useful enough for testing
allocation speed), only applying the first five patches, which are
local-memory only. The benchmark is not meant to represent a realistic
workload, and primarily stresses traversal and allocation of the
smallest node type. Minimum of five runs, with turbo-boost off, on
recent Intel laptop hardware:

v13-0001 to 0005:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     151123432 |     222

  47.06%  postgres  postgres   [.] rt_set
  22.89%  postgres  postgres   [.] SlabAlloc
   9.65%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   5.94%  postgres  [unknown]  [k] 0xb5e011b7
   3.62%  postgres  postgres   [.] MemoryContextAlloc
   2.70%  postgres  libc.so.6  [.] __memmove_avx_unaligned_erms
   2.60%  postgres  postgres   [.] SlabFree

+ v4 slab:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     152463112 |     213

  52.42%  postgres  postgres   [.] rt_set
  12.80%  postgres  postgres   [.] SlabAlloc
   9.38%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   7.87%  postgres  [unknown]  [k] 0xb5e011b7
   4.98%  postgres  postgres   [.] SlabFree

While allocation is markedly improved, freeing looks worse here. The
proportion is surprising because only about 2% of nodes are freed during
the load, but doing that takes up 10-40% of the time compared to
allocating.

num_keys = 50, height = 7
n4 = 2501016, n15 = 56932, n32 = 270, n125 = 0, n256 = 257

Sidenote: I don't recall ever seeing vsyscall (I think that's what the
0xb5e011b7 address is referring to) in a profile, so I'm not sure what
is happening there.

[1] https://www.postgresql.org/message-id/CAFBsxsHNE621mGuPhd7kxaGc22vMkoSu7R4JW9Zan1jjorGy3g%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
On Mon, 5 Dec 2022 at 23:18, John Naylor wrote:
>
> On Mon, Dec 5, 2022 at 3:02 PM David Rowley wrote:
> > Going by [2], the instructions are very different with each method, so
> > other machines with different latencies on those instructions might
> > show something different. I attached what I used to test if anyone
> > else wants a go.
>
> I get about 0.1% difference on my machine. Both ways boil down to (on
> gcc) 3 instructions with low latency. The later ones need the prior
> results to execute, which I think is what the XXX comment "isn't great"
> was referring to. The new coding is more mysterious (does it do the
> right thing on all platforms?), so I guess the original is still the
> way to go unless we get a better idea.

I don't think it would work well on a one's complement machine, but I
don't think we support those, going by the comments above RIGHTMOST_ONE
in bitmapset.c. In any case, I found that it wasn't any faster than what
Andres wrote. In fact, even changing the code there to "index =
(freecount > 0);" seems to do very little to increase performance.

I do see that having 3 freelist items performs a decent amount better
than having 9. However, which workload is run may alter the result of
that, assuming that keeping new allocations on fuller blocks is a
winning strategy for the CPU's caches.

I've now done quite a bit more work on Andres' patch to try and get it
into (hopefully) somewhere close to a committable shape. I'm fairly
happy with what's there now. It does seem to perform much better than
current master, especially so when the workload would have caused master
to continually malloc and free an entire block when freeing and
allocating a single chunk.

I've done some basic benchmarking, mostly using the attached alloc_bench
patch.
If I run:

select *,
       round(slab_result / aset_result * 100 - 100, 1)::text || '%'
         as slab_slower_by
from (
  select chunk_size,
         keep_chunks,
         sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks,
             1024*1024*1024, 'slab'))::numeric(1000,3) as slab_result,
         sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks,
             1024*1024*1024, 'aset'))::numeric(1000,3) as aset_result
  from (values(1),(10),(50),(100),(200),(300),(400),(500),(1000),(2000),
              (3000),(4000),(5000),(1)) v1(keep_chunks),
       (values(64),(128),(256),(512),(1024)) v2(chunk_size)
  group by rollup(1,2)
);

the results for the 64-byte chunk are shown in the attached chart. It's
not as fast as aset, but much faster than master's slab.c. The first
blue bar of the chart is well above the vertical axis; it took master
1118958 milliseconds for that test. The attached patch took 194 ms. The
rest of the tests seem to put the patched code somewhere in the middle
between the unpatched code and aset.c's performance.

The benchmark I did was entirely a FIFO workload that keeps around
"keep_chunks" chunks at once before starting to free the oldest chunks.
I know Tomas has some LIFO and random benchmarks. I edited that code [1]
a little to add support for other context types so that a comparison
could be done more easily. However, I'm getting very weird performance
results where sometimes it runs about twice as fast (Tomas mentioned he
got this too). I'm not seeing that with my own benchmarking function, so
I'm wondering if there's something weird going on with the benchmark
itself rather than the slab.c code.

I've made many more changes than I can list here, but here are a few of
the more major ones:

1. Make use of dclist for empty blocks.

2. In SlabAlloc(), allocate chunks from the freelist before the unused
list.

3. Added output for showing information about empty blocks in the
SlabStats output.

4. Renamed the context freelists to blocklist. I found the old name was
likely just confusing things with the block-level freelist. In any case,
it seemed weird to have freelist[0] store full blocks. Not much free
there! I renamed it to blocklist[].

5. I did a round of changing the order of the fields in SlabBlock. This
seems to affect performance quite a bit. Having the context first seems
to improve performance. Having the blocklist[] node last also helps.

6. Removed nblocks and headerSize from SlabContext. headerSize is no
longer needed. nblocks was only really used for Asserts and SlabIsEmpty.
I changed the Asserts to use a local count of blocks and changed
SlabIsEmpty to look at the context's mem_allocated.

7. There's now no integer division in any of the alloc and free code.
The only time we divide by fullChunkSize is in the check function.

8. When using a block from the emptyblock list, I changed the code to
not re-init the block. It's now used as it was left previously. This
means no longer having to build the freelist again.

9. Updated all comments to reflect the current state of the code.

Some things I thought about but didn't do:

a. Change the size of SlabContext's chunkSize, fullChunkSize and
blockSize to be uint32 instead of Size. It might be possible
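The FIFO workload described above can be sketched like this. It is a hypothetical stand-in using plain malloc/free rather than palloc/pfree in a slab context, but it shows the access pattern: keep at most keep_chunks allocations alive, evicting the oldest before each new allocation once the window is full.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * FIFO allocation pattern: keep at most keep_chunks allocations alive,
 * freeing the oldest chunk before each new allocation once the window
 * is full.  Returns the total number of allocations performed.
 */
static long
fifo_pattern(size_t chunk_size, int keep_chunks, long total_chunks)
{
	void	  **window = calloc(keep_chunks, sizeof(void *));
	long		allocated = 0;

	for (long i = 0; i < total_chunks; i++)
	{
		int			slot = i % keep_chunks;

		if (window[slot] != NULL)
			free(window[slot]);	/* evict the oldest live chunk */
		window[slot] = malloc(chunk_size);
		allocated++;
	}

	/* drain whatever is still live; free(NULL) is a no-op */
	for (int s = 0; s < keep_chunks; s++)
		free(window[s]);
	free(window);
	return allocated;
}
```

With a slab context, the interesting part is that once the window is full, every allocation is paired with a free, so a naive allocator would repeatedly free and then re-malloc an entire block at the window boundary, which is the pathological case mentioned above.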
Re: slab allocator performance issues
On Fri, Sep 10, 2021 at 5:07 PM Tomas Vondra wrote:
> Turns out it's pretty difficult to benchmark this, because the results
> strongly depend on what the backend did before.

What you report here seems to be mostly cold-cache effects, with which I
don't think we need to be overly concerned. We don't want big
regressions in the cold-cache case, but there is always going to be some
overhead when a new backend starts up, because you've got to fault some
pages into the heap/malloc arena/whatever before they can be efficiently
accessed.

What would be more concerning is if we found out that the performance
depended heavily on the internal state of the allocator. For example,
suppose you have two warmup routines W1 and W2, each of which touches
the same amount of total memory, but with different details. Then you
have a benchmark B. If you do W1-B and W2-B and the time for B varies
dramatically between them, then you've maybe got an issue. For instance,
it could indicate that the allocator has issues when the old and new
allocations are very different sizes, or something like that.

--
Robert Haas
EDB: http://www.enterprisedb.com
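The W1-B / W2-B experiment Robert describes could be sketched as follows. This is a hypothetical harness using plain malloc in place of PostgreSQL's memory contexts; the warmup routines, sizes, and iteration counts are all illustrative assumptions, not anything from the thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TOTAL_BYTES (32 * 1024 * 1024)

/* W1: touch TOTAL_BYTES via many small allocations, then free them all. */
static void
warmup_w1(void)
{
	enum { N = TOTAL_BYTES / 64 };
	static void *ptrs[N];

	for (int i = 0; i < N; i++)
	{
		ptrs[i] = malloc(64);
		memset(ptrs[i], 0, 64);
	}
	for (int i = 0; i < N; i++)
		free(ptrs[i]);
}

/* W2: touch the same total amount of memory via a few large allocations. */
static void
warmup_w2(void)
{
	for (int i = 0; i < 32; i++)
	{
		void	   *p = malloc(TOTAL_BYTES / 32);

		memset(p, 0, TOTAL_BYTES / 32);
		free(p);
	}
}

/* B: the benchmark proper; returns elapsed CPU seconds. */
static double
bench_b(void)
{
	clock_t		start = clock();

	for (int i = 0; i < 100000; i++)
	{
		void	   *p = malloc(512);

		memset(p, 0, 512);
		free(p);
	}
	return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

If bench_b() timed after warmup_w1() differs dramatically from bench_b() timed after warmup_w2() (each in a fresh process), the allocator's internal state rather than cold caches is the likely culprit.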
Re: slab allocator performance issues
On Mon, Dec 5, 2022 at 3:02 PM David Rowley wrote:
>
> On Fri, 11 Nov 2022 at 22:20, John Naylor wrote:
> > #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
>
> Doesn't this create a sort of round-robin use of the free list? What
> we want is a sort of "histogram" bucket set of free lists so we can
> group together blocks that have a close-enough free number of chunks.
> Unless I'm mistaken, I think what you have doesn't do that.

The intent must have slipped my mind along the way.

> I wondered if simply:
>
> index = -(-freecount >> slab->freelist_shift);
>
> would be faster than Andres' version. I tried it out and on my AMD
> machine, it's about the same speed. Same on a Raspberry Pi 4.
>
> Going by [2], the instructions are very different with each method, so
> other machines with different latencies on those instructions might
> show something different. I attached what I used to test if anyone
> else wants a go.

I get about 0.1% difference on my machine. Both ways boil down to (on
gcc) 3 instructions with low latency. The later ones need the prior
results to execute, which I think is what the XXX comment "isn't great"
was referring to. The new coding is more mysterious (does it do the
right thing on all platforms?), so I guess the original is still the way
to go unless we get a better idea.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
On Fri, 11 Nov 2022 at 22:20, John Naylor wrote:
>
> On Wed, Oct 12, 2022 at 4:37 PM David Rowley wrote:
> > [v3]
>
> + /*
> +  * Compute a shift that guarantees that shifting chunksPerBlock with it
> +  * yields is smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used
> +  * for full blocks).
> +  */
> + slab->freelist_shift = 0;
> + while ((slab->chunksPerBlock >> slab->freelist_shift) >=
> +        (SLAB_FREELIST_COUNT - 1))
> +     slab->freelist_shift++;
>
> + /*
> +  * Ensure, without a branch, that index 0 is only used for blocks
> +  * entirely without free chunks.
> +  * XXX: There probably is a cheaper way to do this. Needing to shift
> +  * twice by slab->freelist_shift isn't great.
> +  */
> + index = (freecount + (1 << slab->freelist_shift) - 1) >>
> +     slab->freelist_shift;
>
> How about something like
>
> #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

Doesn't this create a sort of round-robin use of the free list? What we
want is a sort of "histogram" bucket set of free lists so we can group
together blocks that have a close-enough free number of chunks. Unless
I'm mistaken, I think what you have doesn't do that.

I wondered if simply:

index = -(-freecount >> slab->freelist_shift);

would be faster than Andres' version. I tried it out and on my AMD
machine, it's about the same speed. Same on a Raspberry Pi 4.

Going by [2], the instructions are very different with each method, so
other machines with different latencies on those instructions might show
something different. I attached what I used to test if anyone else wants
a go.

AMD Zen2:

$ ./freecount 20
Test 'a' in 0.922766 seconds
Test 'd' in 0.922762 seconds (0.000433% faster)

RPI4:

$ ./freecount 20
Test 'a' in 3.341350 seconds
Test 'd' in 3.338690 seconds (0.079672% faster)

That was gcc. Trying it with clang, it went in a little heavy-handed and
optimized out my loop, so some more trickery might be needed for a
useful test on that compiler.
David

[2] https://godbolt.org/z/dh95TohEG

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLAB_FREELIST_COUNT 9

int main(int argc, char **argv)
{
	clock_t start, end;
	double v1_time, v2_time;
	int freecount;
	int freelist_shift = 0;
	int chunksPerBlock;
	int index;
	int zerocount = 0;

	if (argc < 2)
	{
		printf("Syntax: %s <chunksPerBlock>\n", argv[0]);
		return -1;
	}

	chunksPerBlock = atoi(argv[1]);
	printf("chunksPerBlock = %d\n", chunksPerBlock);

	while ((chunksPerBlock >> freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
		freelist_shift++;

	printf("freelist_shift = %d\n", freelist_shift);

	start = clock();
	for (freecount = 0; freecount <= chunksPerBlock; freecount++)
	{
		index = (freecount + (1 << freelist_shift) - 1) >> freelist_shift;

		/* try to prevent optimizing the above out */
		if (index == 0)
			zerocount++;
	}
	end = clock();

	v1_time = (double) (end - start) / CLOCKS_PER_SEC;
	printf("Test 'a' in %f seconds\n", v1_time);
	printf("zerocount = %d\n", zerocount);

	zerocount = 0;
	start = clock();
	for (freecount = 0; freecount <= chunksPerBlock; freecount++)
	{
		index = -(-freecount >> freelist_shift);

		/* try to prevent optimizing the above out */
		if (index == 0)
			zerocount++;
	}
	end = clock();

	v2_time = (double) (end - start) / CLOCKS_PER_SEC;
	printf("Test 'd' in %f seconds (%f%% faster)\n", v2_time,
		   v1_time / v2_time * 100.0 - 100.0);
	printf("zerocount = %d\n", zerocount);

	return 0;
}
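The round-robin vs. histogram distinction David raises can be made concrete by comparing the two formulas directly. This is a small standalone demonstration, not part of any patch; note that right-shifting a negative value is implementation-defined in C, though arithmetic shift is the behavior on the platforms discussed here.

```c
#include <assert.h>

#define SLAB_FREELIST_COUNT 9

/* John's masking formula: the index wraps around, round-robin style. */
static int
index_mask(int freecount)
{
	return (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
}

/*
 * David's shift formula: ceil(freecount / 2^shift), i.e. monotone
 * "histogram" buckets grouping blocks with similar free-chunk counts,
 * with index 0 reserved for fully allocated blocks.
 */
static int
index_shift(int freecount, int shift)
{
	return -(-freecount >> shift);
}
```

With chunksPerBlock = 64 the computed shift is 4, and the shift formula maps freecount 1..16 to bucket 1, 17..32 to bucket 2, and so on, while the mask formula maps 7 to index 8 but 8 back to index 1, so blocks with nearly identical free counts land on unrelated lists.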
Re: slab allocator performance issues
On Wed, Oct 12, 2022 at 4:37 PM David Rowley wrote:
> [v3]

+ /*
+  * Compute a shift that guarantees that shifting chunksPerBlock with it
+  * yields is smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used
+  * for full blocks).
+  */
+ slab->freelist_shift = 0;
+ while ((slab->chunksPerBlock >> slab->freelist_shift) >=
+        (SLAB_FREELIST_COUNT - 1))
+     slab->freelist_shift++;

+ /*
+  * Ensure, without a branch, that index 0 is only used for blocks entirely
+  * without free chunks.
+  * XXX: There probably is a cheaper way to do this. Needing to shift twice
+  * by slab->freelist_shift isn't great.
+  */
+ index = (freecount + (1 << slab->freelist_shift) - 1) >>
+     slab->freelist_shift;

How about something like

#define SLAB_FREELIST_COUNT ((1<<3) + 1)
index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

and dispense with both freelist_shift and the loop that computes it?

--
John Naylor
EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
On Sat, 11 Sept 2021 at 09:07, Tomas Vondra wrote:
> I've been investigating the regressions in some of the benchmark
> results, together with the generation context benchmarks [1].

I've not looked into the regression you found with this yet, but I did
rebase the patch. slab.c has seen quite a number of changes recently.

I didn't spend a lot of time checking over the patch. I mainly wanted to
see what the performance was like before reviewing in too much detail.

To test the performance, I used [1] and ran:

select pg_allocate_memory_test(<chunk size>, 1024*1024,
  10::bigint*1024*1024*1024, 'slab');

that basically allocates chunks of <chunk size> and keeps around 1MB of
them at a time and allocates a total of 10GBs of them. I saw:

Master:
16 byte chunk = 8754.678 ms
32 byte chunk = 4511.725 ms
64 byte chunk = 2244.885 ms
128 byte chunk = 1135.349 ms
256 byte chunk = 548.030 ms
512 byte chunk = 272.017 ms
1024 byte chunk = 144.618 ms

Master + attached patch:
16 byte chunk = 5255.974 ms
32 byte chunk = 2640.807 ms
64 byte chunk = 1328.949 ms
128 byte chunk = 668.078 ms
256 byte chunk = 330.564 ms
512 byte chunk = 166.844 ms
1024 byte chunk = 85.399 ms

So the patched version runs in about 60% of the time that master runs
in. I plan to look at the patch in a bit more detail and see if I can
recreate and figure out the regression that Tomas reported. For now, I
just want to share the rebased patch.

The only thing I really adjusted from Andres' version is that instead of
using pointers for the linked list block freelist, I made it store the
number of bytes into the block that the chunk is. This means we can use
4 bytes instead of 8 bytes for these pointers. The block size is limited
to 1GB now anyway, so 32-bit is large enough for these offsets.
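The offset-based freelist linking described in the last paragraph can be sketched as follows. This is a simplified, hypothetical illustration rather than the actual patch code: each free chunk stores, in its own first 4 bytes, the byte offset from the block start to the next free chunk, halving the link size compared to an 8-byte pointer on 64-bit platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFFSET_END UINT32_MAX	/* sentinel: end of the free list */

/* Illustrative block; not the real SlabBlock layout. */
typedef struct Block
{
	uint32_t	freehead;	/* offset of the first free chunk, or OFFSET_END */
	char		data[1024];
} Block;

static void
push_free(Block *b, void *chunk)
{
	/* store the old head's offset in the freed chunk itself */
	*(uint32_t *) chunk = b->freehead;
	b->freehead = (uint32_t) ((char *) chunk - b->data);
}

static void *
pop_free(Block *b)
{
	char	   *chunk;

	if (b->freehead == OFFSET_END)
		return NULL;
	chunk = b->data + b->freehead;
	b->freehead = *(uint32_t *) chunk;	/* follow the stored offset */
	return chunk;
}
```

Since blocks are capped at 1GB, every chunk's offset fits comfortably in 32 bits, which is what makes the 4-byte link safe.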
David

[1] https://www.postgresql.org/message-id/attachment/137056/allocate_performance_functions.patch.txt

From 2b89d993b0294d5c0fe0a8333bccc555337ac979 Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Wed, 12 Oct 2022 09:30:24 +1300
Subject: [PATCH v3] WIP: slab performance.

Author: Andres Freund
Discussion: https://postgr.es/m/20210717194333.mr5io3zup3kxa...@alap3.anarazel.de
---
 src/backend/utils/mmgr/slab.c | 449 ++
 1 file changed, 294 insertions(+), 155 deletions(-)

diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index 1a0b28f9ea..58b8e2d67c 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -23,15 +23,18 @@
  * global (context) level. This is possible as the chunk size (and thus also
  * the number of chunks per block) is fixed.
  *
- * On each block, free chunks are tracked in a simple linked list. Contents
- * of free chunks is replaced with an index of the next free chunk, forming
- * a very simple linked list. Each block also contains a counter of free
- * chunks. Combined with the local block-level freelist, it makes it trivial
- * to eventually free the whole block.
+ * On each block, never allocated chunks are tracked by a simple offset, and
+ * free chunks are tracked in a simple linked list. The offset approach
+ * avoids needing to iterate over all chunks when allocating a new block,
+ * which would cause page faults and cache pollution. Contents of free chunks
+ * is replaced with a the offset (in bytes) from the block pointer to the
+ * next free chunk, forming a very simple linked list. Each block also
+ * contains a counter of free chunks. Combined with the local block-level
+ * freelist, it makes it trivial to eventually free the whole block.
  *
  * At the context level, we use 'freelist' to track blocks ordered by number
  * of free chunks, starting with blocks having a single allocated chunk, and
- * with completely full blocks on the tail.
+ * with completely full blocks on the tail. XXX
  *
  * This also allows various optimizations - for example when searching for
  * free chunk, the allocator reuses space from the fullest blocks first, in
@@ -44,7 +47,7 @@
  * case this performs as if the pointer was not maintained.
  *
  * We cache the freelist index for the blocks with the fewest free chunks
- * (minFreeChunks), so that we don't have to search the freelist on every
+ * (minFreeChunkIndex), so that we don't have to search the freelist on every
  * SlabAlloc() call, which is quite expensive.
  *
  *-------------------------------------------------------------------------
 */
@@ -60,6 +63,17 @@

 #define Slab_BLOCKHDRSZ	MAXALIGN(sizeof(SlabBlock))

+struct SlabBlock;
+struct SlabChunk;
+
+/*
+ * Number of actual freelists + 1, for full blocks. Full blocks are always at
+ * offset 0.
+ */
+#define SLAB_FREELIST_COUNT 9
+
+#define SLAB_RETAIN_EMPTY_BLOCK_COUNT 10
+
 /*
  * SlabContext is a specialized implementation of MemoryContext.
  */
@@ -72,13 +86,16 @@ typedef struct SlabContext
 	Size		blockSize;		/* block size */
Re: slab allocator performance issues
Hi,

I've been investigating the regressions in some of the benchmark
results, together with the generation context benchmarks [1].

Turns out it's pretty difficult to benchmark this, because the results
strongly depend on what the backend did before. For example if I run
slab_bench_fifo with the "decreasing" test for 32kB blocks and 512B
chunks, I get this:

select * from slab_bench_fifo(100, 32768, 512, 100, 1, 5000);

 mem_allocated | alloc_ms | free_ms
---------------+----------+---------
     528547840 |   155394 |   87440

i.e. palloc() takes ~155ms and pfree() ~87ms (and these results are
stable, the numbers don't change much with more runs).

But if I run a set of "lifo" tests in the backend first, the results
look like this:

 mem_allocated | alloc_ms | free_ms
---------------+----------+---------
     528547840 |    41728 |   71524
(1 row)

so the pallocs are suddenly about ~4x faster. Clearly, what the backend
did before may have a pretty dramatic impact on results, even for simple
benchmarks like this.

Note: The benchmark was a single SQL script, running all the different
workloads in the same backend.

I did a fair amount of perf profiling, and the main difference between
the slow and fast runs seems to be this:

           0      page-faults:u
           0      minor-faults:u
           0      major-faults:u

vs

  20,634,153      page-faults:u
  20,634,153      minor-faults:u
           0      major-faults:u

Attached is a more complete perf stat output, but the page faults seem
to be the main issue. My theory is that in the "fast" case, the past
backend activity puts the glibc memory management into a state that
prevents page faults in the benchmark.

But of course, this theory may be incomplete - for example it's not
clear why running the benchmark repeatedly would not "condition" the
backend the same way. But it doesn't - it's ~150ms even for repeated
runs. Secondly, I'm not sure this explains why some of the timings
actually got much slower with the 0003 patch, when the sequence of the
steps is still the same. Of course, it's possible 0003 changes the
allocation pattern a bit, interfering with glibc memory management.

This leads to a couple of interesting questions, I think:

1) I've only tested this on Linux, with glibc. I wonder how it'd behave
on other platforms, or with other allocators.

2) Which cases are more important? When the backend was warmed up, or
when each benchmark runs in a new backend? It seems the "new backend"
case is something like a "worst case" leading to more page faults, so
maybe that's the thing to watch. OTOH it's unlikely to have a completely
new backend, so maybe not.

3) Can this teach us something about how to allocate stuff, to better
"prepare" the backend for future allocations? For example, it's a bit
strange that repeated runs of the same benchmark don't do the trick, for
some reason.

regards

[1] https://www.postgresql.org/message-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

 Performance counter stats for process id '11829':

 219,869,739,737      cycles:u
 119,722,280,436      instructions:u             #    0.54  insn per cycle
                                                 #    1.56  stalled cycles per insn
   7,784,566,858      cache-references:u
   2,487,257,287      cache-misses:u             #   31.951 % of all cache refs
   5,942,054,520      bus-cycles:u
               0      page-faults:u
               0      minor-faults:u
               0      major-faults:u
 187,181,661,719      stalled-cycles-frontend:u  #   85.13% frontend cycles idle
 144,274,017,071      stalled-cycles-backend:u   #   65.62% backend cycles idle

    60.000876248 seconds time elapsed

 Performance counter stats for process id '11886':

 145,093,090,692      cycles:u
  74,986,543,212      instructions:u             #    0.52  insn per cycle
                                                 #    2.63  stalled cycles per insn
   4,753,764,781      cache-references:u
   1,342,653,549      cache-misses:u             #   28.244 % of all cache refs
   3,925,175,515      bus-cycles:u
      20,634,153      page-faults:u
      20,634,153      minor-faults:u
Re: slab allocator performance issues
On 8/1/21 11:07 PM, Andres Freund wrote:
> Hi,
>
> On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:
>> In the attached .ods file with results, the "comparison" sheets are the
>> interesting ones - the last couple columns compare the main metrics for
>> the two patches (labeled patch-1 and patch-2) to master.
>
> I assume with patch-1/2 you mean the ones after the benchmark patch
> itself?

Yes, those are the two WIP patches you shared on 19/7.

>> Overall, the results look quite good - patch-1 is mostly on par with
>> master, with maybe 5% variability in both directions. That's expected,
>> considering the patch does not aim to improve performance.
>
> Not for slab anyway...

Maybe the hot/cold separation could have some effect, but probably not
for the workloads I've tested.

>> The second patch brings some nice improvements - 30%-50% in most cases
>> (for both allocation and free) seems pretty nice. But for the
>> "increase" FIFO pattern (incrementally allocating/freeing more memory)
>> there's a significant regression - particularly for the allocation
>> time. In some cases (larger chunks, block size does not matter too
>> much) it jumps from 25ms to almost 200ms.
>
> I'm not surprised to see some memory usage increase some, but that
> degree of time overhead does surprise me. ISTM there's something wrong.

Yeah, the higher amount of allocated memory is due to the couple fields
added to the SlabBlock struct, but even that only affects a single case
with 480B chunks and 1kB blocks. Seems fine to me, especially if we end
up growing the slab blocks. Not sure about the allocation time, though.

> It'd probably worth benchmarking the different improvements inside the
> WIP: slab performance. patch. There's some that I'd expect to be all
> around improvements, whereas others likely aren't quite that clear cut.
> I assume you'd prefer that I split the patch up?

Yeah, if you split that patch into sensible parts, I'll benchmark those.
Also, we can add more interesting workloads if you have some ideas.
This seems unfortunate - the allocation pattern (FIFO, allocating more memory over time) seems pretty common, and the slowdown is significant. Did you analyze what causes the regressions? No, not yet. I'll run the same set of benchmarks for the Generation, discussed in the other thread, and then I'll investigate this. But if you split the patch, that'd probably help. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: slab allocator performance issues
Hi,

On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:
> In the attached .ods file with results, the "comparison" sheets are the
> interesting ones - the last couple columns compare the main metrics for the
> two patches (labeled patch-1 and patch-2) to master.

I assume with patch-1/2 you mean the ones after the benchmark patch itself?

> Overall, the results look quite good - patch-1 is mostly on par with master,
> with maybe 5% variability in both directions. That's expected, considering
> the patch does not aim to improve performance.

> The second patch brings some nice improvements - 30%-50% in most cases (for
> both allocation and free) seems pretty nice. But for the "increase" FIFO
> pattern (incrementally allocating/freeing more memory) there's a significant
> regression - particularly for the allocation time. In some cases (larger
> chunks, block size does not matter too much) it jumps from 25ms to almost
> 200ms.

I'm not surprised to see some memory usage increase some, but that degree of time overhead does surprise me. ISTM there's something wrong.

It'd probably be worth benchmarking the different improvements inside the "WIP: slab performance." patch. There's some that I'd expect to be all around improvements, whereas others likely aren't quite that clear cut. I assume you'd prefer that I split the patch up?

> This seems unfortunate - the allocation pattern (FIFO, allocating more
> memory over time) seems pretty common, and the slowdown is significant.

Did you analyze what causes the regressions?

Greetings,

Andres Freund
Re: slab allocator performance issues
On Tue, 20 Jul 2021 at 11:15, David Rowley wrote:
> On Tue, 20 Jul 2021 at 10:04, Ranier Vilela wrote:
> > Perhaps you would agree with me that in the most absolute of times, malloc
> > will not fail.
> > So it makes more sense to test:
> > if (ret != NULL)
> > than
> > if (ret == NULL)
>
> I think it'd be better to use unlikely() for that.

Sure, it can be, but in this case, there is no way to reduce the scope.

regards,
Ranier Vilela
Re: slab allocator performance issues
On Tue, 20 Jul 2021 at 10:04, Ranier Vilela wrote: > Perhaps you would agree with me that in the most absolute of times, malloc > will not fail. > So it makes more sense to test: > if (ret != NULL) > than > if (ret == NULL) I think it'd be better to use unlikely() for that. David
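David's suggestion is what PostgreSQL's likely()/unlikely() macros (thin wrappers around GCC's __builtin_expect) are for: keep the NULL test written the conventional way, but tell the compiler which outcome is expected. A minimal standalone sketch of the idea (hypothetical macros and function, not the actual c.h definitions):

```c
#include <stdlib.h>
#include <string.h>

/* GCC/Clang branch-prediction hints; harmless no-ops elsewhere. */
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) (x)
#endif

/*
 * Allocate and zero a block.  The failure branch is annotated as unlikely,
 * so the compiler lays out the success path as the straight-line code.
 */
void *
alloc_block(size_t size)
{
	void	   *block = malloc(size);

	if (unlikely(block == NULL))
		return NULL;			/* cold path: malloc failure */

	memset(block, 0, size);		/* hot path continues here */
	return block;
}
```

This keeps the readability of `if (ret == NULL)` while getting the same code layout benefit Ranier was after.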
Re: slab allocator performance issues
On Mon, 19 Jul 2021 at 17:56, Andres Freund wrote:
> Hi,
>
> On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
> > Sounds great! Thanks for investigating this and for the improvements.
> >
> > It might be good to do some experiments to see how the changes affect
> > memory consumption for practical workloads. I'm willing to spend some
> > time on that, if needed.
>
> I've attached my changes. They're in a rough shape right now, but I
> think good enough for an initial look.

Hi Andres, I took a look.

Perhaps you would agree with me that in the most absolute of times, malloc will not fail. So it makes more sense to test:

if (ret != NULL)

than

if (ret == NULL)

which might help branch prediction.

With this change we also win the possibility of reducing the scope of some variables. Example:

static void * pg_noinline
AllocSetAllocLarge(AllocSet set, Size size, int flags)
{
	AllocBlock	block;
	Size		chunk_size;
	Size		blksize;

	/* check size, only allocation path where the limits could be hit */
	MemoryContextCheckSize(&set->header, size, flags);

	AssertArg(AllocSetIsValid(set));

	chunk_size = MAXALIGN(size);
	blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
	block = (AllocBlock) malloc(blksize);
	if (block != NULL)
	{
		AllocChunk	chunk;

		set->header.mem_allocated += blksize;

		block->aset = set;
		block->freeptr = block->endptr = ((char *) block) + blksize;

		/*
		 * Stick the new block underneath the active allocation block, if any,
		 * so that we don't lose the use of the space remaining therein.
		 */
		if (set->blocks != NULL)
		{
			block->prev = set->blocks;
			block->next = set->blocks->next;
			if (block->next)
				block->next->prev = block;
			set->blocks->next = block;
		}
		else
		{
			block->prev = NULL;
			block->next = NULL;
			set->blocks = block;
		}

		chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
		chunk->size = chunk_size;

		return AllocSetAllocReturnChunk(set, size, chunk, chunk_size);
	}

	return NULL;
}

regards,
Ranier Vilela
Re: slab allocator performance issues
Hi, On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote: > Sounds great! Thanks for investigating this and for the improvements. > > It might be good to do some experiments to see how the changes affect memory > consumption for practical workloads. I'm willing to spend soem time on that, > if needed. I've attached my changes. They're in a rough shape right now, but I think good enough for an initial look. Greetings, Andres Freund >From af4cd1f0b64cd52d7eab342493e3dfd6b0d8388e Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 19 Jul 2021 12:55:51 -0700 Subject: [PATCH 1/2] WIP: optimize allocations by separating hot from cold paths. --- src/include/nodes/memnodes.h| 4 +- src/include/utils/memutils.h| 13 + src/backend/utils/mmgr/aset.c | 537 ++-- src/backend/utils/mmgr/generation.c | 22 +- src/backend/utils/mmgr/mcxt.c | 179 +++--- src/backend/utils/mmgr/slab.c | 14 +- 6 files changed, 354 insertions(+), 415 deletions(-) diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h index e6a757d6a07..8a42d2ff999 100644 --- a/src/include/nodes/memnodes.h +++ b/src/include/nodes/memnodes.h @@ -57,10 +57,10 @@ typedef void (*MemoryStatsPrintFunc) (MemoryContext context, void *passthru, typedef struct MemoryContextMethods { - void *(*alloc) (MemoryContext context, Size size); + void *(*alloc) (MemoryContext context, Size size, int flags); /* call this free_p in case someone #define's free() */ void (*free_p) (MemoryContext context, void *pointer); - void *(*realloc) (MemoryContext context, void *pointer, Size size); + void *(*realloc) (MemoryContext context, void *pointer, Size size, int flags); void (*reset) (MemoryContext context); void (*delete_context) (MemoryContext context); Size (*get_chunk_space) (MemoryContext context, void *pointer); diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h index ff872274d44..2f75b4cca46 100644 --- a/src/include/utils/memutils.h +++ b/src/include/utils/memutils.h @@ -147,6 +147,19 @@ extern void 
MemoryContextCreate(MemoryContext node, extern void HandleLogMemoryContextInterrupt(void); extern void ProcessLogMemoryContextInterrupt(void); +extern void *MemoryContextAllocationFailure(MemoryContext context, Size size, int flags); + +extern void MemoryContextSizeFailure(MemoryContext context, Size size, int flags) pg_attribute_noreturn(); + +static inline void +MemoryContextCheckSize(MemoryContext context, Size size, int flags) +{ + if (unlikely(!AllocSizeIsValid(size))) + { + if (!(flags & MCXT_ALLOC_HUGE) || !AllocHugeSizeIsValid(size)) + MemoryContextSizeFailure(context, size, flags); + } +} /* * Memory-context-type-specific functions diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index 77872e77bcd..00878354392 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -263,9 +263,9 @@ static AllocSetFreeList context_freelists[2] = /* * These functions implement the MemoryContext API for AllocSet contexts. */ -static void *AllocSetAlloc(MemoryContext context, Size size); +static void *AllocSetAlloc(MemoryContext context, Size size, int flags); static void AllocSetFree(MemoryContext context, void *pointer); -static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size); +static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size, int flags); static void AllocSetReset(MemoryContext context); static void AllocSetDelete(MemoryContext context); static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer); @@ -704,266 +704,10 @@ AllocSetDelete(MemoryContext context) free(set); } -/* - * AllocSetAlloc - * Returns pointer to allocated memory of given size or NULL if - * request could not be completed; memory is added to the set. - * - * No request may exceed: - * MAXALIGN_DOWN(SIZE_MAX) - ALLOC_BLOCKHDRSZ - ALLOC_CHUNKHDRSZ - * All callers use a much-lower limit. 
- * - * Note: when using valgrind, it doesn't matter how the returned allocation - * is marked, as mcxt.c will set it to UNDEFINED. In some paths we will - * return space that is marked NOACCESS - AllocSetRealloc has to beware! - */ -static void * -AllocSetAlloc(MemoryContext context, Size size) +static inline void * +AllocSetAllocReturnChunk(AllocSet set, Size size, AllocChunk chunk, Size chunk_size) { - AllocSet set = (AllocSet) context; - AllocBlock block; - AllocChunk chunk; - int fidx; - Size chunk_size; - Size blksize; - - AssertArg(AllocSetIsValid(set)); - - /* - * If requested size exceeds maximum for chunks, allocate an entire block - * for this request. - */ - if (size > set->allocChunkLimit) - { - chunk_size = MAXALIGN(size); - blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; - block = (AllocBlock) malloc(blksize); - if (block == NULL) - return NULL; - - context->mem_allocated += blksize; - - block->aset = set; -
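The central idea of the patch above - a small, inlinable hot path with the rare cases (large requests, fresh blocks, failures) pushed out into separate noinline functions such as AllocSetAllocLarge - can be modeled in miniature like this (a hypothetical bump allocator, not the patch's actual code):

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct Arena
{
	char	   *freeptr;		/* next free byte in the current block */
	char	   *endptr;			/* one past the end of the current block */
} Arena;

/*
 * Cold path: request doesn't fit in the current block.  Marked noinline so
 * the compiler keeps the inlined hot path small, with fewer spilled
 * registers at each call site.  (A real allocator would link the new block
 * into the arena; here we just fall back to malloc.)
 */
__attribute__((noinline)) static void *
arena_alloc_large(Arena *a, size_t size)
{
	(void) a;
	return malloc(size);
}

/*
 * Hot path: one comparison and a pointer bump, cheap enough to inline
 * into every caller.
 */
static inline void *
arena_alloc(Arena *a, size_t size)
{
	if (size <= (size_t) (a->endptr - a->freeptr))
	{
		void	   *p = a->freeptr;

		a->freeptr += size;
		return p;
	}
	return arena_alloc_large(a, size);
}
```

The split matters because the hot path's generated code (and its effect on the caller's register allocation) stays tiny, which is exactly what the hot/cold separation in the patch is trying to achieve for AllocSetAlloc.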
Re: slab allocator performance issues
On 7/18/21 3:06 AM, Andres Freund wrote:
> Hi,
>
> On 2021-07-17 16:10:19 -0700, Andres Freund wrote:
>> Instead of populating a linked list with all chunks upon creation of a
>> block - which requires touching a fair bit of memory - keep a per-block
>> pointer (or an offset) into "unused" area of the block. When allocating
>> from the block and there's still "unused" memory left, use that, instead
>> of bothering with the freelist.
>>
>> I tried that, and it nearly got slab up to the allocation/freeing
>> performance of aset.c (while winning after allocation, due to the higher
>> memory density).
>
> Combining that with limiting the number of freelists, and some
> microoptimizations, allocation performance is now on par. Freeing still
> seems to be a tad slower, mostly because SlabFree() practically is
> immediately stalled on fetching the block, whereas AllocSetFree() can
> happily speculate ahead and do work like computing the freelist index. And
> then aset only needs to access memory inside the context - which is much
> more likely to be in cache than a freelist inside a block (there are many
> more).
>
> But that's ok, I think. It's close and it's only a small share of the
> overall runtime of my workload...

Sounds great! Thanks for investigating this and for the improvements.

It might be good to do some experiments to see how the changes affect memory consumption for practical workloads. I'm willing to spend some time on that, if needed.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: slab allocator performance issues
Hi, On 2021-07-17 16:10:19 -0700, Andres Freund wrote: > Instead of populating a linked list with all chunks upon creation of a block - > which requires touching a fair bit of memory - keep a per-block pointer (or an > offset) into "unused" area of the block. When allocating from the block and > theres still "unused" memory left, use that, instead of bothering with the > freelist. > > I tried that, and it nearly got slab up to the allocation/freeing performance > of aset.c (while winning after allocation, due to the higher memory density). Combining that with limiting the number of freelists, and some microoptimizations, allocation performance is now on par. Freeing still seems to be a tad slower, mostly because SlabFree() practically is immediately stalled on fetching the block, whereas AllocSetFree() can happily speculate ahead and do work like computing the freelist index. And then aset only needs to access memory inside the context - which is much more likely to be in cache than a freelist inside a block (there are many more). But that's ok, I think. It's close and it's only a small share of the overall runtime of my workload... Greetings, Andres Freund
Re: slab allocator performance issues
Hi, On 2021-07-18 00:46:09 +0200, Tomas Vondra wrote: > On 7/17/21 11:14 PM, Andres Freund wrote: > > Hm. I wonder if we should just not populate the freelist eagerly, to > > drive down the initialization cost. I.e. have a separate allocation path > > for chunks that have never been allocated, by having a > > SlabBlock->free_offset or such. > > > > Sure, it adds a branch to the allocation happy path, but it also makes the > > first allocation for a chunk cheaper, because there's no need to get the > > next > > element from the freelist (adding a likely cache miss). And it should make > > the > > allocation of a new block faster by a lot. > > > > Not sure what you mean by 'not populate eagerly' so can't comment :-( Instead of populating a linked list with all chunks upon creation of a block - which requires touching a fair bit of memory - keep a per-block pointer (or an offset) into "unused" area of the block. When allocating from the block and theres still "unused" memory left, use that, instead of bothering with the freelist. I tried that, and it nearly got slab up to the allocation/freeing performance of aset.c (while winning after allocation, due to the higher memory density). > > > > 4) Less of a performance, and more of a usability issue: The constant > > > > block size strikes me as problematic. Most users of an allocator can > > > > sometimes be used with a small amount of data, and sometimes with a > > > > large amount. > > > > > > > > > > I doubt this is worth the effort, really. The constant block size makes > > > various places much simpler (both to code and reason about), so this > > > should > > > not make a huge difference in performance. And IMHO the block size is > > > mostly > > > an implementation detail, so I don't see that as a usability issue. > > > > Hm? It's something the user has to specify, so I it's not really an > > implementation detail. 
It needs to be specified without sufficient > > information, as well, since externally one doesn't know how much memory the > > block header and chunk headers + rounding up will use, so computing a good > > block size isn't easy. I've wondered whether it should just be a count... > > > > I think this is mixing two problems - how to specify the block size, and > whether the block size is constant (as in slab) or grows over time (as in > allocset). That was in response to the "implementation detail" bit solely. > But growing the block size seems problematic for long-lived contexts with > workloads that change a lot over time - imagine e.g. decoding many small > transactions, with one huge transaction mixed in. The one huge transaction > will grow the block size, and we'll keep using it forever. But in that case > we might have just as well allocate the large blocks from the start, I > guess. I was thinking of capping the growth fairly low. I don't think after a 16x growth or so you're likely to still see allocation performance gains with slab. And I don't think that'd be too bad for decoding - we'd start with a small initial block size, and in many workloads that will be enough, and just workloads where that doesn't suffice will adapt performance wise. And: Medium term I wouldn't expect reorderbuffer.c to stay the only slab.c user... > > Why do you not think it's relevant for performance? Either one causes too > > much > > memory usage by using a too large block size, wasting memory, or one ends up > > loosing perf through frequent allocations? > > True. I simply would not expect this to make a huge difference - I may be > wrong, and I'm sure there are workloads where it matters. But I still think > it's easier to just use larger blocks than to make the slab code more > complex. IDK. I'm looking at using slab as part of a radix tree implementation right now. Which I'd hope to be used in various different situations. 
So it's hard to choose the right block size - and it does seem to matter for performance. Greetings, Andres Freund
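The "don't populate the freelist eagerly" scheme quoted at the top of this message - a per-block offset into the never-yet-allocated tail, consulted only when the recycle list is empty - can be sketched as a self-contained toy (names like block_alloc are made up; slab.c's real structures differ):

```c
#include <stdlib.h>

#define CHUNK_SIZE			64
#define CHUNKS_PER_BLOCK	128

typedef struct Chunk
{
	struct Chunk *next;			/* freelist link, stored inside the chunk */
} Chunk;

typedef struct Block
{
	Chunk	   *freehead;		/* chunks that have been freed at least once */
	int			unused;			/* index of first never-allocated chunk */
	char		data[CHUNK_SIZE * CHUNKS_PER_BLOCK];
} Block;

/*
 * Allocate: prefer recycled chunks; otherwise bump into the untouched tail
 * of the block.  A freshly created block therefore needs no per-chunk
 * initialization at all - no linked list to build, no memory to touch.
 */
static void *
block_alloc(Block *b)
{
	if (b->freehead)			/* freelist hit */
	{
		Chunk	   *c = b->freehead;

		b->freehead = c->next;
		return c;
	}
	if (b->unused < CHUNKS_PER_BLOCK)	/* untouched area left */
		return b->data + (b->unused++) * CHUNK_SIZE;
	return NULL;				/* block full */
}

static void
block_free(Block *b, void *p)
{
	Chunk	   *c = p;

	c->next = b->freehead;		/* the freed chunk itself stores the link */
	b->freehead = c;
}
```

The extra branch in block_alloc is the cost Andres mentions; the payoff is that block creation becomes O(1) instead of O(chunks per block), and the first allocation of each chunk avoids a likely cache miss on a freelist entry.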
Re: slab allocator performance issues
On 7/17/21 11:14 PM, Andres Freund wrote:
> Hi,
>
> On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
>> On 7/17/21 9:43 PM, Andres Freund wrote:
>>> 1) If allocations are short-lived, slab.c can end up constantly
>>> freeing/initializing blocks. Which requires fairly expensively iterating
>>> over all potential chunks in the block and initializing it. Just to then
>>> free that memory again after a small number of allocations. The extreme
>>> case of this is when there are phases of alloc/free of a single
>>> allocation.
>>>
>>> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
>>> problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course
>>> that only works if the problem is with the only, so it's not a great
>>> approach. Perhaps just keeping the last allocated block around would
>>> work?
>>
>> +1
>>
>> I think it makes perfect sense to not free the blocks immediately, and
>> keep one (or a small number) as a cache. I'm not sure why we decided not
>> to have a "keeper" block, but I suspect memory consumption was my main
>> concern at that point. But I never expected the cost to be this high.
>
> I think one free block might be too low in some cases. It's pretty common
> to have workloads where the number of allocations is "bursty", and it's
> imo one case where one might justifiably want to use a slab allocator...
>
> Perhaps a portion of a high watermark? Or a portion of the in use blocks?

I think the portion of watermark would be problematic for cases with one huge transaction - that'll set a high watermark, and we'll keep way too many free blocks. But the portion of in use blocks might work, I think.

> Hm. I wonder if we should just not populate the freelist eagerly, to drive
> down the initialization cost. I.e. have a separate allocation path for
> chunks that have never been allocated, by having a SlabBlock->free_offset
> or such.
>
> Sure, it adds a branch to the allocation happy path, but it also makes the
> first allocation for a chunk cheaper, because there's no need to get the
> next element from the freelist (adding a likely cache miss). And it should
> make the allocation of a new block faster by a lot.

Not sure what you mean by 'not populate eagerly' so can't comment :-(

>>> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division,
>>> taking up ~50% of the cycles in SlabFree(). A 64bit div, according to
>>> [1], has a latency of 35-88 cycles on skylake-x (and a reverse throughput
>>> of 21-83, i.e. no parallelism). While it's getting a bit faster on
>>> icelake / zen 3, it's still slow enough there to be very worrisome.
>>>
>>> I don't see a way to get around the division while keeping the freelist
>>> structure as is. But:
>>>
>>> ISTM that we only need the index because of the free-chunk list, right?
>>> Why don't we make the chunk list use actual pointers? Is it concern that
>>> that'd increase the minimum allocation size? If so, I see two ways around
>>> that: First, we could make the index just the offset from the start of
>>> the block, that's much cheaper to calculate. Second, we could store the
>>> next pointer in SlabChunk->slab/block instead (making it a union) - while
>>> on the freelist we don't need to dereference those, right?
>>>
>>> I suspect both would also make the block initialization a bit cheaper.
>>>
>>> That should also accelerate SlabBlockGetChunk(), which currently shows up
>>> as an imul, which isn't exactly fast either (and uses a lot of execution
>>> ports).
>>
>> Hmm, I think you're right we could simply use the pointers, but I have not
>> tried that.
>
> I quickly tried that, and it does seem to improve matters considerably. The
> block initialization still shows up as expensive, but not as bad. The div
> and imul are gone (except in an assertion build right now). The list
> manipulation still is visible.

Understood. I didn't expect this to be a full solution.

>>> 3) Every free/alloc needing to unlink from slab->freelist[i] and then
>>> relink shows up prominently in profiles. That's ~tripling the number of
>>> cachelines touched in the happy path, with unpredictable accesses to
>>> boot.
>>>
>>> Perhaps we could reduce the precision of slab->freelist indexing to
>>> amortize that cost? I.e. not have one slab->freelist entry for each
>>> nfree, but instead have an upper limit on the number of freelists?
>>
>> Yeah. The purpose of organizing the freelists like this is to prioritize
>> the "more full" blocks when allocating new chunks, in the hope that the
>> "less full" blocks will end up empty and freed faster.
>>
>> But this is naturally imprecise, and strongly depends on the workload, of
>> course, and I bet for most cases a less precise approach would work just
>> as fine.
>>
>> I'm not sure how exactly the upper limit you propose would work, but
>> perhaps we could group the blocks for nfree ranges, say [0-15], [16-31]
>> and so on. So after the alloc/free we'd calculate the new freelist index
>> as (nfree/16) and only move the block if the index changed. This would
>> reduce the overhead to 1/16 and probably even more in practice.
>>
>> Of course, we could also say we have e.g. 8 freelists and work the ranges
>> backwards from that, I
Re: slab allocator performance issues
Hi, On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote: > On 7/17/21 9:43 PM, Andres Freund wrote: > > 1) If allocations are short-lived slab.c, can end up constantly > > freeing/initializing blocks. Which requires fairly expensively iterating > > over > > all potential chunks in the block and initializing it. Just to then free > > that > > memory again after a small number of allocations. The extreme case of this > > is > > when there are phases of alloc/free of a single allocation. > > > > I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the > > problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course > > that > > only works if the problem is with the only, so it's not a great > > approach. Perhaps just keeping the last allocated block around would work? > > > > +1 > > I think it makes perfect sense to not free the blocks immediately, and keep > one (or a small number) as a cache. I'm not sure why we decided not to have > a "keeper" block, but I suspect memory consumption was my main concern at > that point. But I never expected the cost to be this high. I think one free block might be too low in some cases. It's pretty common to have workloads where the number of allocations is "bursty", and it's imo one case where one might justifiably want to use a slab allocator... Perhaps a portion of a high watermark? Or a portion of the in use blocks? Hm. I wonder if we should just not populate the freelist eagerly, to drive down the initialization cost. I.e. have a separate allocation path for chunks that have never been allocated, by having a SlabBlock->free_offset or such. Sure, it adds a branch to the allocation happy path, but it also makes the first allocation for a chunk cheaper, because there's no need to get the next element from the freelist (adding a likely cache miss). And it should make the allocation of a new block faster by a lot. > > 2) SlabChunkIndex() in SlabFree() is slow. 
It requires a 64bit division, > > taking > > up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a > > latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83, > > i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, > > it's > > still slow enough there to be very worrisome. > > > > I don't see a way to get around the division while keeping the freelist > > structure as is. But: > > > > ISTM that we only need the index because of the free-chunk list, right? Why > > don't we make the chunk list use actual pointers? Is it concern that that'd > > increase the minimum allocation size? If so, I see two ways around that: > > First, we could make the index just the offset from the start of the block, > > that's much cheaper to calculate. Second, we could store the next pointer in > > SlabChunk->slab/block instead (making it a union) - while on the freelist we > > don't need to dereference those, right? > > > > I suspect both would also make the block initialization a bit cheaper. > > > > That should also accelerate SlabBlockGetChunk(), which currently shows up as > > an imul, which isn't exactly fast either (and uses a lot of execution > > ports). > > > > Hmm, I think you're right we could simply use the pointers, but I have not > tried that. I quickly tried that, and it does seem to improve matters considerably. The block initialization still shows up as expensive, but not as bad. The div and imul are gone (exept in an assertion build right now). The list manipulation still is visible. > > 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink > > shows up prominently in profiles. That's ~tripling the number of cachelines > > touched in the happy path, with unpredictable accesses to boot. > > > > Perhaps we could reduce the precision of slab->freelist indexing to amortize > > that cost? I.e. 
not have one slab->freelist entry for each nfree, but instead
> > have an upper limit on the number of freelists?

> Yeah. The purpose of organizing the freelists like this is to prioritize
> the "more full" blocks when allocating new chunks, in the hope that the
> "less full" blocks will end up empty and freed faster.
>
> But this is naturally imprecise, and strongly depends on the workload, of
> course, and I bet for most cases a less precise approach would work just
> as fine.
>
> I'm not sure how exactly the upper limit you propose would work, but
> perhaps we could group the blocks for nfree ranges, say [0-15], [16-31]
> and so on. So after the alloc/free we'd calculate the new freelist index
> as (nfree/16) and only move the block if the index changed. This would
> reduce the overhead to 1/16 and probably even more in practice.
>
> Of course, we could also say we have e.g. 8 freelists and work the ranges
> backwards from that, I guess that's what you propose.

Yea, I was thinking something along those lines. As you say, how about there's always at most 8 freelists or such. During
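The bucketing being discussed - a fixed number of freelists, with a block's nfree mapped onto ranges so the block is only relinked when it crosses a bucket boundary - could look something like this (a hypothetical formula for illustration; the constants and mapping eventually used in slab.c may differ):

```c
#define SLAB_FREELIST_COUNT 8

/*
 * Map a block's number of free chunks to one of SLAB_FREELIST_COUNT
 * buckets.  Bucket 0 holds completely full blocks; the remaining buckets
 * split [1 .. chunks_per_block] into roughly equal ranges.  A block only
 * has to be unlinked/relinked when this index changes, amortizing the
 * freelist maintenance (and its cacheline traffic) over roughly
 * chunks_per_block / SLAB_FREELIST_COUNT alloc/free calls.
 */
static inline int
freelist_index(int nfree, int chunks_per_block)
{
	if (nfree == 0)
		return 0;
	return 1 + (nfree - 1) * (SLAB_FREELIST_COUNT - 1) / chunks_per_block;
}
```

The mapping is monotonic in nfree, so the "prefer fuller blocks" policy Tomas describes is preserved, just at coarser granularity.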
Re: slab allocator performance issues
Hi, On 7/17/21 9:43 PM, Andres Freund wrote: Hi, I just tried to use the slab allocator for a case where aset.c was bloating memory usage substantially. First: It worked wonders for memory usage, nearly eliminating overhead. But it turned out to cause a *substantial* slowdown. With aset the allocator is barely in the profile. With slab the profile is dominated by allocator performance. slab: NOTICE: 0: 1 ordered insertions in 5.216287 seconds, 19170724/sec LOCATION: bfm_test_insert_bulk, radix.c:2880 Overhead Command Shared Object Symbol + 28.27% postgres postgres [.] SlabAlloc +9.64% postgres bdbench.so[.] bfm_delete +9.03% postgres bdbench.so[.] bfm_set +8.39% postgres bdbench.so[.] bfm_lookup +8.36% postgres bdbench.so[.] bfm_set_leaf.constprop.0 +8.16% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms +6.88% postgres bdbench.so[.] bfm_delete_leaf +3.24% postgres libc-2.31.so [.] _int_malloc +2.58% postgres bdbench.so[.] bfm_tests +2.33% postgres postgres [.] SlabFree +1.29% postgres libc-2.31.so [.] _int_free +1.09% postgres libc-2.31.so [.] unlink_chunk.constprop.0 aset: NOTICE: 0: 1 ordered insertions in 2.082602 seconds, 48016848/sec LOCATION: bfm_test_insert_bulk, radix.c:2880 + 16.43% postgres bdbench.so[.] bfm_lookup + 15.38% postgres bdbench.so[.] bfm_delete + 12.82% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms + 12.65% postgres bdbench.so[.] bfm_set + 12.15% postgres bdbench.so[.] bfm_set_leaf.constprop.0 + 10.57% postgres bdbench.so[.] bfm_delete_leaf +4.05% postgres bdbench.so[.] bfm_tests +2.93% postgres [kernel.vmlinux] [k] clear_page_erms +1.59% postgres postgres [.] AllocSetAlloc +1.15% postgres bdbench.so[.] memmove@plt +1.06% postgres bdbench.so[.] bfm_grow_leaf_16 OS: NOTICE: 0: 1 ordered insertions in 2.089790 seconds, 47851690/sec LOCATION: bfm_test_insert_bulk, radix.c:2880 That is somewhat surprising - part of the promise of a slab allocator is that it's fast... This is caused by multiple issues, I think. 
Some of which seems fairly easy to fix. 1) If allocations are short-lived slab.c, can end up constantly freeing/initializing blocks. Which requires fairly expensively iterating over all potential chunks in the block and initializing it. Just to then free that memory again after a small number of allocations. The extreme case of this is when there are phases of alloc/free of a single allocation. I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that only works if the problem is with the only, so it's not a great approach. Perhaps just keeping the last allocated block around would work? +1 I think it makes perfect sense to not free the blocks immediately, and keep one (or a small number) as a cache. I'm not sure why we decided not to have a "keeper" block, but I suspect memory consumption was my main concern at that point. But I never expected the cost to be this high. 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83, i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's still slow enough there to be very worrisome. I don't see a way to get around the division while keeping the freelist structure as is. But: ISTM that we only need the index because of the free-chunk list, right? Why don't we make the chunk list use actual pointers? Is it concern that that'd increase the minimum allocation size? If so, I see two ways around that: First, we could make the index just the offset from the start of the block, that's much cheaper to calculate. Second, we could store the next pointer in SlabChunk->slab/block instead (making it a union) - while on the freelist we don't need to dereference those, right? I suspect both would also make the block initialization a bit cheaper. 
That should also accelerate SlabBlockGetChunk(), which currently shows up as an imul, which isn't exactly fast either (and uses a lot of execution ports). Hmm, I think you're right we could simply use the pointers, but I have not tried that. 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink shows up prominently in profiles. That's ~tripling the number of cachelines touched in the happy path, with unpredictable accesses to boot. Perhaps we could reduce the precision of slab->freelist indexing to amortize that cost? I.e. not have one slab->freelist entry for each nfree, but instead have an upper limit on the
Re: slab allocator performance issues
Hi, On 2021-07-17 12:43:33 -0700, Andres Freund wrote: > 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, > taking > up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a > latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83, > i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's > still slow enough there to be very worrisome. > > I don't see a way to get around the division while keeping the freelist > structure as is. But: > > ISTM that we only need the index because of the free-chunk list, right? Why > don't we make the chunk list use actual pointers? Is it concern that that'd > increase the minimum allocation size? If so, I see two ways around that: > First, we could make the index just the offset from the start of the block, > that's much cheaper to calculate. Second, we could store the next pointer in > SlabChunk->slab/block instead (making it a union) - while on the freelist we > don't need to dereference those, right? > > I suspect both would also make the block initialization a bit cheaper. > > That should also accelerate SlabBlockGetChunk(), which currently shows up as > an imul, which isn't exactly fast either (and uses a lot of execution ports). Oh - I just saw that effectively the allocation size already is a uintptr_t at minimum. I had only seen /* Make sure the linked list node fits inside a freed chunk */ if (chunkSize < sizeof(int)) chunkSize = sizeof(int); but it's followed by /* chunk, including SLAB header (both addresses nicely aligned) */ fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize); which means we are reserving enough space for a pointer on just about any platform already? Seems we can just make that official and reserve space for a pointer as part of the chunk size rounding up, instead of fullChunkSize? Greetings, Andres Freund
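Making the pointer reservation "official", as suggested above, amounts to folding sizeof(void *) into the chunk-size rounding, so a freed chunk can always carry the freelist's next-pointer in place. A sketch using a MAXALIGN-style macro similar in spirit to PostgreSQL's (the function name slab_chunk_size is made up, and MAXIMUM_ALIGNOF is hardcoded here rather than probed at configure time):

```c
#include <stdint.h>
#include <stddef.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
	(((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & \
	 ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

/*
 * Round a requested chunk size up so that (a) the result is maxaligned and
 * (b) a freed chunk is always large enough to hold a next-pointer for the
 * freelist - replacing the old "at least sizeof(int)" minimum.
 */
static inline size_t
slab_chunk_size(size_t requested)
{
	if (requested < sizeof(void *))
		requested = sizeof(void *);
	return (size_t) MAXALIGN(requested);
}
```

With the chunk guaranteed to fit a pointer, the intrusive freelist needs no index at all: SlabFree can push the chunk's own address, and the 64-bit division in SlabChunkIndex() disappears from the hot path.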