Re: slab allocator performance issues

2022-12-20 Thread David Rowley
On Tue, 20 Dec 2022 at 21:19, John Naylor  wrote:
>
>
> On Tue, Dec 20, 2022 at 10:36 AM David Rowley  wrote:
> >
> > I'm planning on pushing the attached v3 patch shortly. I've spent
> > several days reading over this and testing it in detail along with
> > adding additional features to the SlabCheck code to find more
> > inconsistencies.
>
> FWIW, I reran my test from last week and got similar results.

Thanks a lot for testing that stuff last week.  I got a bit engrossed
in the perf weirdness and forgot to reply.  I found the perf results made
much more sense after using palloc0 and touching the allocated memory just
before freeing, which I think is a more realistic test.
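
Roughly, the pattern I'm describing looks like this (a sketch only, with a
hypothetical helper name; it is not the actual test code):

#include "postgres.h"

/*
 * Sketch only: allocate zeroed chunks from the context under test, then
 * touch each chunk again just before freeing it, so both the allocation and
 * the free path have to reference the chunk's memory the way real callers
 * would.
 */
static void
bench_alloc_touch_free(MemoryContext cxt, Size chunk_size, int nchunks)
{
	char	  **chunks = palloc(sizeof(char *) * nchunks);
	MemoryContext oldcxt = MemoryContextSwitchTo(cxt);

	for (int i = 0; i < nchunks; i++)
		chunks[i] = palloc0(chunk_size);	/* palloc0 touches the new chunk */

	for (int i = 0; i < nchunks; i++)
	{
		chunks[i][0] = 1;		/* touch the memory again just before freeing */
		pfree(chunks[i]);
	}

	MemoryContextSwitchTo(oldcxt);
	pfree(chunks);
}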

I've now pushed the patch after making a small adjustment to the
version I sent earlier.

David




Re: slab allocator performance issues

2022-12-20 Thread John Naylor
On Tue, Dec 20, 2022 at 10:36 AM David Rowley  wrote:
>
> I'm planning on pushing the attached v3 patch shortly. I've spent
> several days reading over this and testing it in detail along with
> adding additional features to the SlabCheck code to find more
> inconsistencies.

FWIW, I reran my test from last week and got similar results.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-12-14 Thread John Naylor
On Tue, Dec 13, 2022 at 7:50 AM David Rowley  wrote:
>
> Thanks for testing the patch.
>
> On Mon, 12 Dec 2022 at 20:14, John Naylor  wrote:

> > While allocation is markedly improved, freeing looks worse here. The
> > proportion is surprising because only about 2% of nodes are freed during
> > the load, but doing that takes up 10-40% of the time compared to allocating.
>
> I've tried to reproduce this with the v13 patches applied and I'm not
> really getting the same as you are. To run the function 100 times I
> used:
>
> select x, a.* from generate_series(1,100) x(x), lateral (select * from
> bench_load_random_int(500 * 1000 * (1+x-x))) a;

Simply running over a longer period of time like this makes the SlabFree
difference much closer to your results, so it doesn't seem out of line
anymore. Here SlabAlloc seems to take maybe 2/3 of the time of current
slab, with a 5% reduction in total time:

500k ints:

v13-0001-0005
average of 30: 217ms

  47.61%  postgres  postgres [.] rt_set
  20.99%  postgres  postgres [.] SlabAlloc
  10.00%  postgres  postgres [.] rt_node_insert_inner.isra.0
   6.87%  postgres  [unknown][k] 0xbce011b7
   3.53%  postgres  postgres [.] MemoryContextAlloc
   2.82%  postgres  postgres [.] SlabFree

+slab v4
average of 30: 206ms

  51.13%  postgres  postgres [.] rt_set
  14.08%  postgres  postgres [.] SlabAlloc
  11.41%  postgres  postgres [.] rt_node_insert_inner.isra.0
   7.44%  postgres  [unknown][k] 0xbce011b7
   3.89%  postgres  postgres [.] MemoryContextAlloc
   3.39%  postgres  postgres [.] SlabFree

It doesn't look mysterious anymore, but I went ahead and took some more
perf measurements, including for cache misses. My naive impression is that
we're spending a bit more time waiting for data, but having to do less work
with it once we get it, which is consistent with your earlier comments:

perf stat -p $pid sleep 2
v13:
          2,001.55 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
           311,690      page-faults:u             #  155.724 K/sec
     3,128,740,701      cycles:u                  #    1.563 GHz
     4,739,333,861      instructions:u            #    1.51  insn per cycle
       820,014,588      branches:u                #  409.690 M/sec
         7,385,923      branch-misses:u           #    0.90% of all branches

+slab v4:
          2,001.09 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
           326,017      page-faults:u             #  162.920 K/sec
     3,016,668,818      cycles:u                  #    1.508 GHz
     4,324,863,908      instructions:u            #    1.43  insn per cycle
       761,839,927      branches:u                #  380.712 M/sec
         7,718,366      branch-misses:u           #    1.01% of all branches


perf stat -e LLC-loads,LLC-loads-misses -p $pid sleep 2
min/max of 3 runs:
v13:  LL cache misses: 25.08% - 25.41%
+slab v4: LL cache misses: 25.74% - 26.01%

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-12-12 Thread David Rowley
Thanks for testing the patch.

On Mon, 12 Dec 2022 at 20:14, John Naylor  wrote:
> v13-0001 to 0005:

>  2.60%  postgres  postgres [.] SlabFree

> + v4 slab:

>4.98%  postgres  postgres  [.] SlabFree
>
> While allocation is markedly improved, freeing looks worse here. The 
> proportion is surprising because only about 2% of nodes are freed during the 
> load, but doing that takes up 10-40% of the time compared to allocating.

I've tried to reproduce this with the v13 patches applied and I'm not
really getting the same as you are. To run the function 100 times I
used:

select x, a.* from generate_series(1,100) x(x), lateral (select * from
bench_load_random_int(500 * 1000 * (1+x-x))) a;

(I had to add the * (1+x-x) to add a lateral dependency to stop the
function just being executed once)

v13-0001 - 0005 gives me:

  37.71%  postgres [.] rt_set
  19.24%  postgres [.] SlabAlloc
   8.73%  [kernel] [k] clear_page_rep
   5.21%  postgres [.] rt_node_insert_inner.isra.0
   2.63%  [kernel] [k] asm_exc_page_fault
   2.24%  postgres [.] SlabFree

and fairly consistently 122 ms runtime per call.

Applying v4 slab patch I get:

  41.06%  postgres [.] rt_set
  10.84%  postgres [.] SlabAlloc
   9.01%  [kernel] [k] clear_page_rep
   6.49%  postgres [.] rt_node_insert_inner.isra.0
   2.76%  postgres [.] SlabFree

and fairly consistently 112 ms per call.

I wonder if you can consistently get the same result on another
compiler or after patching something like master~50 or master~100.
Maybe it's just a code alignment thing.

Looking at the annotation of perf report for SlabFree with the patched
version I see:

  │
  │ /* push this chunk onto the head of the free list */
  │ *(MemoryChunk **) pointer = block->freehead;
 0.09 │   mov 0x10(%r8),%rax
  │ slab = block->slab;
59.15 │   mov (%r8),%rbp
  │ *(MemoryChunk **) pointer = block->freehead;
 9.43 │   mov %rax,(%rdi)
  │ block->freehead = chunk;
  │
  │ block->nfree++;

I think what that's telling me is that dereferencing the block's
memory is slow, likely due to that particular cache line not being
cached any longer. I tried running the test with 10,000 ints instead
of 500,000 so that there would be less CPU cache pressure. I see:

 29.76 │   mov (%r8),%rbp
   │ *(MemoryChunk **) pointer = block->freehead;
 12.72 │   mov %rax,(%rdi)
   │ block->freehead = chunk;
   │
   │ block->nfree++;
   │   mov 0x8(%r8),%eax
   │ block->freehead = chunk;
  4.27 │   mov %rdx,0x10(%r8)
   │ SlabBlocklistIndex():
   │ index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
   │   mov $0x1,%edx
   │ SlabFree():
   │ block->nfree++;
   │   lea 0x1(%rax),%edi
   │   mov %edi,0x8(%r8)
   │ SlabBlocklistIndex():
   │ int32   blocklist_shift = slab->blocklist_shift;
   │   mov 0x70(%rbp),%ecx
   │ index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
  8.46 │   shl %cl,%edx

Various other instructions in SlabFree are proportionally taking
longer now. For example, the bitshift at the end was insignificant
previously. That indicates to me that this is due to caching effects.
We must fetch the block in SlabFree() in both versions. It's possible
that something in SlabAlloc() is causing more useful cachelines to be
evicted, but (I think) one of the primary design goals Andres was
going for was to reduce that. For example, not having to write out the
freelist for an entire block when the block is first allocated means
no longer having to load possibly every cache line of the entire
block.
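
As a toy model of that allocation order (the struct and field names here are
simplified assumptions and do not match the committed slab.c):

#include <stddef.h>

/*
 * Toy model: chunks are handed out from the block's never-yet-used tail
 * first, and the per-block freelist only receives chunks as they are freed,
 * so a freshly allocated block never has a freelist written across all of
 * its cache lines.
 */
typedef struct ToyBlock
{
	char	   *unused;			/* start of the never-allocated region */
	int			nunused;		/* chunks remaining in that region */
	void	   *freehead;		/* head of the list of previously freed chunks */
	int			nfree;			/* total free chunks (unused + freelist) */
} ToyBlock;

static void *
toy_block_alloc(ToyBlock *block, size_t fullChunkSize)
{
	void	   *chunk;

	if (block->nunused > 0)
	{
		/* carve the next chunk out of the untouched tail of the block */
		chunk = block->unused;
		block->unused += fullChunkSize;
		block->nunused--;
	}
	else
	{
		/* reuse a freed chunk; its first bytes hold the next-free link */
		chunk = block->freehead;
		block->freehead = *(void **) chunk;
	}
	block->nfree--;
	return chunk;
}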

I tried looking at perf stat during the run.

Without slab changes:

drowley@amd3990x:~$ sudo perf stat --pid=74922 sleep 2
 Performance counter stats for process id '74922':

          2,000.74 msec task-clock                #    1.000 CPUs utilized
                 4      context-switches          #    1.999 /sec
                 0      cpu-migrations            #    0.000 /sec
           578,139      page-faults               #  288.963 K/sec
     8,614,687,392      cycles                    #    4.306 GHz                     (83.21%)
       682,574,688      stalled-cycles-frontend   #    7.92% frontend cycles idle    (83.33%)
     4,822,904,271      stalled-cycles-backend    #   55.98% backend cycles idle     (83.41%)
    11,447,124,105      instructions              #    1.33  insn per cycle
                                                  #    0.42  stalled cycles per insn (83.41%)
     1,947,647,575      branches                  #  973.464 M/sec                   (83.41%)
        13,914,897      branch-misses             #    0.71% of all branches

Re: slab allocator performance issues

2022-12-11 Thread John Naylor
On Sat, Dec 10, 2022 at 11:02 AM David Rowley  wrote:
> [v4]

Thanks for working on this!

I ran an in-situ benchmark using the v13 radix tree patchset ([1] WIP but
should be useful enough for testing allocation speed), only applying the
first five, which are local-memory only. The benchmark is not meant to
represent a realistic workload, and primarily stresses traversal and
allocation of the smallest node type. Minimum of five runs, with turbo-boost
off, on recent Intel laptop hardware:

v13-0001 to 0005:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---+-
 151123432 | 222

47.06%  postgres  postgres [.] rt_set
22.89%  postgres  postgres [.] SlabAlloc
 9.65%  postgres  postgres [.] rt_node_insert_inner.isra.0
 5.94%  postgres  [unknown][k] 0xb5e011b7
 3.62%  postgres  postgres [.] MemoryContextAlloc
 2.70%  postgres  libc.so.6[.] __memmove_avx_unaligned_erms
 2.60%  postgres  postgres [.] SlabFree

+ v4 slab:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---+-
 152463112 | 213

  52.42%  postgres  postgres  [.] rt_set
  12.80%  postgres  postgres  [.] SlabAlloc
   9.38%  postgres  postgres  [.] rt_node_insert_inner.isra.0
   7.87%  postgres  [unknown] [k] 0xb5e011b7
   4.98%  postgres  postgres  [.] SlabFree

While allocation is markedly improved, freeing looks worse here. The
proportion is surprising because only about 2% of nodes are freed during
the load, but doing that takes up 10-40% of the time compared to allocating.

num_keys = 50, height = 7
n4 = 2501016, n15 = 56932, n32 = 270, n125 = 0, n256 = 257

Sidenote: I don't recall ever seeing vsyscall (I think that's what the
0xb5e011b7 address is referring to) in a profile, so not sure what
is happening there.

[1]
https://www.postgresql.org/message-id/CAFBsxsHNE621mGuPhd7kxaGc22vMkoSu7R4JW9Zan1jjorGy3g%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-12-09 Thread David Rowley
On Mon, 5 Dec 2022 at 23:18, John Naylor  wrote:
>
>
> On Mon, Dec 5, 2022 at 3:02 PM David Rowley  wrote:

> > Going by [2], the instructions are very different with each method, so
> > other machines with different latencies on those instructions might
> > show something different. I attached what I used to test if anyone
> > else wants a go.
>
> I get about 0.1% difference on my machine.  Both ways boil down to (on gcc) 3 
> instructions with low latency. The later ones need the prior results to 
> execute, which I think is what the XXX comment "isn't great" was referring 
> to. The new coding is more mysterious (does it do the right thing on all 
> platforms?), so I guess the original is still the way to go unless we get a 
> better idea.

I don't think it would work well on a one's complement machine, but I
don't think we support those, going by the comments above RIGHTMOST_ONE
in bitmapset.c.  In any case, I found that it wasn't any faster than
what Andres wrote. In fact, even changing the code there to "index =
(freecount > 0);" seems to do very little to increase performance. I
do see that having 3 freelist items performs a decent amount better
than having 9. However, the workload being run may alter that result,
assuming that keeping new allocations on fuller blocks is a winning
strategy for the CPU's caches.

I've now done quite a bit more work on Andres' patch to try and get it
into (hopefully) something close to a committable shape.

I'm fairly happy with what's there now. It does seem to perform much
better than current master, especially when the workload would have
caused master to continually malloc and free an entire block while
freeing and allocating a single chunk.

I've done some basic benchmarking, mostly using the attached alloc_bench patch.

If I run:

select *, round(slab_result / aset_result * 100 - 100,1)::text || '%'
as slab_slower_by
from (
select
chunk_size,
keep_chunks,
sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks,
1024*1024*1024, 'slab'))::numeric(1000,3) as slab_result,
sum(pg_allocate_memory_test(chunk_size, chunk_size * keep_chunks,
1024*1024*1024, 'aset'))::numeric(1000,3) as aset_result
from

(values(1),(10),(50),(100),(200),(300),(400),(500),(1000),(2000),(3000),(4000),(5000),(1))
v1(keep_chunks),
(values(64),(128),(256),(512),(1024)) v2(chunk_size)
group by rollup(1,2)
);

The results for the 64-byte chunk are shown in the attached chart.
It's not as fast as aset, but it's much faster than master's slab.c.
The first blue bar of the chart goes well off the top of the visible
axis: master took 1118958 milliseconds for that test, while the
attached patch took 194 ms. The rest of the tests put the patched code
roughly in the middle between the unpatched code and aset.c's
performance.

The benchmark I did was entirely a FIFO workload that keeps around
"keep_chunks" at once before starting to free the oldest chunks.  I
know Tomas has some LIFO and random benchmarks. I edited that code [1]
a little to add support for other context types so that a comparison
could be done more easily, however, I'm getting very weird performance
results where sometimes it runs about twice as fast (Tomas mentioned
he got this too). I'm not seeing that with my own benchmarking
function, so I'm wondering if there's something weird going on with
the benchmark itself rather than the slab.c code.

I've likely made much more changes than I can list here, but here are
a few of the more major ones:

1. Make use of dclist for empty blocks
2. In SlabAlloc() allocate chunks from the freelist before the unused list.
3. Added output for showing information about empty blocks in the
SlabStats output.
4. Renamed the context freelists to blocklist[].  The old name was
likely just being confused with the block-level freelist, and in any
case it seemed weird to have freelist[0] store full blocks. Not much
free there! (A sketch of the blocklist bucketing appears after this
list.)
5. I did a round of changing the order of the fields in SlabBlock.
This seems to affect performance quite a bit. Having the context first
seems to improve performance. Having the blocklist[] node last also
helps.
6. Removed nblocks and headerSize from SlabContext. headerSize is no
longer needed. nblocks was only really used for Asserts and
SlabIsEmpty.  I changed the Asserts to use a local count of blocks and
changed SlabIsEmpty to look at the context's mem_allocated.
7. There's now no integer division in any of the alloc and free code.
The only time we divide by fullChunkSize is in the check function.
8. When using a block from the emptyblock list, I changed the code to
not re-init the block. It's now used as it was left previously. This
means no longer having to build the freelist again.
9. Updated all comments to reflect the current state of the code.
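
For reference, the blocklist bucketing mentioned in item 4 boils down to
something like the following (hypothetical helper name; the real computation
can be seen in the SlabBlocklistIndex() perf annotation earlier in the
thread):

/*
 * A shift chosen once at context creation maps a block's free-chunk count
 * onto a small bucket index; index 0 is only ever used for completely full
 * blocks, and no integer division is needed on the alloc/free paths.
 */
static inline int
toy_blocklist_index(int nfree, int blocklist_shift)
{
	/* equivalent to ceil(nfree / 2^blocklist_shift) */
	return (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
}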

Some things I thought about but didn't do:
a. Change the size of SlabContext's chunkSize, fullChunkSize and
blockSize to be uint32 instead of Size.  It might be possible 

Re: slab allocator performance issues

2022-12-05 Thread Robert Haas
On Fri, Sep 10, 2021 at 5:07 PM Tomas Vondra  wrote:
> Turns out it's pretty difficult to benchmark this, because the results
> strongly depend on what the backend did before.

What you report here seems to be mostly cold-cache effects, with which
I don't think we need to be overly concerned. We don't want big
regressions in the cold-cache case, but there is always going to be
some overhead when a new backend starts up, because you've got to
fault some pages into the heap/malloc arena/whatever before they can
be efficiently accessed. What would be more concerning is if we found
out that the performance depended heavily on the internal state of the
allocator. For example, suppose you have two warmup routines W1 and
W2, each of which touches the same amount of total memory, but with
different details. Then you have a benchmark B. If you do W1-B and
W2-B and the time for B varies dramatically between them, then you've
maybe got an issue. For instance, it could indicate that the allocator
has issues when the old and new allocations are very different sizes,
or something like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: slab allocator performance issues

2022-12-05 Thread John Naylor
On Mon, Dec 5, 2022 at 3:02 PM David Rowley  wrote:
>
> On Fri, 11 Nov 2022 at 22:20, John Naylor  wrote:

> > #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
>
> Doesn't this create a sort of round-robin use of the free list?  What
> we want is a sort of "histogram" bucket set of free lists so we can
> group together blocks that have a close-enough free number of chunks.
> Unless I'm mistaken, I think what you have doesn't do that.

The intent must have slipped my mind along the way.

> I wondered if simply:
>
> index = -(-freecount >> slab->freelist_shift);
>
> would be faster than Andres' version.  I tried it out and on my AMD
> machine, it's about the same speed. Same on a Raspberry Pi 4.
>
> Going by [2], the instructions are very different with each method, so
> other machines with different latencies on those instructions might
> show something different. I attached what I used to test if anyone
> else wants a go.

I get about 0.1% difference on my machine.  Both ways boil down to (on gcc)
3 instructions with low latency. The later ones need the prior results to
execute, which I think is what the XXX comment "isn't great" was referring
to. The new coding is more mysterious (does it do the right thing on all
platforms?), so I guess the original is still the way to go unless we get a
better idea.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-12-05 Thread David Rowley
On Fri, 11 Nov 2022 at 22:20, John Naylor  wrote:
>
>
> On Wed, Oct 12, 2022 at 4:37 PM David Rowley  wrote:
> > [v3]
>
> + /*
> + * Compute a shift that guarantees that shifting chunksPerBlock with it
> + * yields is smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used for 
> full blocks).
> + */
> + slab->freelist_shift = 0;
> + while ((slab->chunksPerBlock >> slab->freelist_shift) >= 
> (SLAB_FREELIST_COUNT - 1))
> + slab->freelist_shift++;
>
> + /*
> + * Ensure, without a branch, that index 0 is only used for blocks entirely
> + * without free chunks.
> + * XXX: There probably is a cheaper way to do this. Needing to shift twice
> + * by slab->freelist_shift isn't great.
> + */
> + index = (freecount + (1 << slab->freelist_shift) - 1) >> 
> slab->freelist_shift;
>
> How about something like
>
> #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

Doesn't this create a sort of round-robin use of the free list?  What
we want is a sort of "histogram" bucket set of free lists so we can
group together blocks that have a close-enough free number of chunks.
Unless I'm mistaken, I think what you have doesn't do that.
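
To make the round-robin behaviour concrete, here is a tiny standalone
program (not part of any patch) that prints the index the proposed formula
assigns to each freecount:

#include <stdio.h>

#define SLAB_FREELIST_COUNT ((1 << 3) + 1)	/* 9, as in the quoted proposal */

int
main(void)
{
	for (int freecount = 0; freecount <= 24; freecount++)
	{
		int			index = (freecount & (SLAB_FREELIST_COUNT - 2)) +
							(freecount != 0);

		printf("freecount = %2d -> index = %d\n", freecount, index);
	}
	return 0;
}

Freecounts 8, 16 and 24 all wrap back to index 1, and freecounts 1 and 9
share index 2, so blocks with very different amounts of free space land on
the same list instead of being grouped by how full they are.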

I wondered if simply:

index = -(-freecount >> slab->freelist_shift);

would be faster than Andres' version.  I tried it out and on my AMD
machine, it's about the same speed. Same on a Raspberry Pi 4.

Going by [2], the instructions are very different with each method, so
other machines with different latencies on those instructions might
show something different. I attached what I used to test if anyone
else wants a go.

AMD Zen2
$ ./freecount 20
Test 'a' in 0.922766 seconds
Test 'd' in 0.922762 seconds (0.000433% faster)

RPI4
$ ./freecount 20
Test 'a' in 3.341350 seconds
Test 'd' in 3.338690 seconds (0.079672% faster)

That was gcc. Trying it with clang, it went in a little heavy-handed
and optimized out my loop, so some more trickery might be needed for a
useful test on that compiler.

David

[2] https://godbolt.org/z/dh95TohEG

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLAB_FREELIST_COUNT 9

int main(int argc, char **argv)
{
	clock_t start, end;
	double v1_time, v2_time;
	int freecount;
	int freelist_shift = 0;
	int chunksPerBlock;
	int index;
	int zerocount = 0;

	if (argc < 2)
	{
		printf("Syntax: %s <chunksPerBlock>\n", argv[0]);
		return -1;
	}

	chunksPerBlock = atoi(argv[1]);
	printf("chunksPerBlock = %d\n", chunksPerBlock);

	while ((chunksPerBlock >> freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
		freelist_shift++;
	printf("freelist_shift = %d\n", freelist_shift);

	start = clock();

	for (freecount = 0; freecount <= chunksPerBlock; freecount++)
	{
		index = (freecount + (1 << freelist_shift) - 1) >> freelist_shift;

		/* try to prevent optimizing the above out */
		if (index == 0)
			zerocount++;
	}

	end = clock();
	v1_time = (double) (end - start) / CLOCKS_PER_SEC;
	printf("Test 'a' in %f seconds\n", v1_time);
	printf("zerocount = %d\n", zerocount);

	zerocount = 0;
	start = clock();

	for (freecount = 0; freecount <= chunksPerBlock; freecount++)
	{
		index = -(-freecount >> freelist_shift);

		/* try to prevent optimizing the above out */
		if (index == 0)
			zerocount++;
	}

	end = clock();
	v2_time = (double) (end - start) / CLOCKS_PER_SEC;
	printf("Test 'd' in %f seconds (%f%% faster)\n", v2_time,
		   v1_time / v2_time * 100.0 - 100.0);
	printf("zerocount = %d\n", zerocount);
	return 0;
}

Re: slab allocator performance issues

2022-11-11 Thread John Naylor
On Wed, Oct 12, 2022 at 4:37 PM David Rowley  wrote:
> [v3]

+ /*
+ * Compute a shift that guarantees that shifting chunksPerBlock with it
+ * yields is smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used
for full blocks).
+ */
+ slab->freelist_shift = 0;
+ while ((slab->chunksPerBlock >> slab->freelist_shift) >=
(SLAB_FREELIST_COUNT - 1))
+ slab->freelist_shift++;

+ /*
+ * Ensure, without a branch, that index 0 is only used for blocks entirely
+ * without free chunks.
+ * XXX: There probably is a cheaper way to do this. Needing to shift twice
+ * by slab->freelist_shift isn't great.
+ */
+ index = (freecount + (1 << slab->freelist_shift) - 1) >>
slab->freelist_shift;

How about something like

#define SLAB_FREELIST_COUNT ((1<<3) + 1)
index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

and dispense with both freelist_shift and the loop that computes it?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-10-12 Thread David Rowley
On Sat, 11 Sept 2021 at 09:07, Tomas Vondra  wrote:
> I've been investigating the regressions in some of the benchmark
> results, together with the generation context benchmarks [1].

I've not looked into the regression you found with this yet, but I did
rebase the patch.  slab.c has seen quite a number of changes recently.

I didn't spend a lot of time checking over the patch. I mainly wanted
to see what the performance was like before reviewing in too much
detail.

To test the performance, I used [1] and ran:

select pg_allocate_memory_test(<chunk_size>, 1024*1024,
10::bigint*1024*1024*1024, 'slab');

that basically allocates chunks of <chunk_size> and keeps around 1MB of
them at a time, allocating a total of 10GB of them.
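
As a toy model in plain C (this is not the pg_allocate_memory_test code;
malloc/free merely stand in for the context allocations), the pattern is:

#include <stdlib.h>

/*
 * Toy FIFO benchmark: a ring of "keep" live chunks is maintained; once the
 * ring is full, the oldest chunk is freed just before each new allocation,
 * until total_bytes have been allocated overall.
 */
static void
toy_fifo_bench(size_t chunk_size, size_t keep, size_t total_bytes)
{
	void	  **ring = calloc(keep, sizeof(void *));
	size_t		allocated = 0;
	size_t		pos = 0;

	while (allocated < total_bytes)
	{
		if (ring[pos] != NULL)
			free(ring[pos]);	/* free the oldest chunk */
		ring[pos] = malloc(chunk_size);
		allocated += chunk_size;
		pos = (pos + 1) % keep;
	}

	for (pos = 0; pos < keep; pos++)
		free(ring[pos]);
	free(ring);
}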

I saw:

Master:
16 byte chunk = 8754.678 ms
32 byte chunk = 4511.725 ms
64 byte chunk = 2244.885 ms
128 byte chunk = 1135.349 ms
256  byte chunk = 548.030 ms
512 byte chunk = 272.017 ms
1024 byte chunk = 144.618 ms

Master + attached patch:
16 byte chunk = 5255.974 ms
32 byte chunk = 2640.807 ms
64 byte chunk = 1328.949 ms
128 byte chunk = 668.078 ms
256 byte chunk = 330.564 ms
512 byte chunk = 166.844 ms
1024 byte chunk = 85.399 ms

So the patched version runs in about 60% of the time that master takes.

I plan to look at the patch in a bit more detail and see if I can
recreate and figure out the regression that Tomas reported. For now, I
just want to share the rebased patch.

The only thing I really adjusted from Andres' version is the block
freelist linkage: instead of storing pointers to the next free chunk,
I store the offset in bytes from the start of the block to that chunk.
This means we can use 4 bytes instead of 8 bytes for each link.  The
block size is limited to 1GB now anyway, so 32 bits is large enough
for these offsets.
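
A toy model of that linkage (field names here are assumptions, not the
patch's):

#include <stddef.h>
#include <stdint.h>

/*
 * Each freed chunk stores the 32-bit offset, from the start of its block,
 * of the next free chunk, instead of an 8-byte pointer.
 */
#define TOY_OFFSET_INVALID UINT32_MAX

typedef struct ToyOffsetBlock
{
	uint32_t	freehead;		/* offset of first free chunk, or INVALID */
	/* chunk storage follows the block header */
} ToyOffsetBlock;

static void
toy_push_free(ToyOffsetBlock *block, void *chunk)
{
	/* the freed chunk's first 4 bytes hold the old list head */
	*(uint32_t *) chunk = block->freehead;
	block->freehead = (uint32_t) ((char *) chunk - (char *) block);
}

static void *
toy_pop_free(ToyOffsetBlock *block)
{
	char	   *chunk;

	if (block->freehead == TOY_OFFSET_INVALID)
		return NULL;
	chunk = (char *) block + block->freehead;
	block->freehead = *(uint32_t *) chunk;
	return chunk;
}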

David

[1] 
https://www.postgresql.org/message-id/attachment/137056/allocate_performance_functions.patch.txt
From 2b89d993b0294d5c0fe0a8333bccc555337ac979 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Wed, 12 Oct 2022 09:30:24 +1300
Subject: [PATCH v3] WIP: slab performance.

Author: Andres Freund
Discussion: 
https://postgr.es/m/20210717194333.mr5io3zup3kxa...@alap3.anarazel.de
---
 src/backend/utils/mmgr/slab.c | 449 ++
 1 file changed, 294 insertions(+), 155 deletions(-)

diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index 1a0b28f9ea..58b8e2d67c 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -23,15 +23,18 @@
  * global (context) level. This is possible as the chunk size (and thus 
also
  * the number of chunks per block) is fixed.
  *
- * On each block, free chunks are tracked in a simple linked list. Contents
- * of free chunks is replaced with an index of the next free chunk, forming
- * a very simple linked list. Each block also contains a counter of free
- * chunks. Combined with the local block-level freelist, it makes it 
trivial
- * to eventually free the whole block.
+ * On each block, never allocated chunks are tracked by a simple offset, 
and
+ * free chunks are tracked in a simple linked list. The offset approach
+ * avoids needing to iterate over all chunks when allocating a new block,
+ * which would cause page faults and cache pollution. Contents of free 
chunks
+ * is replaced with a the offset (in bytes) from the block pointer to the
+ * next free chunk, forming a very simple linked list. Each block also
+ * contains a counter of free chunks. Combined with the local block-level
+ * freelist, it makes it trivial to eventually free the whole block.
  *
  * At the context level, we use 'freelist' to track blocks ordered by 
number
  * of free chunks, starting with blocks having a single allocated chunk, 
and
- * with completely full blocks on the tail.
+ * with completely full blocks on the tail. XXX
  *
  * This also allows various optimizations - for example when searching for
  * free chunk, the allocator reuses space from the fullest blocks first, in
@@ -44,7 +47,7 @@
  * case this performs as if the pointer was not maintained.
  *
  * We cache the freelist index for the blocks with the fewest free chunks
- * (minFreeChunks), so that we don't have to search the freelist on every
+ * (minFreeChunkIndex), so that we don't have to search the freelist on 
every
  * SlabAlloc() call, which is quite expensive.
  *
  *-
@@ -60,6 +63,17 @@
 
 #define Slab_BLOCKHDRSZMAXALIGN(sizeof(SlabBlock))
 
+struct SlabBlock;
+struct SlabChunk;
+
+/*
+ * Number of actual freelists + 1, for full blocks. Full blocks are always at
+ * offset 0.
+ */
+#define SLAB_FREELIST_COUNT 9
+
+#define SLAB_RETAIN_EMPTY_BLOCK_COUNT 10
+
 /*
  * SlabContext is a specialized implementation of MemoryContext.
  */
@@ -72,13 +86,16 @@ typedef struct SlabContext
SizeblockSize;  /* block size */

Re: slab allocator performance issues

2021-09-10 Thread Tomas Vondra

Hi,

I've been investigating the regressions in some of the benchmark 
results, together with the generation context benchmarks [1].


Turns out it's pretty difficult to benchmark this, because the results 
strongly depend on what the backend did before. For example if I run 
slab_bench_fifo with the "decreasing" test for 32kB blocks and 512B 
chunks, I get this:


  select * from slab_bench_fifo(100, 32768, 512, 100, 1, 5000);

   mem_allocated | alloc_ms | free_ms
  ---+--+-
   528547840 |   155394 |   87440


i.e. palloc() takes ~155ms and pfree() ~87ms (and these results are 
stable, the numbers don't change much with more runs).


But if I run a set of "lifo" tests in the backend first, the results 
look like this:


   mem_allocated | alloc_ms | free_ms
  ---+--+-
   528547840 |41728 |   71524
  (1 row)

so the pallocs are suddenly about ~4x faster. Clearly, what the backend 
did before may have pretty dramatic impact on results, even for simple 
benchmarks like this.


Note: The benchmark was a single SQL script, running all the different 
workloads in the same backend.


I did a fair amount of perf profiling, and the main difference between 
the slow and fast runs seems to be this:


                 0      page-faults:u
                 0      minor-faults:u
                 0      major-faults:u

vs

        20,634,153      page-faults:u
        20,634,153      minor-faults:u
                 0      major-faults:u



Attached is a more complete perf stat output, but the page faults seem 
to be the main issue. My theory is that in the "fast" case, the past 
backend activity puts the glibc memory management into a state that 
prevents page faults in the benchmark.


But of course, this theory may be incomplete - for example it's not 
clear why running the benchmark repeatedly would not "condition" the 
backend the same way. But it doesn't - it's ~150ms even for repeated runs.


Secondly, I'm not sure this explains why some of the timings actually 
got much slower with the 0003 patch, when the sequence of the steps is 
still the same. Of course, it's possible 0003 changes the allocation 
pattern a bit, interfering with glibc memory management.


This leads to a couple of interesting questions, I think:

1) I've only tested this on Linux, with glibc. I wonder how it'd behave 
on other platforms, or with other allocators.


2) Which cases are more important? When the backend was warmed up, or 
when each benchmark runs in a new backend? It seems the "new backend" is 
something like a "worst case" leading to more page faults, so maybe 
that's the thing to watch. OTOH it's unlikely to have a completely new 
backend, so maybe not.


3) Can this teach us something about how to allocate stuff, to better 
"prepare" the backend for future allocations? For example, it's a bit 
strange that repeated runs of the same benchmark don't do the trick, for 
some reason.




regards


[1] 
https://www.postgresql.org/message-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com


--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
 Performance counter stats for process id '11829':

   219,869,739,737      cycles:u
   119,722,280,436      instructions:u            #    0.54  insn per cycle
                                                  #    1.56  stalled cycles per insn
     7,784,566,858      cache-references:u
     2,487,257,287      cache-misses:u            #   31.951 % of all cache refs
     5,942,054,520      bus-cycles:u
                 0      page-faults:u
                 0      minor-faults:u
                 0      major-faults:u
   187,181,661,719      stalled-cycles-frontend:u #   85.13% frontend cycles idle
   144,274,017,071      stalled-cycles-backend:u  #   65.62% backend cycles idle

      60.000876248 seconds time elapsed



 Performance counter stats for process id '11886':

   145,093,090,692      cycles:u
    74,986,543,212      instructions:u            #    0.52  insn per cycle
                                                  #    2.63  stalled cycles per insn
     4,753,764,781      cache-references:u
     1,342,653,549      cache-misses:u            #   28.244 % of all cache refs
     3,925,175,515      bus-cycles:u
        20,634,153      page-faults:u
        20,634,153      minor-faults:u

Re: slab allocator performance issues

2021-08-01 Thread Tomas Vondra



On 8/1/21 11:07 PM, Andres Freund wrote:

Hi,

On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:

In the attached .ods file with results, the "comparison" sheets are the
interesting ones - the last couple columns compare the main metrics for the
two patches (labeled patch-1 and patch-2) to master.


I assume with patch-1/2 you mean the ones after the benchmark patch
itself?



Yes, those are the two WIP patches you shared on 19/7.




Overall, the results look quite good - patch-1 is mostly on par with master,
with maybe 5% variability in both directions. That's expected, considering
the patch does not aim to improve performance.


Not for slab anyway...



Maybe the hot/cold separation could have some effect, but probably not 
for the workloads I've tested.





The second patch brings some nice improvements - 30%-50% in most cases (for
both allocation and free) seems pretty nice. But for the "increase" FIFO
pattern (incrementally allocating/freeing more memory) there's a significant
regression - particularly for the allocation time. In some cases (larger
chunks, block size does not matter too much) it jumps from 25ms to almost
200ms.


I'm not surprised to see some memory usage increase some, but that
degree of time overhead does surprise me. ISTM there's something wrong.



Yeah, the higher amount of allocated memory is due to the couple fields 
added to the SlabBlock struct, but even that only affects a single case 
with 480B chunks and 1kB blocks. Seems fine to me, especially if we end 
up growing the slab blocks.


Not sure about the allocation time, though.


It'd probably worth benchmarking the different improvements inside the
WIP: slab performance. patch. There's some that I'd expect to be all
around improvements, whereas others likely aren't quite that clear
cut. I assume you'd prefer that I split the patch up?



Yeah, if you split that patch into sensible parts, I'll benchmark those. 
Also, we can add more interesting workloads if you have some ideas.





This seems unfortunate - the allocation pattern (FIFO, allocating more
memory over time) seems pretty common, and the slowdown is significant.


Did you analyze what causes the regressions?



No, not yet. I'll run the same set of benchmarks for the Generation, 
discussed in the other thread, and then I'll investigate this. But if 
you split the patch, that'd probably help.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: slab allocator performance issues

2021-08-01 Thread Andres Freund
Hi,

On 2021-08-01 19:59:18 +0200, Tomas Vondra wrote:
> In the attached .ods file with results, the "comparison" sheets are the
> interesting ones - the last couple columns compare the main metrics for the
> two patches (labeled patch-1 and patch-2) to master.

I assume with patch-1/2 you mean the ones after the benchmark patch
itself?


> Overall, the results look quite good - patch-1 is mostly on par with master,
> with maybe 5% variability in both directions. That's expected, considering
> the patch does not aim to improve performance.

Not for slab anyway...


> The second patch brings some nice improvements - 30%-50% in most cases (for
> both allocation and free) seems pretty nice. But for the "increase" FIFO
> pattern (incrementally allocating/freeing more memory) there's a significant
> regression - particularly for the allocation time. In some cases (larger
> chunks, block size does not matter too much) it jumps from 25ms to almost
> 200ms.

I'm not surprised to see some memory usage increase some, but that
degree of time overhead does surprise me. ISTM there's something wrong.

It'd probably worth benchmarking the different improvements inside the
WIP: slab performance. patch. There's some that I'd expect to be all
around improvements, whereas others likely aren't quite that clear
cut. I assume you'd prefer that I split the patch up?


> This seems unfortunate - the allocation pattern (FIFO, allocating more
> memory over time) seems pretty common, and the slowdown is significant.

Did you analyze what causes the regressions?

Greetings,

Andres Freund




Re: slab allocator performance issues

2021-07-20 Thread Ranier Vilela
On Tue, 20 Jul 2021 at 11:15, David Rowley  wrote:

> On Tue, 20 Jul 2021 at 10:04, Ranier Vilela  wrote:
> > Perhaps you would agree with me that in the most absolute of times,
> > malloc will not fail.
> > So it makes more sense to test:
> > if (ret != NULL)
> > than
> > if (ret == NULL)
>
> I think it'd be better to use unlikely() for that.
>
Sure, it can be, but in this case, there is no way to reduce the scope.

regards,
Ranier Vilela


Re: slab allocator performance issues

2021-07-20 Thread David Rowley
On Tue, 20 Jul 2021 at 10:04, Ranier Vilela  wrote:
> Perhaps you would agree with me that in the most absolute of times, malloc 
> will not fail.
> So it makes more sense to test:
> if (ret != NULL)
> than
> if (ret == NULL)

I think it'd be better to use unlikely() for that.

David




Re: slab allocator performance issues

2021-07-19 Thread Ranier Vilela
On Mon, 19 Jul 2021 at 17:56, Andres Freund  wrote:

> Hi,
>
> On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
> > Sounds great! Thanks for investigating this and for the improvements.
> >
> > It might be good to do some experiments to see how the changes affect
> memory
> > consumption for practical workloads. I'm willing to spend soem time on
> that,
> > if needed.
>
> I've attached my changes. They're in a rough shape right now, but I
> think good enough for an initial look.
>
Hi Andres, I took a look.

Perhaps you would agree with me that in the most absolute of times, malloc
will not fail.
So it makes more sense to test:
if (ret != NULL)
than
if (ret == NULL)

That might help branch prediction.
This change also makes it possible to reduce the scope of some variables.

Example:
+static void * pg_noinline
+AllocSetAllocLarge(AllocSet set, Size size, int flags)
+{
+	AllocBlock	block;
+	Size		chunk_size;
+	Size		blksize;
+
+	/* check size, only allocation path where the limits could be hit */
+	MemoryContextCheckSize(&set->header, size, flags);
+
+	AssertArg(AllocSetIsValid(set));
+
+	chunk_size = MAXALIGN(size);
+	blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+	block = (AllocBlock) malloc(blksize);
+	if (block != NULL)
+	{
+		AllocChunk	chunk;
+
+		set->header.mem_allocated += blksize;
+
+		block->aset = set;
+		block->freeptr = block->endptr = ((char *) block) + blksize;
+
+		/*
+		 * Stick the new block underneath the active allocation block, if any,
+		 * so that we don't lose the use of the space remaining therein.
+		 */
+		if (set->blocks != NULL)
+		{
+			block->prev = set->blocks;
+			block->next = set->blocks->next;
+			if (block->next)
+				block->next->prev = block;
+			set->blocks->next = block;
+		}
+		else
+		{
+			block->prev = NULL;
+			block->next = NULL;
+			set->blocks = block;
+		}
+
+		chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
+		chunk->size = chunk_size;
+
+		return AllocSetAllocReturnChunk(set, size, chunk, chunk_size);
+	}
+
+	return NULL;
+}

regards,
Ranier Vilela


Re: slab allocator performance issues

2021-07-19 Thread Andres Freund
Hi,

On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
> Sounds great! Thanks for investigating this and for the improvements.
> 
> It might be good to do some experiments to see how the changes affect memory
> consumption for practical workloads. I'm willing to spend soem time on that,
> if needed.

I've attached my changes. They're in a rough shape right now, but I
think good enough for an initial look.

Greetings,

Andres Freund
>From af4cd1f0b64cd52d7eab342493e3dfd6b0d8388e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 19 Jul 2021 12:55:51 -0700
Subject: [PATCH 1/2] WIP: optimize allocations by separating hot from cold
 paths.

---
 src/include/nodes/memnodes.h|   4 +-
 src/include/utils/memutils.h|  13 +
 src/backend/utils/mmgr/aset.c   | 537 ++--
 src/backend/utils/mmgr/generation.c |  22 +-
 src/backend/utils/mmgr/mcxt.c   | 179 +++---
 src/backend/utils/mmgr/slab.c   |  14 +-
 6 files changed, 354 insertions(+), 415 deletions(-)

diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index e6a757d6a07..8a42d2ff999 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -57,10 +57,10 @@ typedef void (*MemoryStatsPrintFunc) (MemoryContext context, void *passthru,
 
 typedef struct MemoryContextMethods
 {
-	void	   *(*alloc) (MemoryContext context, Size size);
+	void	   *(*alloc) (MemoryContext context, Size size, int flags);
 	/* call this free_p in case someone #define's free() */
 	void		(*free_p) (MemoryContext context, void *pointer);
-	void	   *(*realloc) (MemoryContext context, void *pointer, Size size);
+	void	   *(*realloc) (MemoryContext context, void *pointer, Size size, int flags);
 	void		(*reset) (MemoryContext context);
 	void		(*delete_context) (MemoryContext context);
 	Size		(*get_chunk_space) (MemoryContext context, void *pointer);
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ff872274d44..2f75b4cca46 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -147,6 +147,19 @@ extern void MemoryContextCreate(MemoryContext node,
 
 extern void HandleLogMemoryContextInterrupt(void);
 extern void ProcessLogMemoryContextInterrupt(void);
+extern void *MemoryContextAllocationFailure(MemoryContext context, Size size, int flags);
+
+extern void MemoryContextSizeFailure(MemoryContext context, Size size, int flags) pg_attribute_noreturn();
+
+static inline void
+MemoryContextCheckSize(MemoryContext context, Size size, int flags)
+{
+	if (unlikely(!AllocSizeIsValid(size)))
+	{
+		if (!(flags & MCXT_ALLOC_HUGE) || !AllocHugeSizeIsValid(size))
+			MemoryContextSizeFailure(context, size, flags);
+	}
+}
 
 /*
  * Memory-context-type-specific functions
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 77872e77bcd..00878354392 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -263,9 +263,9 @@ static AllocSetFreeList context_freelists[2] =
 /*
  * These functions implement the MemoryContext API for AllocSet contexts.
  */
-static void *AllocSetAlloc(MemoryContext context, Size size);
+static void *AllocSetAlloc(MemoryContext context, Size size, int flags);
 static void AllocSetFree(MemoryContext context, void *pointer);
-static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
+static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size, int flags);
 static void AllocSetReset(MemoryContext context);
 static void AllocSetDelete(MemoryContext context);
 static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
@@ -704,266 +704,10 @@ AllocSetDelete(MemoryContext context)
 	free(set);
 }
 
-/*
- * AllocSetAlloc
- *		Returns pointer to allocated memory of given size or NULL if
- *		request could not be completed; memory is added to the set.
- *
- * No request may exceed:
- *		MAXALIGN_DOWN(SIZE_MAX) - ALLOC_BLOCKHDRSZ - ALLOC_CHUNKHDRSZ
- * All callers use a much-lower limit.
- *
- * Note: when using valgrind, it doesn't matter how the returned allocation
- * is marked, as mcxt.c will set it to UNDEFINED.  In some paths we will
- * return space that is marked NOACCESS - AllocSetRealloc has to beware!
- */
-static void *
-AllocSetAlloc(MemoryContext context, Size size)
+static inline void *
+AllocSetAllocReturnChunk(AllocSet set, Size size, AllocChunk chunk, Size chunk_size)
 {
-	AllocSet	set = (AllocSet) context;
-	AllocBlock	block;
-	AllocChunk	chunk;
-	int			fidx;
-	Size		chunk_size;
-	Size		blksize;
-
-	AssertArg(AllocSetIsValid(set));
-
-	/*
-	 * If requested size exceeds maximum for chunks, allocate an entire block
-	 * for this request.
-	 */
-	if (size > set->allocChunkLimit)
-	{
-		chunk_size = MAXALIGN(size);
-		blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
-		block = (AllocBlock) malloc(blksize);
-		if (block == NULL)
-			return NULL;
-
-		context->mem_allocated += blksize;
-
-		block->aset = set;
-		

Re: slab allocator performance issues

2021-07-18 Thread Tomas Vondra

On 7/18/21 3:06 AM, Andres Freund wrote:

Hi,

On 2021-07-17 16:10:19 -0700, Andres Freund wrote:

Instead of populating a linked list with all chunks upon creation of a block -
which requires touching a fair bit of memory - keep a per-block pointer (or an
offset) into "unused" area of the block. When allocating from the block and
theres still "unused" memory left, use that, instead of bothering with the
freelist.

I tried that, and it nearly got slab up to the allocation/freeing performance
of aset.c (while winning after allocation, due to the higher memory density).


Combining that with limiting the number of freelists, and some
microoptimizations, allocation performance is now on par.

Freeing still seems to be a tad slower, mostly because SlabFree()
practically is immediately stalled on fetching the block, whereas
AllocSetFree() can happily speculate ahead and do work like computing
the freelist index. And then aset only needs to access memory inside the
context - which is much more likely to be in cache than a freelist
inside a block (there are many more).

But that's ok, I think. It's close and it's only a small share of the
overall runtime of my workload...



Sounds great! Thanks for investigating this and for the improvements.

It might be good to do some experiments to see how the changes affect 
memory consumption for practical workloads. I'm willing to spend soem 
time on that, if needed.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: slab allocator performance issues

2021-07-17 Thread Andres Freund
Hi,

On 2021-07-17 16:10:19 -0700, Andres Freund wrote:
> Instead of populating a linked list with all chunks upon creation of a block -
> which requires touching a fair bit of memory - keep a per-block pointer (or an
> offset) into "unused" area of the block. When allocating from the block and
> theres still "unused" memory left, use that, instead of bothering with the
> freelist.
> 
> I tried that, and it nearly got slab up to the allocation/freeing performance
> of aset.c (while winning after allocation, due to the higher memory density).

Combining that with limiting the number of freelists, and some
microoptimizations, allocation performance is now on par.

Freeing still seems to be a tad slower, mostly because SlabFree()
practically is immediately stalled on fetching the block, whereas
AllocSetFree() can happily speculate ahead and do work like computing
the freelist index. And then aset only needs to access memory inside the
context - which is much more likely to be in cache than a freelist
inside a block (there are many more).

But that's ok, I think. It's close and it's only a small share of the
overall runtime of my workload...

Greetings,

Andres Freund




Re: slab allocator performance issues

2021-07-17 Thread Andres Freund
Hi,

On 2021-07-18 00:46:09 +0200, Tomas Vondra wrote:
> On 7/17/21 11:14 PM, Andres Freund wrote:
> > Hm. I wonder if we should just not populate the freelist eagerly, to
> > drive down the initialization cost. I.e. have a separate allocation path
> > for chunks that have never been allocated, by having a
> > SlabBlock->free_offset or such.
> > 
> > Sure, it adds a branch to the allocation happy path, but it also makes the
> > first allocation for a chunk cheaper, because there's no need to get the 
> > next
> > element from the freelist (adding a likely cache miss). And it should make 
> > the
> > allocation of a new block faster by a lot.
> > 
> 
> Not sure what you mean by 'not populate eagerly' so can't comment :-(

Instead of populating a linked list with all chunks upon creation of a block -
which requires touching a fair bit of memory - keep a per-block pointer (or an
offset) into "unused" area of the block. When allocating from the block and
there's still "unused" memory left, use that, instead of bothering with the
freelist.

I tried that, and it nearly got slab up to the allocation/freeing performance
of aset.c (while winning after allocation, due to the higher memory density).


> > > > 4) Less of a performance, and more of a usability issue: The constant
> > > > block size strikes me as problematic. Most users of an allocator can
> > > > sometimes be used with a small amount of data, and sometimes with a
> > > > large amount.
> > > > 
> > > 
> > > I doubt this is worth the effort, really. The constant block size makes
> > > various places much simpler (both to code and reason about), so this 
> > > should
> > > not make a huge difference in performance. And IMHO the block size is 
> > > mostly
> > > an implementation detail, so I don't see that as a usability issue.
> > 
> > Hm? It's something the user has to specify, so I it's not really an
> > implementation detail. It needs to be specified without sufficient
> > information, as well, since externally one doesn't know how much memory the
> > block header and chunk headers + rounding up will use, so computing a good
> > block size isn't easy. I've wondered whether it should just be a count...
> > 
> 
> I think this is mixing two problems - how to specify the block size, and
> whether the block size is constant (as in slab) or grows over time (as in
> allocset).

That was in response to the "implementation detail" bit solely.


> But growing the block size seems problematic for long-lived contexts with
> workloads that change a lot over time - imagine e.g. decoding many small
> transactions, with one huge transaction mixed in. The one huge transaction
> will grow the block size, and we'll keep using it forever. But in that case
> we might have just as well allocate the large blocks from the start, I
> guess.

I was thinking of capping the growth fairly low. I don't think after a 16x
growth or so you're likely to still see allocation performance gains with
slab. And I don't think that'd be too bad for decoding - we'd start with a
small initial block size, and in many workloads that will be enough, and just
workloads where that doesn't suffice will adapt performance wise.  And: Medium
term I wouldn't expect reorderbuffer.c to stay the only slab.c user...
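
In other words, something like the following purely illustrative sketch (no
such code has been posted):

#include <stddef.h>

/*
 * Double the block size for each newly allocated block, but cap the growth
 * at 16x the initial size, as discussed above.
 */
static size_t
toy_next_block_size(size_t init_size, size_t cur_size)
{
	size_t		next = cur_size * 2;
	size_t		cap = init_size * 16;

	return (next > cap) ? cap : next;
}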


> > Why do you not think it's relevant for performance? Either one causes too 
> > much
> > memory usage by using a too large block size, wasting memory, or one ends up
> > loosing perf through frequent allocations?
>
> True. I simply would not expect this to make a huge difference - I may be
> wrong, and I'm sure there are workloads where it matters. But I still think
> it's easier to just use larger blocks than to make the slab code more
> complex.

IDK. I'm looking at using slab as part of a radix tree implementation right
now. Which I'd hope to be used in various different situations. So it's hard
to choose the right block size - and it does seem to matter for performance.

Greetings,

Andres Freund




Re: slab allocator performance issues

2021-07-17 Thread Tomas Vondra




On 7/17/21 11:14 PM, Andres Freund wrote:

Hi,

On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:

On 7/17/21 9:43 PM, Andres Freund wrote:

1) If allocations are short-lived slab.c, can end up constantly
freeing/initializing blocks. Which requires fairly expensively iterating over
all potential chunks in the block and initializing it. Just to then free that
memory again after a small number of allocations. The extreme case of this is
when there are phases of alloc/free of a single allocation.

I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
only works if the problem is with the only, so it's not a great
approach. Perhaps just keeping the last allocated block around would work?



+1

I think it makes perfect sense to not free the blocks immediately, and keep
one (or a small number) as a cache. I'm not sure why we decided not to have
a "keeper" block, but I suspect memory consumption was my main concern at
that point. But I never expected the cost to be this high.


I think one free block might be too low in some cases. It's pretty
common to have workloads where the number of allocations is "bursty",
and it's imo one case where one might justifiably want to use a slab
allocator... Perhaps a portion of a high watermark? Or a portion of the
in use blocks?



I think the portion of watermark would be problematic for cases with one 
huge transaction - that'll set a high watermark, and we'll keep way too 
many free blocks. But the portion of in use blocks might work, I think.



Hm. I wonder if we should just not populate the freelist eagerly, to
drive down the initialization cost. I.e. have a separate allocation path
for chunks that have never been allocated, by having a
SlabBlock->free_offset or such.

Sure, it adds a branch to the allocation happy path, but it also makes the
first allocation for a chunk cheaper, because there's no need to get the next
element from the freelist (adding a likely cache miss). And it should make the
allocation of a new block faster by a lot.



Not sure what you mean by 'not populate eagerly' so can't comment :-(




2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
still slow enough there to be very worrisome.

I don't see a way to get around the division while keeping the freelist
structure as is. But:

ISTM that we only need the index because of the free-chunk list, right? Why
don't we make the chunk list use actual pointers? Is it concern that that'd
increase the minimum allocation size?  If so, I see two ways around that:
First, we could make the index just the offset from the start of the block,
that's much cheaper to calculate. Second, we could store the next pointer in
SlabChunk->slab/block instead (making it a union) - while on the freelist we
don't need to dereference those, right?

I suspect both would also make the block initialization a bit cheaper.

That should also accelerate SlabBlockGetChunk(), which currently shows up as
an imul, which isn't exactly fast either (and uses a lot of execution ports).



Hmm, I think you're right we could simply use the pointers, but I have not
tried that.


I quickly tried that, and it does seem to improve matters considerably. The
block initialization still shows up as expensive, but not as bad. The div and
imul are gone (exept in an assertion build right now). The list manipulation
still is visible.



Understood. I didn't expect this to be a full solution.




3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
shows up prominently in profiles. That's ~tripling the number of cachelines
touched in the happy path, with unpredictable accesses to boot.

Perhaps we could reduce the precision of slab->freelist indexing to amortize
that cost? I.e. not have one slab->freelist entry for each nfree, but instead
have an upper limit on the number of freelists?



Yeah. The purpose of organizing the freelists like this is to prioritize the
"more full" blocks" when allocating new chunks, in the hope that the "less
full" blocks will end up empty and freed faster.

But this is naturally imprecise, and strongly depends on the workload, of
course, and I bet for most cases a less precise approach would work just as
fine.

I'm not sure how exactly would the upper limit you propose work, but perhaps
we could group the blocks for nfree ranges, say [0-15], [16-31] and so on.
So after the alloc/free we'd calculate the new freelist index as (nfree/16)
and only moved the block if the index changed. This would reduce the
overhead to 1/16 and probably even more in practice.
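
A rough sketch of that idea (hypothetical struct and field names, not
slab.c's):

#include "postgres.h"
#include "lib/ilist.h"

/*
 * The freelist index is nfree/16, and the block is relinked only when that
 * coarse index changes, so at most 1/16th of alloc/free calls touch the
 * context-level lists.
 */
typedef struct ToySlab
{
	dlist_head	freelist[32];	/* blocks bucketed by nfree / 16 */
} ToySlab;

typedef struct ToySlabBlock
{
	dlist_node	node;			/* links the block into one bucket */
	int			nfree;			/* current number of free chunks */
} ToySlabBlock;

static void
toy_update_bucket(ToySlab *slab, ToySlabBlock *block, int new_nfree)
{
	int			old_idx = block->nfree / 16;
	int			new_idx = new_nfree / 16;

	block->nfree = new_nfree;
	if (old_idx != new_idx)
	{
		dlist_delete(&block->node);
		dlist_push_head(&slab->freelist[new_idx], &block->node);
	}
}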



Of course, we could also say we have e.g. 8 freelists and work the ranges
backwards from that, I 

Re: slab allocator performance issues

2021-07-17 Thread Andres Freund
Hi,

On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
> On 7/17/21 9:43 PM, Andres Freund wrote:
> > 1) If allocations are short-lived slab.c, can end up constantly
> > freeing/initializing blocks. Which requires fairly expensively iterating 
> > over
> > all potential chunks in the block and initializing it. Just to then free 
> > that
> > memory again after a small number of allocations. The extreme case of this 
> > is
> > when there are phases of alloc/free of a single allocation.
> > 
> > I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
> > problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course 
> > that
> > only works if the problem is with the only, so it's not a great
> > approach. Perhaps just keeping the last allocated block around would work?
> > 
> 
> +1
> 
> I think it makes perfect sense to not free the blocks immediately, and keep
> one (or a small number) as a cache. I'm not sure why we decided not to have
> a "keeper" block, but I suspect memory consumption was my main concern at
> that point. But I never expected the cost to be this high.

I think one free block might be too low in some cases. It's pretty
common to have workloads where the number of allocations is "bursty",
and it's imo one case where one might justifiably want to use a slab
allocator... Perhaps a portion of a high watermark? Or a portion of the
in use blocks?
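
E.g. something along these lines - illustrative stand-in structures only,
not the actual slab.c code, and the cache size is an arbitrary placeholder:

#include <stdlib.h>

typedef struct KeeperBlockSketch
{
    struct KeeperBlockSketch *next; /* linkage in the empty-block cache */
} KeeperBlockSketch;

typedef struct KeeperSlabSketch
{
    KeeperBlockSketch *emptyblocks; /* cache of empty-but-not-freed blocks */
    int nempty;                     /* blocks currently sitting in the cache */
    int nblocks;                    /* total blocks owned by the context */
} KeeperSlabSketch;

#define SKETCH_MAX_EMPTY_BLOCKS 4   /* could be a fraction of a high
                                     * watermark instead */

/* called when the last chunk of 'block' has just been freed */
static void
sketch_block_became_empty(KeeperSlabSketch *slab, KeeperBlockSketch *block)
{
    if (slab->nempty < SKETCH_MAX_EMPTY_BLOCKS)
    {
        /* keep the block around for the next allocation burst */
        block->next = slab->emptyblocks;
        slab->emptyblocks = block;
        slab->nempty++;
    }
    else
    {
        /* cache is full - actually give the memory back */
        slab->nblocks--;
        free(block);
    }
}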

Hm. I wonder if we should just not populate the freelist eagerly, to
drive down the initialization cost. I.e. have a separate allocation path
for chunks that have never been allocated, by having a
SlabBlock->free_offset or such.

Sure, it adds a branch to the allocation happy path, but it also makes the
first allocation for a chunk cheaper, because there's no need to get the next
element from the freelist (adding a likely cache miss). And it should make the
allocation of a new block faster by a lot.
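
I.e. roughly this, again with simplified stand-in structures rather than the
real slab.c ones (only free_offset is the name suggested above):

#include <stddef.h>

typedef struct LazyBlockSketch
{
    char *start;                    /* first byte usable for chunks */
    size_t block_size;              /* usable bytes in the block */
    size_t free_offset;             /* start of the never-allocated region */
    void *freehead;                 /* chunks freed earlier, available again */
} LazyBlockSketch;

static void *
sketch_lazy_alloc(LazyBlockSketch *block, size_t full_chunk_size)
{
    /* reuse explicitly freed chunks first */
    if (block->freehead != NULL)
    {
        void *chunk = block->freehead;

        block->freehead = *(void **) chunk;
        return chunk;
    }

    /* otherwise hand out the next never-used chunk - no freelist walk */
    if (block->free_offset + full_chunk_size <= block->block_size)
    {
        void *chunk = block->start + block->free_offset;

        block->free_offset += full_chunk_size;
        return chunk;
    }

    return NULL;                    /* block is completely full */
}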


> > 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
> > up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
> > latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
> > i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
> > still slow enough there to be very worrisome.
> > 
> > I don't see a way to get around the division while keeping the freelist
> > structure as is. But:
> > 
> > ISTM that we only need the index because of the free-chunk list, right? Why
> > don't we make the chunk list use actual pointers? Is the concern that that'd
> > increase the minimum allocation size?  If so, I see two ways around that:
> > First, we could make the index just the offset from the start of the block,
> > that's much cheaper to calculate. Second, we could store the next pointer in
> > SlabChunk->slab/block instead (making it a union) - while on the freelist we
> > don't need to dereference those, right?
> > 
> > I suspect both would also make the block initialization a bit cheaper.
> > 
> > That should also accelerate SlabBlockGetChunk(), which currently shows up as
> > an imul, which isn't exactly fast either (and uses a lot of execution ports).
> > 
> 
> Hmm, I think you're right we could simply use the pointers, but I have not
> tried that.

I quickly tried that, and it does seem to improve matters considerably. The
block initialization still shows up as expensive, but not as bad. The div and
imul are gone (except in an assertion build right now). The list manipulation
still is visible.
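
Roughly along these lines - a simplified stand-in with made-up field names,
not the code I actually tried:

#include <stddef.h>

typedef struct PtrBlockSketch
{
    void *freehead;                 /* first free chunk in this block, or NULL */
    int nfree;                      /* number of free chunks in this block */
} PtrBlockSketch;

/* pop a chunk off the block's free list - no chunk index, so no division */
static inline void *
sketch_alloc_chunk(PtrBlockSketch *block)
{
    void *chunk = block->freehead;

    if (chunk == NULL)
        return NULL;                /* block has no free chunks left */

    /* while free, the chunk's user area holds the next-free pointer */
    block->freehead = *(void **) chunk;
    block->nfree--;
    return chunk;
}

/* push a chunk back onto the block's free list */
static inline void
sketch_free_chunk(PtrBlockSketch *block, void *chunk)
{
    *(void **) chunk = block->freehead;
    block->freehead = chunk;
    block->nfree++;
}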



> > 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
> > shows up prominently in profiles. That's ~tripling the number of cachelines
> > touched in the happy path, with unpredictable accesses to boot.
> > 
> > Perhaps we could reduce the precision of slab->freelist indexing to amortize
> > that cost? I.e. not have one slab->freelist entry for each nfree, but instead
> > have an upper limit on the number of freelists?
> > 
> 
> Yeah. The purpose of organizing the freelists like this is to prioritize the
> "more full" blocks when allocating new chunks, in the hope that the "less
> full" blocks will end up empty and freed faster.
> 
> But this is naturally imprecise, and strongly depends on the workload, of
> course, and I bet for most cases a less precise approach would work just as
> well.
> 
> I'm not sure how exactly the upper limit you propose would work, but perhaps
> we could group the blocks into nfree ranges, say [0-15], [16-31] and so on.
> So after the alloc/free we'd calculate the new freelist index as (nfree/16)
> and only move the block if the index changed. This would reduce the
> overhead to 1/16 and probably even more in practice.

> Of course, we could also say we have e.g. 8 freelists and work the ranges
> backwards from that, I guess that's what you propose.

Yea, I was thinking something along those lines. As you say, how about there's
always at most 8 freelists or such. During 

Re: slab allocator performance issues

2021-07-17 Thread Tomas Vondra

Hi,

On 7/17/21 9:43 PM, Andres Freund wrote:

Hi,

I just tried to use the slab allocator for a case where aset.c was
bloating memory usage substantially. First: It worked wonders for memory
usage, nearly eliminating overhead.

But it turned out to cause a *substantial* slowdown. With aset the
allocator is barely in the profile. With slab the profile is dominated
by allocator performance.

slab:
NOTICE:  00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880
   Overhead  Command   Shared Object Symbol

+   28.27%  postgres  postgres  [.] SlabAlloc
+9.64%  postgres  bdbench.so[.] bfm_delete
+9.03%  postgres  bdbench.so[.] bfm_set
+8.39%  postgres  bdbench.so[.] bfm_lookup
+8.36%  postgres  bdbench.so[.] bfm_set_leaf.constprop.0
+8.16%  postgres  libc-2.31.so  [.] __memmove_avx_unaligned_erms
+6.88%  postgres  bdbench.so[.] bfm_delete_leaf
+3.24%  postgres  libc-2.31.so  [.] _int_malloc
+2.58%  postgres  bdbench.so[.] bfm_tests
+2.33%  postgres  postgres  [.] SlabFree
+1.29%  postgres  libc-2.31.so  [.] _int_free
+1.09%  postgres  libc-2.31.so  [.] unlink_chunk.constprop.0

aset:

NOTICE:  00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

+   16.43%  postgres  bdbench.so[.] bfm_lookup
+   15.38%  postgres  bdbench.so[.] bfm_delete
+   12.82%  postgres  libc-2.31.so  [.] __memmove_avx_unaligned_erms
+   12.65%  postgres  bdbench.so[.] bfm_set
+   12.15%  postgres  bdbench.so[.] bfm_set_leaf.constprop.0
+   10.57%  postgres  bdbench.so[.] bfm_delete_leaf
+4.05%  postgres  bdbench.so[.] bfm_tests
+2.93%  postgres  [kernel.vmlinux]  [k] clear_page_erms
+1.59%  postgres  postgres  [.] AllocSetAlloc
+1.15%  postgres  bdbench.so[.] memmove@plt
+1.06%  postgres  bdbench.so[.] bfm_grow_leaf_16

OS:
NOTICE:  00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880


That is somewhat surprising - part of the promise of a slab allocator is
that it's fast...


This is caused by multiple issues, I think. Some of which seem fairly easy to
fix.

1) If allocations are short-lived, slab.c can end up constantly
freeing/initializing blocks. Which requires fairly expensively iterating over
all potential chunks in the block and initializing it. Just to then free that
memory again after a small number of allocations. The extreme case of this is
when there are phases of alloc/free of a single allocation.

I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
only works if the problem is with the only, so it's not a great
approach. Perhaps just keeping the last allocated block around would work?



+1

I think it makes perfect sense to not free the blocks immediately, and 
keep one (or a small number) as a cache. I'm not sure why we decided not 
to have a "keeper" block, but I suspect memory consumption was my main 
concern at that point. But I never expected the cost to be this high.




2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
still slow enough there to be very worrisome.

I don't see a way to get around the division while keeping the freelist
structure as is. But:

ISTM that we only need the index because of the free-chunk list, right? Why
don't we make the chunk list use actual pointers? Is the concern that that'd
increase the minimum allocation size?  If so, I see two ways around that:
First, we could make the index just the offset from the start of the block,
that's much cheaper to calculate. Second, we could store the next pointer in
SlabChunk->slab/block instead (making it a union) - while on the freelist we
don't need to dereference those, right?

I suspect both would also make the block initialization a bit cheaper.

That should also accelerate SlabBlockGetChunk(), which currently shows up as
an imul, which isn't exactly fast either (and uses a lot of execution ports).



Hmm, I think you're right we could simply use the pointers, but I have 
not tried that.
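
For reference, roughly what the two mappings look like - simplified, not the
real SlabChunkIndex() / SlabBlockGetChunk() definitions:

#include <stddef.h>
#include <stdint.h>

/* current style: chunk -> index needs a division by the chunk size */
static inline uint32_t
chunk_index_by_div(char *block_start, char *chunk, size_t full_chunk_size)
{
    return (uint32_t) ((chunk - block_start) / full_chunk_size);
}

/* current style: index -> chunk needs a multiplication */
static inline char *
chunk_by_index(char *block_start, uint32_t idx, size_t full_chunk_size)
{
    return block_start + (size_t) idx * full_chunk_size;
}

/* proposed: store the byte offset instead, both directions become trivial */
static inline uint32_t
chunk_offset(char *block_start, char *chunk)
{
    return (uint32_t) (chunk - block_start);
}

static inline char *
chunk_by_offset(char *block_start, uint32_t offset)
{
    return block_start + offset;
}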




3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
shows up prominently in profiles. That's ~tripling the number of cachelines
touched in the happy path, with unpredictable accesses to boot.

Perhaps we could reduce the precision of slab->freelist indexing to amortize
that cost? I.e. not have one slab->freelist entry for each nfree, but instead
have an upper limit on the 

Re: slab allocator performance issues

2021-07-17 Thread Andres Freund
Hi,

On 2021-07-17 12:43:33 -0700, Andres Freund wrote:
> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
> up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
> latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
> i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
> still slow enough there to be very worrisome.
> 
> I don't see a way to get around the division while keeping the freelist
> structure as is. But:
> 
> ISTM that we only need the index because of the free-chunk list, right? Why
> don't we make the chunk list use actual pointers? Is the concern that that'd
> increase the minimum allocation size?  If so, I see two ways around that:
> First, we could make the index just the offset from the start of the block,
> that's much cheaper to calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the freelist we
> don't need to dereference those, right?
> 
> I suspect both would also make the block initialization a bit cheaper.
> 
> That should also accelerate SlabBlockGetChunk(), which currently shows up as
> an imul, which isn't exactly fast either (and uses a lot of execution ports).

Oh - I just saw that effectively the allocation size already is at least the
size of a uintptr_t. I had only seen

/* Make sure the linked list node fits inside a freed chunk */
if (chunkSize < sizeof(int))
    chunkSize = sizeof(int);
but it's followed by
/* chunk, including SLAB header (both addresses nicely aligned) */
fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize);

which means we are reserving enough space for a pointer on just about
any platform already? Seems we can just make that official and reserve
space for a pointer as part of the chunk size rounding up, instead of
fullChunkSize?
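
I.e. something like this, written as a hypothetical standalone helper
mirroring the snippet above (header_size standing in for sizeof(SlabChunk)),
not the actual code:

#include "postgres.h"

static Size
sketch_full_chunk_size(Size chunkSize, Size header_size)
{
    /* make sure an intrusive free-list pointer fits inside a freed chunk */
    if (chunkSize < sizeof(void *))
        chunkSize = sizeof(void *);

    /* chunk, including SLAB header (both addresses nicely aligned) */
    return header_size + MAXALIGN(chunkSize);
}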

Greetings,

Andres Freund