On Mon, May 23, 2022 at 18:01, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> FYI not sure why, but your responses seem to break threading quite
> often, due to missing headers identifying the message you're responding
> to (In-Reply-To, References). Not sure why or how to fix it, but this
> makes it much harder to follow the discussion.
>
> On 5/22/22 21:11, Ranier Vilela wrote:
> > Hi David,
> >
> >> Over the past few days I've been gathering some benchmark results
> >> together to show the sort performance improvements in PG15 [1].
> >>
> >> One of the test cases I did was to demonstrate Heikki's change to use
> >> a k-way merge (65014000b).
> >>
> >> The test I did to try this out was along the lines of:
> >>
> >> set max_parallel_workers_per_gather = 0;
> >> create table t (a bigint not null, b bigint not null, c bigint not
> >> null, d bigint not null, e bigint not null, f bigint not null);
> >>
> >> insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
> >> vacuum freeze t;
> >>
> >> The query I ran was:
> >>
> >> select * from t order by a offset 140247142;
> >
> > I redid this test here:
> > Windows 10 64 bits
> > msvc 2019 64 bits
> > RAM 8GB
> > SSD 256 GB
> >
> > HEAD (default configuration)
> > Time: 229396,551 ms (03:49,397)
> > PATCHED:
> > Time: 220887,346 ms (03:40,887)
>
> This is 10x longer than reported by David. Presumably David used a
> machine with a lot of RAM, while your system has 8GB and so is I/O bound.

Probably, but Windows is certainly slower than Linux as well.

> Also, what exactly does "patched" mean? The patch you attached?

It means the results of the benchmark with the patch applied.

> >> I tested various sizes of work_mem starting at 4MB and doubled that
> >> all the way to 16GB. For many of the smaller values of work_mem the
> >> performance is vastly improved by Heikki's change, however for
> >> work_mem = 64MB I detected quite a large slowdown. PG14 took 20.9
> >> seconds and PG15 beta 1 took 29 seconds!
> >>
> >> I've been trying to get to the bottom of this today and finally have
> >> discovered this is due to the tuple size allocations in the sort being
> >> exactly 64 bytes. Prior to 40af10b57 (Use Generation memory contexts
> >> to store tuples in sorts) the tuple for the sort would be stored in an
> >> aset context. After 40af10b57 we'll use a generation context. The
> >> idea with that change is that the generation context does no
> >> power-of-2 round-ups for allocations, so we save memory in most cases.
> >> However, due to this particular test having a tuple size of 64 bytes,
> >> there was no power-of-2 wastage with aset.
> >>
> >> The problem is that generation chunks have a larger chunk header than
> >> aset's do, due to having to store the pointer to the block that the
> >> chunk belongs to, so that GenerationFree() can increment the nfree
> >> chunks in the block. aset.c does not require this, as freed chunks
> >> just go onto a freelist that's global to the entire context.
> >>
> >> Basically, for my test query, the slowdown is because instead of being
> >> able to store 620702 tuples per tape over 226 tapes with an aset
> >> context, we can now only store 576845 tuples per tape, resulting in
> >> requiring 244 tapes when using the generation context.
> >>
> >> If I had added column "g" to make the tuple size 72 bytes, causing
> >> aset's code to round allocations up to 128 bytes and generation.c to
> >> maintain the 72 bytes, then the sort would have stored 385805 tuples
> >> over 364 batches for aset and 538761 tuples over 261 batches using the
> >> generation context. That would have been a huge win.
> >>
> >> So it basically looks like I discovered a very bad case that causes a
> >> significant slowdown. Yet other cases that are not an exact power of
> >> 2 stand to gain significantly from this change.
> >>
> >> One thing 40af10b57 does is stop those terrible performance jumps
> >> when the tuple size crosses a power-of-2 boundary.
> >> The performance
> >> should be more aligned to the size of the data being sorted now...
> >> Unfortunately, that seems to mean regressions for large sorts with
> >> power-of-2 sized tuples.
> >
> > It seems to me that the solution would be to use aset allocations
> > when the size of the tuples is a power of 2?
> >
> > if (state->sortopt & TUPLESORT_ALLOWBOUNDED ||
> >     (state->memtupsize & (state->memtupsize - 1)) == 0)
> >     state->tuplecontext = AllocSetContextCreate(state->sortcontext,
> >         "Caller tuples", ALLOCSET_DEFAULT_SIZES);
> > else
> >     state->tuplecontext = GenerationContextCreate(state->sortcontext,
> >         "Caller tuples", ALLOCSET_DEFAULT_SIZES);
>
> I'm pretty sure this is pointless, because memtupsize is the size of the
> memtuples array. But the issue is about the size of the tuples. After all,
> David was talking about 64B chunks, but the array is always at least
> 1024 elements, so it obviously can't be the same thing.

It was more of a guess.

> How would we even know how large the tuples will be at this point,
> before we even see the first of them?

I don't know how.

> > I took a look and tried some improvements to see if I had a better
> > result.
>
> IMHO special-casing should be the last resort, because it makes the
> behavior much harder to follow.

Probably, but in my tests there has been some gain.

> Also, we're talking about sort, but don't other places using Generation
> contexts have the same issue?

Well, for PG15 that is what was addressed, just sort.

> It would be preferable to find a fix addressing all those places.

For now, I'm just focused on sorts.

regards,
Ranier Vilela