On Mon, May 23, 2022 at 18:01, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> FYI not sure why, but your responses seem to break threading quite
> often, due to missing headers identifying the message you're responding
> to (In-Reply-To, References). Not sure why or how to fix it, but this
> makes it much harder to follow the discussion.
>
> On 5/22/22 21:11, Ranier Vilela wrote:
> > Hi David,
> >
> >> Over the past few days I've been gathering some benchmark results
> >> together to show the sort performance improvements in PG15 [1].
> >>
> >> One of the test cases I did was to demonstrate Heikki's change to use
> >> a k-way merge (65014000b).
> >>
> >> The test I did to try this out was along the lines of:
> >>
> >> set max_parallel_workers_per_gather = 0;
> >> create table t (a bigint not null, b bigint not null, c bigint not
> >> null, d bigint not null, e bigint not null, f bigint not null);
> >>
> >> insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
> >> vacuum freeze t;
> >>
> >> The query I ran was:
> >>
> >> select * from t order by a offset 140247142;
> >
> > I redid this test here:
> > Windows 10 64 bits
> > msvc 2019 64 bits
> > RAM 8GB
> > SSD 256 GB
> >
> > HEAD (default configuration)
> > Time: 229396,551 ms (03:49,397)
> > PATCHED:
> > Time: 220887,346 ms (03:40,887)
>
> This is 10x longer than reported by David. Presumably David used a
> machine with a lot of RAM, while your system has 8GB and so is I/O bound.

Probably, but Windows is certainly slower than Linux as well.

> Also, what exactly does "patched" mean? The patch you attached?

It means the results of the benchmark with the patch applied.

> >> I tested various sizes of work_mem starting at 4MB and doubled that
> >> all the way to 16GB. For many of the smaller values of work_mem the
> >> performance is vastly improved by Heikki's change, however for
> >> work_mem = 64MB I detected quite a large slowdown. PG14 took 20.9
> >> seconds and PG15 beta 1 took 29 seconds!
> >>
> >> I've been trying to get to the bottom of this today and finally have
> >> discovered this is due to the tuple size allocations in the sort being
> >> exactly 64 bytes. Prior to 40af10b57 (Use Generation memory contexts
> >> to store tuples in sorts) the tuple for the sort would be stored in an
> >> aset context. After 40af10b57 we'll use a generation context. The
> >> idea with that change is that the generation context does no
> >> power-of-2 round-ups for allocations, so we save memory in most cases.
> >> However, due to this particular test having a tuple size of 64 bytes,
> >> there was no power-of-2 wastage with aset.
> >>
> >> The problem is that generation chunks have a larger chunk header than
> >> aset's do, due to having to store the pointer to the block that the
> >> chunk belongs to, so that GenerationFree() can increment the nfree
> >> chunks in the block. aset.c does not require this, as freed chunks
> >> just go onto a freelist that's global to the entire context.
> >>
> >> Basically, for my test query, the slowdown is because instead of being
> >> able to store 620702 tuples per tape over 226 tapes with an aset
> >> context, we can now only store 576845 tuples per tape, resulting in
> >> requiring 244 tapes when using the generation context.
> >>
> >> If I had added column "g" to make the tuple size 72 bytes, causing
> >> aset's code to round allocations up to 128 bytes and generation.c to
> >> maintain the 72 bytes, then the sort would have stored 385805 tuples
> >> over 364 batches for aset and 538761 tuples over 261 batches using the
> >> generation context. That would have been a huge win.
> >>
> >> So it basically looks like I discovered a very bad case that causes a
> >> significant slowdown. Yet other cases that are not an exact power of
> >> 2 stand to gain significantly from this change.
> >>
> >> One thing 40af10b57 does is stop those terrible performance jumps
> >> when the tuple size crosses a power-of-2 boundary.
> >> The performance
> >> should be more aligned to the size of the data being sorted now...
> >> Unfortunately, that seems to mean regressions for large sorts with
> >> power-of-2 sized tuples.
> >
> > It seems to me that the solution would be to use aset allocations
> > when the size of the tuples is a power of 2?
> >
> > if (state->sortopt & TUPLESORT_ALLOWBOUNDED ||
> >     (state->memtupsize & (state->memtupsize - 1)) == 0)
> >     state->tuplecontext = AllocSetContextCreate(state->sortcontext,
> >         "Caller tuples", ALLOCSET_DEFAULT_SIZES);
> > else
> >     state->tuplecontext = GenerationContextCreate(state->sortcontext,
> >         "Caller tuples", ALLOCSET_DEFAULT_SIZES);
>
> I'm pretty sure this is pointless, because memtupsize is the size of the
> memtuples array. But the issue is about the size of the tuples. After all,
> David was talking about 64B chunks, but the array is always at least
> 1024 elements, so it obviously can't be the same thing.

It was more of a guess.

> How would we even know how large the tuples will be at this point,
> before we even see the first of them?

I don't know how.

> > I took a look and tried some improvements to see if I had a better
> > result.
>
> IMHO special-casing should be the last resort, because it makes the
> behavior much harder to follow.

Probably, but in my tests there has been some gain.

> Also, we're talking about sort, but don't other places using Generation
> contexts have the same issue?

Well, for PG15 that is what was addressed, just sort.

> It would be preferable to find a fix addressing all those places.

For now, I'm just focused on sorts.

regards,
Ranier Vilela