Hi David,

>Over the past few days I've been gathering some benchmark results
>together to show the sort performance improvements in PG15 [1].
>One of the test cases I did was to demonstrate Heikki's change to use
>a k-way merge (65014000b).
>The test I did to try this out was along the lines of:
>set max_parallel_workers_per_gather = 0;
>create table t (a bigint not null, b bigint not null, c bigint not
>null, d bigint not null, e bigint not null, f bigint not null);
>insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
>vacuum freeze t;
>The query I ran was:
>select * from t order by a offset 140247142;

I redid this test here:

Windows 10 64 bits
msvc 2019 64 bits
RAM 8GB
SSD 256 GB

HEAD (default configuration):
Time: 229396,551 ms (03:49,397)

PATCHED:
Time: 220887,346 ms (03:40,887)

>I tested various sizes of work_mem starting at 4MB and doubled that
>all the way to 16GB. For many of the smaller values of work_mem the
>performance is vastly improved by Heikki's change, however for
>work_mem = 64MB I detected quite a large slowdown. PG14 took 20.9
>seconds and PG15 beta 1 took 29 seconds!
>I've been trying to get to the bottom of this today and finally have
>discovered this is due to the tuple size allocations in the sort being
>exactly 64 bytes. Prior to 40af10b57 (Use Generation memory contexts
>to store tuples in sorts) the tuple for the sort would be stored in an
>aset context. After 40af10b57 we'll use a generation context. The
>idea with that change is that the generation context does no
>power-of-2 round ups for allocations, so we save memory in most cases.
>However, due to this particular test having a tuple size of 64-bytes,
>there was no power-of-2 wastage with aset.
>The problem is that generation chunks have a larger chunk header than
>aset do due to having to store the block pointer that the chunk
>belongs to so that GenerationFree() can increment the nfree chunks in
>the block. aset.c does not require this as freed chunks just go onto a
>freelist that's global to the entire context.
>Basically, for my test query, the slowdown is because instead of being
>able to store 620702 tuples per tape over 226 tapes with an aset
>context, we can now only store 576845 tuples per tape resulting in
>requiring 244 tapes when using the generation context.
>If I had added column "g" to make the tuple size 72 bytes causing
>aset's code to round allocations up to 128 bytes and generation.c to
>maintain the 72 bytes then the sort would have stored 385805 tuples
>over 364 batches for aset and 538761 tuples over 261 batches using the
>generation context. That would have been a huge win.
>So it basically looks like I discovered a very bad case that causes a
>significant slowdown. Yet other cases that are not an exact power of
>2 stand to gain significantly from this change.
>One thing 40af10b57 does is stops those terrible performance jumps
>when the tuple size crosses a power-of-2 boundary. The performance
>should be more aligned to the size of the data being sorted now...
>Unfortunately, that seems to mean regressions for large sorts with
>power-of-2 sized tuples.

It seems to me that the solution would be to use aset allocations when
the tuple size is a power of 2?

    if (state->sortopt & TUPLESORT_ALLOWBOUNDED ||
        (state->memtupsize & (state->memtupsize - 1)) == 0)
        state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                                    "Caller tuples",
                                                    ALLOCSET_DEFAULT_SIZES);
    else
        state->tuplecontext = GenerationContextCreate(state->sortcontext,
                                                      "Caller tuples",
                                                      ALLOCSET_DEFAULT_SIZES);

I took a look and tried some improvements to see if I had a better
result. Would you mind taking a look and testing?

regards,
Ranier Vilela
CREATE TABLE t_test (x numeric);
INSERT INTO t_test SELECT random()
FROM generate_series(1, 5000000);
ANALYZE;
SHOW work_mem;
HEAD:
postgres=# explain analyze SELECT * FROM t_test ORDER BY x;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=397084.73..883229.71 rows=4166666 width=11) (actual
time=1326.281..2718.040 rows=5000000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=396084.71..401293.04 rows=2083333 width=11) (actual
time=1287.168..1520.520 rows=1666667 loops=3)
Sort Key: x
Sort Method: external merge Disk: 24880kB
Worker 0: Sort Method: external merge Disk: 24776kB
Worker 1: Sort Method: external merge Disk: 23960kB
-> Parallel Seq Scan on t_test (cost=0.00..47861.33 rows=2083333
width=11) (actual time=0.241..135.730 rows=1666667 loops=3)
Planning Time: 0.054 ms
Execution Time: 2837.789 ms
(11 rows)
PATCHED:
postgres=# explain analyze SELECT * FROM t_test ORDER BY x;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=397084.73..883229.71 rows=4166666 width=11) (actual
time=1283.818..2696.469 rows=5000000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=396084.71..401293.04 rows=2083333 width=11) (actual
time=1253.459..1485.734 rows=1666667 loops=3)
Sort Key: x
Sort Method: external merge Disk: 25136kB
Worker 0: Sort Method: external merge Disk: 24096kB
Worker 1: Sort Method: external merge Disk: 24304kB
-> Parallel Seq Scan on t_test (cost=0.00..47861.33 rows=2083333
width=11) (actual time=0.256..133.065 rows=1666667 loops=3)
Planning Time: 0.055 ms
Execution Time: 2816.495 ms
(11 rows)
David's benchmark:
set max_parallel_workers_per_gather = 0;
create table t (a bigint not null, b bigint not null, c bigint not
null, d bigint not null, e bigint not null, f bigint not null);
insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
vacuum freeze t;
select * from t order by a offset 140247142;
HEAD:
postgres=# select * from t order by a offset 140247142;
a | b | c | d | e | f
---+---+---+---+---+---
(0 rows)
Time: 229396,551 ms (03:49,397)
Time: 231432,750 ms (03:51,433)
PATCHED:
postgres=# select * from t order by a offset 140247142;
a | b | c | d | e | f
---+---+---+---+---+---
(0 rows)
Time: 220887,346 ms (03:40,887)
Time: 222652,487 ms (03:42,652)
Ronan's benchmark:
Setup 1: single table, 1 000 000 tuples, no index
CREATE TABLE tbench (
a int,
b int
);
INSERT INTO tbench (a, b) SELECT a, b FROM generate_series(1, 100) a,
generate_series(1, 10000) b;
Test 1: Single-column ordered select (order by b since the table is already
sorted by a)
select b from tbench order by b;
HEAD:
postgres=# explain analyze select b from tbench order by b;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Sort (cost=266422.45..271422.45 rows=2000000 width=4) (actual
time=425.062..559.464 rows=2000000 loops=1)
Sort Key: b
Sort Method: external merge Disk: 23528kB
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=4) (actual
time=0.056..168.422 rows=2000000 loops=1)
Planning Time: 0.058 ms
Execution Time: 605.315 ms
(6 rows)
PATCHED:
postgres=# explain analyze select b from tbench order by b;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Sort (cost=266422.45..271422.45 rows=2000000 width=4) (actual
time=421.642..557.750 rows=2000000 loops=1)
Sort Key: b
Sort Method: external merge Disk: 23528kB
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=4) (actual
time=0.047..167.848 rows=2000000 loops=1)
Planning Time: 0.080 ms
Execution Time: 603.583 ms
(6 rows)
Test 2: Ordered sum (using b so that the input is not presorted)
select sum(b order by b) from tbench;
HEAD:
postgres=# explain analyze select sum(b order by b) from tbench;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=33850.00..33850.01 rows=1 width=8) (actual
time=547.316..547.317 rows=1 loops=1)
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=4) (actual
time=0.074..134.147 rows=2000000 loops=1)
Planning Time: 0.060 ms
Execution Time: 547.339 ms
(4 rows)
PATCHED:
postgres=# explain analyze select sum(b order by b) from tbench;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=33850.00..33850.01 rows=1 width=8) (actual
time=544.565..544.566 rows=1 loops=1)
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=4) (actual
time=0.044..134.305 rows=2000000 loops=1)
Planning Time: 0.050 ms
Execution Time: 544.633 ms
(4 rows)
Test 3: Ordered sum + group by
select b, sum(a order by a) from tbench GROUP BY b;
HEAD:
postgres=# explain analyze select b, sum(a order by a) from tbench GROUP BY b;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=266422.45..281522.48 rows=10003 width=12) (actual
time=740.557..1201.470 rows=10000 loops=1)
Group Key: b
-> Sort (cost=266422.45..271422.45 rows=2000000 width=8) (actual
time=740.490..905.244 rows=2000000 loops=1)
Sort Key: b
Sort Method: external merge Disk: 35216kB
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=8)
(actual time=0.040..192.818 rows=2000000 loops=1)
Planning Time: 0.080 ms
Execution Time: 1206.248 ms
(8 rows)
PATCHED:
postgres=# explain analyze select b, sum(a order by a) from tbench GROUP BY b;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=266422.45..281522.48 rows=10003 width=12) (actual
time=744.886..1196.399 rows=10000 loops=1)
Group Key: b
-> Sort (cost=266422.45..271422.45 rows=2000000 width=8) (actual
time=744.822..906.223 rows=2000000 loops=1)
Sort Key: b
Sort Method: external merge Disk: 35216kB
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=8)
(actual time=0.053..198.255 rows=2000000 loops=1)
Planning Time: 0.084 ms
Execution Time: 1200.807 ms
(8 rows)
Setup 2: same as before, but adding an index on (b, a)
CREATE INDEX ON tbench (b, a);
Test 2: Ordered sum:
select sum(a order by a) from tbench;
HEAD:
postgres=# explain analyze select sum(a order by a) from tbench;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=33850.00..33850.01 rows=1 width=8) (actual
time=417.679..417.679 rows=1 loops=1)
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=4) (actual
time=0.036..134.878 rows=2000000 loops=1)
Planning Time: 1.185 ms
Execution Time: 417.732 ms
(4 rows)
PATCHED:
postgres=# explain analyze select sum(a order by a) from tbench;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=33850.00..33850.01 rows=1 width=8) (actual
time=421.826..421.827 rows=1 loops=1)
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=4) (actual
time=0.063..134.843 rows=2000000 loops=1)
Planning Time: 0.070 ms
Execution Time: 421.885 ms
(4 rows)
Test 3: Ordered sum + group by:
select a, sum(b order by b) from tbench GROUP BY a;
HEAD:
postgres=# explain analyze select a, sum(b order by b) from tbench GROUP BY a;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=199979.56..214980.56 rows=100 width=12) (actual
time=502.326..1048.130 rows=100 loops=1)
Group Key: a
-> Sort (cost=199979.56..204979.56 rows=2000000 width=8) (actual
time=496.952..654.468 rows=2000000 loops=1)
Sort Key: a
Sort Method: external merge Disk: 35216kB
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=8)
(actual time=0.064..136.186 rows=2000000 loops=1)
Planning Time: 0.137 ms
Execution Time: 1052.026 ms
(8 rows)
PATCHED:
postgres=# explain analyze select a, sum(b order by b) from tbench GROUP BY a;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=199979.56..214980.56 rows=100 width=12) (actual
time=485.994..1033.928 rows=100 loops=1)
Group Key: a
-> Sort (cost=199979.56..204979.56 rows=2000000 width=8) (actual
time=480.617..637.878 rows=2000000 loops=1)
Sort Key: a
Sort Method: external merge Disk: 35216kB
-> Seq Scan on tbench (cost=0.00..28850.00 rows=2000000 width=8)
(actual time=0.066..136.371 rows=2000000 loops=1)
Planning Time: 0.117 ms
Execution Time: 1037.586 ms
(8 rows)
002-improve-sort.patch
Description: Binary data
