Hi hackers,

While migrating from PostgreSQL 14 to 15, we encountered the following performance degradation caused by commit 46846433a03dff: "shm_mq: Update mq_bytes_written less often", discussion in [1].

The batching can make queries with a LIMIT clause run significantly slower compared to PostgreSQL 14, because neither the ring buffer write position is updated, nor the latch to inform the leader that there's data available is set, before a worker's queue is 1/4th full. This can be seen in the number of rows produced by a parallel worker. Worst-case, the data set is large and all rows to answer the query appear early, but are not big enough to fill the queue to 1/4th (e.g. when the LIMIT and the tuple sizes are small). Here is an example to reproduce the problem.

CREATE TABLE t(id1 INT, id2 INT, id3 INT, id4 INT, id5 INT);
INSERT INTO t(id1, id2, id3, id4, id5) SELECT i%1000, i, i, i, i FROM generate_series(1, 10000000) AS i;
ANALYZE t;
SET parallel_tuple_cost = 0;
SET parallel_setup_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT id2 FROM t WHERE id1 = 100 LIMIT 100;

PostgreSQL 15:

 Limit  (cost=0.00..797.43 rows=100 width=4) (actual time=65.083..69.207 rows=100 loops=1)
   Output: id2
   ->  Gather  (cost=0.00..79320.18 rows=9947 width=4) (actual time=65.073..68.417 rows=100 loops=1)
         Output: id2
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.t (cost=0.00..79320.18 rows=1243 width=4) (actual time=0.204..33.049 rows=100 loops=7)
               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 99345
               Worker 0:  actual time=0.334..32.284 rows=100 loops=1
               Worker 1:  actual time=0.060..32.680 rows=100 loops=1
               Worker 2:  actual time=0.637..33.954 rows=98 loops=1
               Worker 3:  actual time=0.136..33.301 rows=100 loops=1
               Worker 4:  actual time=0.140..31.942 rows=100 loops=1
               Worker 5:  actual time=0.062..33.673 rows=100 loops=1
               Worker 6:  actual time=0.062..33.512 rows=100 loops=1
 Planning Time: 0.113 ms
 Execution Time: 69.772 ms

PostgreSQL 14:

 Limit  (cost=0.00..797.75 rows=100 width=4) (actual time=30.602..38.459 rows=100 loops=1)
   Output: id2
   ->  Gather  (cost=0.00..79320.18 rows=9943 width=4) (actual time=30.592..37.669 rows=100 loops=1)
         Output: id2
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.t (cost=0.00..79320.18 rows=1243 width=4) (actual time=0.221..5.181 rows=15 loops=7)
               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 15241
               Worker 0:  actual time=0.129..4.840 rows=15 loops=1
               Worker 1:  actual time=0.125..4.924 rows=15 loops=1
               Worker 2:  actual time=0.314..5.249 rows=17 loops=1
               Worker 3:  actual time=0.252..5.341 rows=15 loops=1
               Worker 4:  actual time=0.163..5.179 rows=15 loops=1
               Worker 5:  actual time=0.422..5.248 rows=15 loops=1
               Worker 6:  actual time=0.139..5.489 rows=16 loops=1
 Planning Time: 0.084 ms
 Execution Time: 38.880 ms

I had a quick look at the code and I started wondering if we can't achieve the same performance improvement without batching by e.g.:

- Only set the latch if new data is written to an empty queue. Otherwise, the leader should anyways keep try reading from the queues without waiting for the latch, so no need to set the latch again.

- Reorganize struct shm_mq. There seems to be false sharing happening between at least mq_ring_size and the atomics and potentially also between the atomics. I'm wondering if the that's not the root cause of the "slow atomics" observed in [1]? I'm happy to do some profiling.

Alternatively, we could always set the latch if numberTuples in ExecutePlan() is reasonably low. To do so, the DestReceiver's receive() method would only need an additional "force flush" argument.


A slightly different but related problem is when some workers have already produced enough rows to answer the LIMIT query, but other workers are still running without producing any new rows. In that case the "already done" workers will stop running even though they haven't reached 1/4th of the queue size, because the for-loop in execMain.c bails out in the following condition:

        if (numberTuples && numberTuples == current_tuple_count)
            break;

Subsequently, the leader will end the plan and then wait in the Gather node for all workers to shutdown. However, workers still running but not producing any new rows will never reach the following condition in execMain.c to check if they're supposed to stop (the shared memory queue dest receiver will return false on detached queues):

            /*
             * If we are not able to send the tuple, we assume the destination              * has closed and no more tuples can be sent. If that's the case,
             * end the loop.
             */
            if (!dest->receiveSlot(slot, dest))
                break;

Reproduction steps for this problem are below. Here the worker getting the first table page will be done right away, but the query takes as long as it takes to scan all pages of the entire table.

CREATE TABLE bar (col INT);
INSERT INTO bar SELECT generate_series(1, 5000000);
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT col FROM bar WHERE col = 1 LIMIT 1;

 Limit  (cost=0.00..1.10 rows=1 width=4) (actual time=32.289..196.200 rows=1 loops=1)
   Output: col
   ->  Gather  (cost=0.00..30939.03 rows=28208 width=4) (actual time=32.278..196.176 rows=1 loops=1)
         Output: col
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.bar (cost=0.00..30939.03 rows=3526 width=4) (actual time=137.251..137.255 rows=0 loops=7)
               Output: col
               Filter: (bar.col = 1)
               Rows Removed by Filter: 713769
               Worker 0:  actual time=160.177..160.181 rows=0 loops=1
               Worker 1:  actual time=160.111..160.115 rows=0 loops=1
               Worker 2:  actual time=0.043..0.047 rows=1 loops=1
               Worker 3:  actual time=160.040..160.044 rows=0 loops=1
               Worker 4:  actual time=160.167..160.171 rows=0 loops=1
               Worker 5:  actual time=160.018..160.022 rows=0 loops=1
               Worker 6:  actual time=160.201..160.205 rows=0 loops=1
 Planning Time: 0.087 ms
 Execution Time: 196.247 ms

We would need something similar to CHECK_FOR_INTERRUPTS() which returns a NULL slot if a parallel worker is supposed to stop execution (we could e.g. check if the queue got detached). Or could we amend CHECK_FOR_INTERRUPTS() to just stop the worker gracefully if the queue got detached?

Jasper Smit, Spiros Agathos and Dimos Stamatakis helped working on this.

[1] https://www.postgresql.org/message-id/flat/CAFiTN-tVXqn_OG7tHNeSkBbN%2BiiCZTiQ83uakax43y1sQb2OBA%40mail.gmail.com

--
David Geier
(ServiceNow)


Reply via email to