Performance issues with parallelism and LIMIT

David Geier Wed, 01 Feb 2023 05:41:24 -0800

Hi hackers,

While migrating from PostgreSQL 14 to 15, we encountered the followingperformance degradation caused by commit 46846433a03dff: "shm_mq: Updatemq_bytes_written less often", discussion in [1].

The batching can make queries with a LIMIT clause run significantlyslower compared to PostgreSQL 14, because neither the ring buffer writeposition is updated, nor the latch to inform the leader that there'sdata available is set, before a worker's queue is 1/4th full. This canbe seen in the number of rows produced by a parallel worker. Worst-case,the data set is large and all rows to answer the query appear early, butare not big enough to fill the queue to 1/4th (e.g. when the LIMIT andthe tuple sizes are small). Here is an example to reproduce the problem.


CREATE TABLE t(id1 INT, id2 INT, id3 INT, id4 INT, id5 INT);

INSERT INTO t(id1, id2, id3, id4, id5) SELECT i%1000, i, i, i, i FROMgenerate_series(1, 10000000) AS i;

ANALYZE t;
SET parallel_tuple_cost = 0;
SET parallel_setup_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT id2 FROM t WHERE id1 = 100 LIMIT 100;

PostgreSQL 15:

Limit (cost=0.00..797.43 rows=100 width=4) (actualtime=65.083..69.207 rows=100 loops=1)

   Output: id2

-> Gather (cost=0.00..79320.18 rows=9947 width=4) (actualtime=65.073..68.417 rows=100 loops=1)

         Output: id2
         Workers Planned: 8
         Workers Launched: 7

-> Parallel Seq Scan on public.t (cost=0.00..79320.18rows=1243 width=4) (actual time=0.204..33.049 rows=100 loops=7)

               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 99345
               Worker 0:  actual time=0.334..32.284 rows=100 loops=1
               Worker 1:  actual time=0.060..32.680 rows=100 loops=1
               Worker 2:  actual time=0.637..33.954 rows=98 loops=1
               Worker 3:  actual time=0.136..33.301 rows=100 loops=1
               Worker 4:  actual time=0.140..31.942 rows=100 loops=1
               Worker 5:  actual time=0.062..33.673 rows=100 loops=1
               Worker 6:  actual time=0.062..33.512 rows=100 loops=1
 Planning Time: 0.113 ms
 Execution Time: 69.772 ms

PostgreSQL 14:

Limit (cost=0.00..797.75 rows=100 width=4) (actualtime=30.602..38.459 rows=100 loops=1)

   Output: id2

-> Gather (cost=0.00..79320.18 rows=9943 width=4) (actualtime=30.592..37.669 rows=100 loops=1)

         Output: id2
         Workers Planned: 8
         Workers Launched: 7

-> Parallel Seq Scan on public.t (cost=0.00..79320.18rows=1243 width=4) (actual time=0.221..5.181 rows=15 loops=7)

               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 15241
               Worker 0:  actual time=0.129..4.840 rows=15 loops=1
               Worker 1:  actual time=0.125..4.924 rows=15 loops=1
               Worker 2:  actual time=0.314..5.249 rows=17 loops=1
               Worker 3:  actual time=0.252..5.341 rows=15 loops=1
               Worker 4:  actual time=0.163..5.179 rows=15 loops=1
               Worker 5:  actual time=0.422..5.248 rows=15 loops=1
               Worker 6:  actual time=0.139..5.489 rows=16 loops=1
 Planning Time: 0.084 ms
 Execution Time: 38.880 ms

I had a quick look at the code and I started wondering if we can'tachieve the same performance improvement without batching by e.g.:

- Only set the latch if new data is written to an empty queue.Otherwise, the leader should anyways keep try reading from the queueswithout waiting for the latch, so no need to set the latch again.

- Reorganize struct shm_mq. There seems to be false sharing happeningbetween at least mq_ring_size and the atomics and potentially alsobetween the atomics. I'm wondering if the that's not the root cause ofthe "slow atomics" observed in [1]? I'm happy to do some profiling.

Alternatively, we could always set the latch if numberTuples inExecutePlan() is reasonably low. To do so, the DestReceiver's receive()method would only need an additional "force flush" argument.

A slightly different but related problem is when some workers havealready produced enough rows to answer the LIMIT query, but otherworkers are still running without producing any new rows. In that casethe "already done" workers will stop running even though they haven'treached 1/4th of the queue size, because the for-loop in execMain.cbails out in the following condition:


        if (numberTuples && numberTuples == current_tuple_count)
            break;

Subsequently, the leader will end the plan and then wait in the Gathernode for all workers to shutdown. However, workers still running but notproducing any new rows will never reach the following condition inexecMain.c to check if they're supposed to stop (the shared memory queuedest receiver will return false on detached queues):

/*

* If we are not able to send the tuple, we assume thedestination * has closed and no more tuples can be sent. If that's thecase,

             * end the loop.
             */
            if (!dest->receiveSlot(slot, dest))
                break;

Reproduction steps for this problem are below. Here the worker gettingthe first table page will be done right away, but the query takes aslong as it takes to scan all pages of the entire table.


CREATE TABLE bar (col INT);
INSERT INTO bar SELECT generate_series(1, 5000000);
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT col FROM bar WHERE col = 1 LIMIT 1;

Limit (cost=0.00..1.10 rows=1 width=4) (actual time=32.289..196.200rows=1 loops=1)

   Output: col

-> Gather (cost=0.00..30939.03 rows=28208 width=4) (actualtime=32.278..196.176 rows=1 loops=1)

         Output: col
         Workers Planned: 8
         Workers Launched: 7

-> Parallel Seq Scan on public.bar (cost=0.00..30939.03rows=3526 width=4) (actual time=137.251..137.255 rows=0 loops=7)

               Output: col
               Filter: (bar.col = 1)
               Rows Removed by Filter: 713769
               Worker 0:  actual time=160.177..160.181 rows=0 loops=1
               Worker 1:  actual time=160.111..160.115 rows=0 loops=1
               Worker 2:  actual time=0.043..0.047 rows=1 loops=1
               Worker 3:  actual time=160.040..160.044 rows=0 loops=1
               Worker 4:  actual time=160.167..160.171 rows=0 loops=1
               Worker 5:  actual time=160.018..160.022 rows=0 loops=1
               Worker 6:  actual time=160.201..160.205 rows=0 loops=1
 Planning Time: 0.087 ms
 Execution Time: 196.247 ms

We would need something similar to CHECK_FOR_INTERRUPTS() which returnsa NULL slot if a parallel worker is supposed to stop execution (we coulde.g. check if the queue got detached). Or could we amendCHECK_FOR_INTERRUPTS() to just stop the worker gracefully if the queuegot detached?


Jasper Smit, Spiros Agathos and Dimos Stamatakis helped working on this.

[1]https://www.postgresql.org/message-id/flat/CAFiTN-tVXqn_OG7tHNeSkBbN%2BiiCZTiQ83uakax43y1sQb2OBA%40mail.gmail.com


--
David Geier
(ServiceNow)

Performance issues with parallelism and LIMIT

Reply via email to