On Sat, 4 May 2024 at 06:55, Cary Huang <cary.hu...@highgo.ca> wrote: > With syncscan enabled, the "table_block_parallelscan_nextpage()" would > return the next block since the end of the first tid rangescan instead of the > correct start block that should be scanned. I see that single tid rangescan > does not have SO_ALLOW_SYNC set, so I figure syncscan should also be > disabled in parallel case. With this change, then it would be okay to call > heap_setscanlimits() in parallel case, so I added this call back to > heap_set_tidrange() in both serial and parallel cases.
This now calls heap_setscanlimits() for the parallel version, it's just that table_block_parallelscan_nextpage() does nothing to obey those limits. The only reason the code isn't reading the entire table is due to the optimisation in heap_getnextslot_tidrange() which returns false when the ctid goes out of range. i.e, this code: /* * When scanning forward, the TIDs will be in ascending order. * Future tuples in this direction will be higher still, so we can * just return false to indicate there will be no more tuples. */ if (ScanDirectionIsForward(direction)) return false; If I comment out that line, I see all pages are accessed: postgres=# explain (analyze, buffers) select count(*) from a where ctid >= '(0,1)' and ctid < '(11,0)'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=18.80..18.81 rows=1 width=8) (actual time=33.530..36.118 rows=1 loops=1) Buffers: shared read=4425 -> Gather (cost=18.78..18.79 rows=2 width=8) (actual time=33.456..36.102 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 Buffers: shared read=4425 -> Partial Aggregate (cost=18.78..18.79 rows=1 width=8) (actual time=20.389..20.390 rows=1 loops=3) Buffers: shared read=4425 -> Parallel Tid Range Scan on a (cost=0.01..16.19 rows=1035 width=0) (actual time=9.375..20.349 rows=829 loops=3) TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid)) Buffers: shared read=4425 <---- this is all pages in the table instead of 11 pages. With that code still commented out, the non-parallel version still won't read all pages due to the setscanlimits being obeyed. postgres=# set max_parallel_workers_per_gather=0; SET postgres=# explain (analyze, buffers) select count(*) from a where ctid >= '(0,1)' and ctid < '(11,0)'; QUERY PLAN -------------------------------------------------------------------------------------------------------------- Aggregate (cost=45.07..45.08 rows=1 width=8) (actual time=0.302..0.302 rows=1 loops=1) Buffers: shared hit=11 -> Tid Range Scan on a (cost=0.01..38.86 rows=2485 width=0) (actual time=0.019..0.188 rows=2486 loops=1) TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid)) Buffers: shared hit=11 If I put that code back in, how many pages are read depends on the number of parallel workers as workers will keep running with higher page numbers and heap_getnextslot_tidrange() will just (inefficiently) filter those out. max_parallel_workers_per_gather=2; -> Parallel Tid Range Scan on a (cost=0.01..16.19 rows=1035 width=0) (actual time=0.191..0.310 rows=829 loops=3) TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid)) Buffers: shared read=17 max_parallel_workers_per_gather=3; -> Parallel Tid Range Scan on a (cost=0.01..12.54 rows=802 width=0) (actual time=0.012..0.114 rows=622 loops=4) TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid)) Buffers: shared hit=19 max_parallel_workers_per_gather=4; -> Parallel Tid Range Scan on a (cost=0.01..9.72 rows=621 width=0) (actual time=0.014..0.135 rows=497 loops=5) TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid)) Buffers: shared hit=21 To fix this you need to make table_block_parallelscan_nextpage obey the limits imposed by heap_setscanlimits(). The equivalent code in the non-parallel version is in heapgettup_advance_block(). /* check if the limit imposed by heap_setscanlimits() is met */ if (scan->rs_numblocks != InvalidBlockNumber) { if (--scan->rs_numblocks == 0) return InvalidBlockNumber; } I've not studied exactly how you'd get the rs_numblock information down to the parallel scan descriptor. But when you figure that out, just remember that you can't do the --scan->rs_numblocks from table_block_parallelscan_nextpage() as that's not parallel safe. You might be able to add an or condition to: "if (nallocated >= pbscan->phs_nblocks)" to make it "if (nallocated >= pbscan->phs_nblocks || nallocated >= pbscan->phs_numblocks)", although the field names don't seem very intuitive there. It would be nicer if the HeapScanDesc field was called rs_blocklimit rather than rs_numblocks. It's not for this patch to go messing with that, however. David