Hi hackers,

While debugging some slow queries containing Bitmap Heap/Index Scans (in
short BHS / BIS), we observed a few issues regarding scalability:

   1. The BIS always only runs in a single process, also when the parent
   BHS is parallel. The first process arriving in the BHS serves as leader and
   executes the BIS.
   2. As long as execution is "exact" (TIDs are stored instead of page
   bits), the parallel BHS sorts all TIDs to ensure pages are accessed
   sequentially. The sort is also performed just by a single worker. Already
   with a few tens of thousands of pages to scan, the sort time can make up a
   significant portion of the total runtime. Large page counts and the need
   for parallelism are not uncommon for BHS, as one use case is closing the
   gap between index and sequential scans. The BHS costing seems to not
   account for that.
   3. The BHS does not scale well with an increasing number of parallel
   workers, even when accounting for the sequential parts of execution. A perf
   profile shows that the TID list / bitmap iteration code heavily contents on
   a mutex taken for every single TID / page bit (see
   LWLockAcquire(&istate->lock, LW_EXCLUSIVE) in tidbitmap.c:1067).
   4. The EXPLAIN ANALYZE statistics of the parallel BHS do not include the
   statistics of the parallel workers. For example the number of heap pages
   processed is what just the leader did. Similarly to other parallel plan
   nodes we should aggregate statistics across workers.

The EXPLAIN ANALYZE output below shows (1) to (3) happening in action for
different numbers of workers. I had to obfuscate the query slightly. The
difference between the startup time of the BHS and the BIS is the time it
takes to sort the TID list. The self time of the BHS is just the time spent
on processing the shared TID list and processing the pages. That part runs
in parallel but does not scale.

Workers | Total runtime | Startup time BIS | Startup time BHS | Self time
BHS (excl. sorting)
-------------------------------------------|------------------|------------------------------
2       | 15322 ms      | 3107 ms          | 5912 ms          | 9269 ms
4       | 13277 ms      | 3094 ms          | 5869 ms          | 7260 ms
8       | 14628 ms      | 3106 ms          | 5882 ms          | 8598 ms

None of this is really new and some of it is even documented. So, what I am
more wondering about is why things are the way they are and how hard it
would be to change them. I am especially curious about:

   - What stops us from extending the BIS to run in parallel? Parallel
   Bitmap Index Scans are also supported.
   - What about reducing the sort time by, e.g.
      - dividing TIDs across workers, ending up with N parallely sorted
      streams,
      - cooperatively sorting the TIDs with multiple workers using barriers
      for synchronization,
      - optimizing the PagetableEntry data structure for size and using a
      faster sorting algorithm like e.g. radix sort
      - a combination of the first three options
   - With separate TID lists per worker process the iteration problem would
   be solved. Otherwise, we could
      - optimize the iteration code and thereby minimize the duration of
      the critical section,
      - have worker processes acquire chunks of TIDs / page bits to reduce
      locking.

Is there interest in patches improving on the above mentioned shortcomings?
If so, which options do you deem best?

--
David Geier
(ServiceNow)



-- 2 workers

 Finalize Aggregate (actual time=15228.937..15321.356 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=15187.942..15321.345 rows=2 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual time=15181.486..15181.488 rows=1
loops=2)
               Output: PARTIAL count(*)
               Worker 0:  actual time=15181.364..15181.366 rows=1 loops=1
               Worker 1:  actual time=15181.608..15181.610 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on foo (actual
time=5912.731..15166.992 rows=269713 loops=2)
                     Filter: ...
                     Rows Removed by Filter: 4020149
                     Worker 0:  actual time=5912.498..15166.936 rows=269305
loops=1
                     Worker 1:  actual time=5912.963..15167.048 rows=270121
loops=1
                     ->  Bitmap Index Scan on foo_idx (actual
time=3107.947..3107.948 rows=8579724 loops=1)
                           Index Cond: -
                           Worker 1:  actual time=3107.947..3107.948
rows=8579724 loops=1
 Planning Time: 0.167 ms
 Execution Time: 15322.081 ms


-- 4 workers

 Finalize Aggregate (actual time=13175.765..13276.415 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=13137.981..13276.403 rows=4 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 4
         Workers Launched: 4
         ->  Partial Aggregate (actual time=13130.344..13130.346 rows=1
loops=4)
               Output: PARTIAL count(*)
               Worker 0:  actual time=13129.363..13129.365 rows=1 loops=1
               Worker 1:  actual time=13130.085..13130.087 rows=1 loops=1
               Worker 2:  actual time=13130.634..13130.635 rows=1 loops=1
               Worker 3:  actual time=13131.295..13131.298 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on foo (actual
time=5870.026..13120.579 rows=134856 loops=4)
                     Filter: ...
                     Rows Removed by Filter: 2010074
                     Worker 0:  actual time=5869.033..13120.453 rows=128270
loops=1
                     Worker 1:  actual time=5869.698..13118.811 rows=135333
loops=1
                     Worker 2:  actual time=5870.465..13121.189 rows=137695
loops=1
                     Worker 3:  actual time=5870.907..13121.864 rows=138128
loops=1
                     ->  Bitmap Index Scan on foo_idx (actual
time=3094.585..3094.586 rows=8579724 loops=1)
                           Index Cond: -
                           Worker 3:  actual time=3094.585..3094.586
rows=8579724 loops=1
 Planning Time: 0.146 ms
 Execution Time: 13277.315 ms

-- 8 workers

 Finalize Aggregate (actual time=14533.688..14627.962 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=14492.463..14627.950 rows=8 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 8
         Workers Launched: 8
         ->  Partial Aggregate (actual time=14483.059..14483.061 rows=1
loops=8)
               Output: PARTIAL count(*)
               Worker 0:  actual time=14480.058..14480.061 rows=1 loops=1
               Worker 1:  actual time=14480.948..14480.950 rows=1 loops=1
               Worker 2:  actual time=14481.668..14481.670 rows=1 loops=1
               Worker 3:  actual time=14482.829..14482.832 rows=1 loops=1
               Worker 4:  actual time=14483.695..14483.697 rows=1 loops=1
               Worker 5:  actual time=14484.290..14484.293 rows=1 loops=1
               Worker 6:  actual time=14485.166..14485.168 rows=1 loops=1
               Worker 7:  actual time=14485.819..14485.821 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on foo (actual
time=5886.191..14477.239 rows=67428 loops=8)
                     Filter: ...
                     Rows Removed by Filter: 1005037
                     Worker 0:  actual time=5882.909..14474.627 rows=60325
loops=1
                     Worker 1:  actual time=5883.788..14474.945 rows=69459
loops=1
                     Worker 2:  actual time=5884.475..14475.735 rows=69686
loops=1
                     Worker 3:  actual time=5886.149..14477.162 rows=64680
loops=1
                     Worker 4:  actual time=5886.987..14477.653 rows=71034
loops=1
                     Worker 5:  actual time=5887.347..14478.667 rows=65836
loops=1
                     Worker 6:  actual time=5888.978..14479.239 rows=67755
loops=1
                     Worker 7:  actual time=5888.896..14479.886 rows=70651
loops=1
                     ->  Bitmap Index Scan on foo_idx (actual
time=3106.840..3106.840 rows=8579724 loops=1)
                           Index Cond: -
                           Worker 7:  actual time=3106.840..3106.840
rows=8579724 loops=1
 Planning Time: 0.150 ms
 Execution Time: 14628.648 ms

Reply via email to