Hi,

A while back I did some benchmarking on a big 4-socket machine to explore the outer limits of parallel aggregates. I discovered along the way that, given enough workers and a simple enough task, seq-scan workers were held up waiting for the lock to be released in heap_parallelscan_nextpage().
I've since done a little work in this area to try to improve things, and posted about it yesterday in [1]. My original patch used batching to solve the issue: instead of allocating one block at a time, it allocated a range of 10 blocks for the worker to process. However, that implementation still needed a bit of work around reporting sync-scan locations.

Andres mentioned in [2] that it might be worth exploring using atomics to do the same job, so I went ahead and did that, and came up with the attached, which is a slight variation on what he mentioned in the thread.

To keep things simpler and more streamlined, I ended up pulling the logic for setting the start block out into another function, which we call only once, before the first call to heap_parallelscan_nextpage(). I also removed phs_cblock and replaced it with a counter that always starts at zero; the actual block is calculated as that counter plus the start block, modulo nblocks. This makes it a good bit simpler to detect when we've allocated all the blocks to the workers, and also works nicely when wrapping back to the start of the relation after starting somewhere in the middle due to piggybacking on a synchronous scan. (A rough sketch of the idea is included at the end of this mail.)

Performance: with parallel_workers=71, it looks something like:

Query 1: 881 GB, ~6 billion row TPC-H lineitem table.

tpch=# select count(*) from lineitem;
   count
------------
 5999989709
(1 row)

-- Master
Time: 123421.283 ms (02:03.421)
Time: 118895.846 ms (01:58.896)
Time: 118632.546 ms (01:58.633)

-- Atomics patch
Time: 74038.813 ms (01:14.039)
Time: 73166.200 ms (01:13.166)
Time: 72492.338 ms (01:12.492)

-- Batching patch: batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 76364.215 ms (01:16.364)
Time: 75808.900 ms (01:15.809)
Time: 74927.756 ms (01:14.928)

Query 2: Single int column table with 2 billion rows.

tpch=# select count(*) from a;
   count
------------
 2000000000
(1 row)

-- Master
Time: 5853.918 ms (00:05.854)
Time: 5925.633 ms (00:05.926)
Time: 5859.223 ms (00:05.859)

-- Atomics patch
Time: 5825.745 ms (00:05.826)
Time: 5849.139 ms (00:05.849)
Time: 5815.818 ms (00:05.816)

-- Batching patch: batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 5789.237 ms (00:05.789)
Time: 5837.395 ms (00:05.837)
Time: 5821.492 ms (00:05.821)

I've also attached a text file with the perf report for the lineitem query. You'll notice that heap_parallelscan_nextpage() is very visible in master's profile, but not in either of the two patched versions. With the second query, heap_parallelscan_nextpage() is fairly insignificant even in master's profile, showing up at only 0.48%; that's likely because more tuples are read from each page, so more aggregation work gets done before the next page is needed. I'm uncertain why I previously saw a speedup for this case in [1].

I've also noticed that both the atomics patch and unpatched master do something that looks a bit weird with synchronous seq-scans: if the parallel seq-scan piggybacked on another scan, then subsequent parallel scans will start at the same non-zero block location, even when no other concurrent scans exist. I'd have expected this to go back to block 0 again, but maybe I'm just failing to understand the reason for reporting the start block to ss_report_location() at the end of the scan.

I'll now add this to the first commitfest of pg11. I just wanted to note that I've done this, so that it's less likely someone else goes and repeats the same work.
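For anyone skimming, here's a minimal sketch of the atomics idea described above. It's not the patch itself: the phs_nallocated field and the helper function name are just illustrative, assuming the counter lives alongside the existing phs_nblocks and phs_startblock in the ParallelHeapScanDesc.

/*
 * Minimal sketch of the atomics approach; not the actual patch.
 * phs_nallocated is an assumed pg_atomic_uint64 field in
 * ParallelHeapScanDescData, and the function name is made up.
 */
static BlockNumber
heap_parallelscan_claimblock(ParallelHeapScanDesc parallel_scan)
{
	uint64		nallocated;

	/* Atomically claim the next zero-based block number; no spinlock. */
	nallocated = pg_atomic_fetch_add_u64(&parallel_scan->phs_nallocated, 1);

	/* Have all of the relation's blocks been handed out already? */
	if (nallocated >= parallel_scan->phs_nblocks)
		return InvalidBlockNumber;

	/*
	 * Map the zero-based counter onto a physical block number, wrapping
	 * past the end of the heap so that a scan which piggybacked on a
	 * synchronized scan can start somewhere in the middle of the relation.
	 */
	return (BlockNumber) ((nallocated + parallel_scan->phs_startblock) %
						  parallel_scan->phs_nblocks);
}

Since the counter only ever increments, workers never claim the same block twice, and "all blocks allocated" falls out of the single >= phs_nblocks test rather than needing wraparound bookkeeping on the current block.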
[1] https://www.postgresql.org/message-id/CAKJS1f-XhfQ2-%3D85wgYo5b3WtEs%3Dys%3D2Rsq%3DNuvnmaV4ZsM1XQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20170505023646.3uhnmf2hbwtm63lc%40alap3.anarazel.de

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
-- Unpatched select count(*) from lineitem; 71 workers
Children Self Command Shared Object Symbol
+ 99.83% 0.00% postgres libpthread-2.17.so [.] __restore_rt
+ 99.83% 0.00% postgres postgres [.] sigusr1_handler
+ 99.83% 0.00% postgres postgres [.] maybe_start_bgworkers
+ 99.83% 0.00% postgres postgres [.] do_start_bgworker
+ 99.83% 0.93% postgres postgres [.] ExecProcNode
+ 99.83% 0.00% postgres postgres [.] StartBackgroundWorker
+ 99.83% 0.00% postgres postgres [.] ParallelWorkerMain
+ 99.83% 0.00% postgres postgres [.] ParallelQueryMain
+ 99.83% 0.00% postgres postgres [.] ExecutorRun
+ 99.83% 0.00% postgres postgres [.] standard_ExecutorRun
+ 99.83% 0.00% postgres postgres [.] ExecutePlan
+ 99.83% 0.00% postgres postgres [.] ExecAgg
+ 99.69% 0.62% postgres postgres [.] agg_retrieve_direct
+ 91.28% 0.63% postgres postgres [.] fetch_input_tuple
+ 89.75% 0.17% postgres postgres [.] ExecSeqScan
+ 89.41% 1.07% postgres postgres [.] ExecScan
+ 84.46% 0.24% postgres postgres [.] ExecScanFetch
+ 84.26% 0.54% postgres postgres [.] SeqNext
+ 83.05% 0.44% postgres postgres [.] heap_getnext
+ 82.61% 2.00% postgres postgres [.] heapgettup_pagemode
+ 57.81% 0.84% postgres postgres [.] heap_parallelscan_nextpage
+ 53.46% 43.38% postgres postgres [.] s_lock
+ 21.82% 1.24% postgres postgres [.] heapgetpage
+ 20.69% 1.16% postgres postgres [.] ReadBufferExtended
+ 19.52% 0.04% postgres postgres [.] ReadBuffer_common
+ 13.75% 0.00% postgres [kernel.kallsyms] [k] system_call_fastpath
+ 11.54% 0.08% postgres postgres [.] BufferAlloc
+ 8.21% 8.16% postgres postgres [.] tas
+ 7.71% 0.03% postgres postgres [.] smgrread
+ 7.68% 0.02% postgres postgres [.] mdread
+ 7.22% 1.84% postgres postgres [.] advance_aggregates
+ 7.21% 0.03% postgres libpthread-2.17.so [.] __read_nocancel
+ 7.05% 0.02% postgres [kernel.kallsyms] [k] sys_read
+ 7.01% 0.02% postgres [kernel.kallsyms] [k] vfs_read
+ 6.88% 0.04% postgres [kernel.kallsyms] [k] do_sync_read
+ 6.85% 0.08% postgres [xfs] [k] xfs_file_aio_read
+ 6.39% 0.01% postgres [kernel.kallsyms] [k] sys_futex
+ 6.39% 0.01% postgres [kernel.kallsyms] [k] do_futex
+ 6.27% 0.07% postgres [kernel.kallsyms] [k] generic_file_aio_read
+ 5.82% 0.11% postgres [kernel.kallsyms] [k] get_futex_key
+ 5.15% 0.07% postgres postgres [.] LWLockAcquire
+ 4.82% 4.82% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 4.52% 4.52% postgres [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+ 3.92% 0.09% postgres libpthread-2.17.so [.] sem_wait
+ 3.80% 0.02% postgres [kernel.kallsyms] [k] futex_wait
+ 3.61% 0.02% postgres [kernel.kallsyms] [k] futex_wait_setup
+ 3.34% 1.49% postgres postgres [.] ExecProject
+ 3.31% 3.29% postgres postgres [.] tas
+ 3.27% 0.10% postgres postgres [.] LWLockRelease
+ 3.04% 0.01% postgres [kernel.kallsyms] [k] __lock_page
+ 3.02% 0.09% postgres [kernel.kallsyms] [k] __wait_on_bit_lock
+ 3.01% 1.09% postgres postgres [.] ExecProject
+ 2.79% 0.06% postgres postgres [.] LWLockWakeup
+ 2.67% 0.06% postgres libpthread-2.17.so [.] sem_post
+ 2.58% 0.04% postgres [kernel.kallsyms] [k] futex_wake
+ 2.35% 0.02% postgres [kernel.kallsyms] [k] unlock_page
+ 2.31% 0.05% postgres [kernel.kallsyms] [k] __wake_up_bit
+ 2.26% 0.01% postgres [kernel.kallsyms] [k] __wake_up
+ 2.25% 0.09% postgres [kernel.kallsyms] [k] prepare_to_wait_exclusive
+ 2.23% 1.48% postgres postgres [.] hash_search_with_hash_value
+ 2.01% 1.01% postgres postgres [.] advance_transition_function
+ 1.85% 1.83% postgres postgres [.] ExecClearTuple
+ 1.28% 0.98% postgres postgres [.] perform_spin_delay
+ 1.14% 0.10% postgres postgres [.] LWLockAttemptLock
-- Atomics patch select count(*) from lineitem; 71 workers
+ 99.58% 0.00% postgres postgres [.] StartBackgroundWorker
+ 99.58% 0.00% postgres postgres [.] ParallelWorkerMain
+ 99.58% 2.22% postgres postgres [.] ExecProcNode
+ 99.58% 0.00% postgres postgres [.] ParallelQueryMain
+ 99.58% 0.00% postgres postgres [.] ExecutorRun
+ 99.58% 0.00% postgres postgres [.] standard_ExecutorRun
+ 99.58% 0.00% postgres postgres [.] ExecutePlan
+ 99.58% 0.00% postgres postgres [.] ExecAgg
+ 99.33% 1.26% postgres postgres [.] agg_retrieve_direct
+ 82.33% 1.41% postgres postgres [.] fetch_input_tuple
+ 78.65% 0.37% postgres postgres [.] ExecSeqScan
+ 77.94% 1.99% postgres postgres [.] ExecScan
+ 69.07% 0.47% postgres postgres [.] ExecScanFetch
+ 68.65% 1.10% postgres postgres [.] SeqNext
+ 66.15% 0.89% postgres postgres [.] heap_getnext
+ 65.28% 4.12% postgres postgres [.] heapgettup_pagemode
+ 60.01% 1.99% postgres postgres [.] heapgetpage
+ 56.58% 0.07% postgres postgres [.] ReadBufferExtended
+ 56.49% 0.11% postgres postgres [.] ReadBuffer_common
+ 44.37% 0.00% postgres [kernel.kallsyms] [k] system_call_fastpath
+ 41.13% 0.18% postgres postgres [.] BufferAlloc
+ 30.34% 0.02% postgres [kernel.kallsyms] [k] sys_futex
+ 30.31% 0.02% postgres [kernel.kallsyms] [k] do_futex
+ 28.19% 0.50% postgres [kernel.kallsyms] [k] get_futex_key
+ 21.93% 21.93% postgres [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+ 21.42% 0.15% postgres postgres [.] LWLockAcquire
+ 18.82% 0.33% postgres libpthread-2.17.so [.] sem_wait
+ 18.36% 0.08% postgres [kernel.kallsyms] [k] futex_wait
+ 17.67% 0.05% postgres [kernel.kallsyms] [k] futex_wait_setup
+ 15.32% 0.05% postgres [kernel.kallsyms] [k] __lock_page
+ 15.24% 0.44% postgres [kernel.kallsyms] [k] __wait_on_bit_lock
+ 14.76% 0.06% postgres postgres [.] smgrread
+ 14.70% 0.07% postgres postgres [.] mdread
+ 14.63% 3.64% postgres postgres [.] advance_aggregates
+ 13.69% 0.05% postgres libpthread-2.17.so [.] __read_nocancel
+ 13.60% 0.22% postgres postgres [.] LWLockRelease
+ 13.38% 0.05% postgres [kernel.kallsyms] [k] sys_read
+ 13.27% 0.06% postgres [kernel.kallsyms] [k] vfs_read
+ 13.01% 0.08% postgres [kernel.kallsyms] [k] do_sync_read
+ 12.94% 0.13% postgres [xfs] [k] xfs_file_aio_read
+ 12.68% 0.21% postgres postgres [.] LWLockWakeup
+ 12.27% 0.21% postgres libpthread-2.17.so [.] sem_post
+ 11.93% 0.14% postgres [kernel.kallsyms] [k] futex_wake
+ 11.87% 0.13% postgres [kernel.kallsyms] [k] generic_file_aio_read
+ 11.53% 0.48% postgres [kernel.kallsyms] [k] prepare_to_wait_exclusive
+ 11.17% 0.07% postgres [kernel.kallsyms] [k] unlock_page
+ 11.06% 0.38% postgres [kernel.kallsyms] [k] __wake_up_bit
+ 10.66% 0.02% postgres [kernel.kallsyms] [k] __wake_up
+ 9.02% 9.02% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 6.13% 2.10% postgres postgres [.] ExecProject
+ 5.87% 2.14% postgres postgres [.] ExecProject
+ 4.41% 2.95% postgres postgres [.] hash_search_with_hash_value
+ 4.16% 2.17% postgres postgres [.] advance_transition_function
+ 3.59% 3.56% postgres postgres [.] ExecClearTuple
+ 2.38% 0.01% postgres [kernel.kallsyms] [k] finish_wait
+ 2.33% 0.78% postgres postgres [.] ExecEvalExprSwitchContext
+ 2.16% 0.21% postgres postgres [.] LWLockAttemptLock
+ 2.03% 2.01% postgres postgres [.] ExecInterpExpr
+ 1.95% 0.64% postgres postgres [.] ExecEvalExprSwitchContext
+ 1.83% 1.82% postgres postgres [.] MemoryContextReset
+ 1.76% 0.02% postgres postgres [.] BufTableLookup
+ 1.75% 0.91% postgres [kernel.kallsyms] [k] __wake_up_common
+ 1.73% 1.72% postgres postgres [.] int8inc
+ 1.64% 1.01% postgres [kernel.kallsyms] [k] __find_get_page
-- More details on that spin lock.
- 21.93% 21.93% postgres [kernel.kallsyms] [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 48.37% prepare_to_wait_exclusive
           __wait_on_bit_lock
           __lock_page
         + get_futex_key
      + 39.15% __wake_up
      + 10.37% finish_wait
      + 1.18% pagevec_lru_move_fn
      + 0.77% try_to_wake_up
-- Batch patch, 10 pages at a time, 71 workers.
+ 99.45% 1.91% postgres postgres [.] ExecProcNode
+ 99.45% 0.00% postgres postgres [.] ParallelWorkerMain
+ 99.45% 0.00% postgres postgres [.] ParallelQueryMain
+ 99.45% 0.00% postgres postgres [.] ExecutorRun
+ 99.45% 0.00% postgres postgres [.] standard_ExecutorRun
+ 99.45% 0.00% postgres postgres [.] ExecutePlan
+ 99.45% 0.00% postgres postgres [.] ExecAgg
+ 99.10% 1.18% postgres postgres [.] agg_retrieve_direct
+ 81.94% 1.42% postgres postgres [.] fetch_input_tuple
+ 78.62% 0.35% postgres postgres [.] ExecSeqScan
+ 77.88% 2.02% postgres postgres [.] ExecScan
+ 69.05% 0.53% postgres postgres [.] ExecScanFetch
+ 68.55% 1.17% postgres postgres [.] SeqNext
+ 66.03% 0.93% postgres postgres [.] heap_getnext
+ 65.10% 4.06% postgres postgres [.] heapgettup_pagemode
+ 60.60% 2.02% postgres postgres [.] heapgetpage
+ 57.07% 0.09% postgres postgres [.] ReadBufferExtended
+ 56.96% 0.10% postgres postgres [.] ReadBuffer_common
+ 45.67% 0.01% postgres [kernel.kallsyms] [k] system_call_fastpath
+ 42.58% 0.19% postgres postgres [.] BufferAlloc
+ 32.18% 0.02% postgres [kernel.kallsyms] [k] sys_futex
+ 32.16% 0.02% postgres [kernel.kallsyms] [k] do_futex
+ 30.04% 0.28% postgres [kernel.kallsyms] [k] get_futex_key
+ 23.99% 23.99% postgres [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+ 22.88% 0.15% postgres postgres [.] LWLockAcquire
+ 20.13% 0.30% postgres libpthread-2.17.so [.] sem_wait
+ 19.71% 0.08% postgres [kernel.kallsyms] [k] futex_wait
+ 19.05% 0.05% postgres [kernel.kallsyms] [k] futex_wait_setup
+ 15.98% 0.06% postgres [kernel.kallsyms] [k] __lock_page
+ 15.90% 0.44% postgres [kernel.kallsyms] [k] __wait_on_bit_lock
+ 14.70% 3.71% postgres postgres [.] advance_aggregates
+ 13.94% 0.21% postgres postgres [.] LWLockRelease
+ 13.84% 0.09% postgres postgres [.] smgrread
+ 13.74% 0.05% postgres postgres [.] mdread
+ 13.42% 0.06% postgres libpthread-2.17.so [.] __read_nocancel
+ 13.09% 0.19% postgres postgres [.] LWLockWakeup
+ 12.96% 0.05% postgres [kernel.kallsyms] [k] sys_read
+ 12.77% 0.08% postgres [kernel.kallsyms] [k] vfs_read
+ 12.71% 0.17% postgres libpthread-2.17.so [.] sem_post
+ 12.50% 0.08% postgres [kernel.kallsyms] [k] do_sync_read
+ 12.44% 0.15% postgres [xfs] [k] xfs_file_aio_read
+ 12.42% 0.15% postgres [kernel.kallsyms] [k] futex_wake
+ 12.42% 0.10% postgres [kernel.kallsyms] [k] unlock_page
+ 12.29% 0.42% postgres [kernel.kallsyms] [k] __wake_up_bit
+ 12.28% 0.48% postgres [kernel.kallsyms] [k] prepare_to_wait_exclusive
+ 11.85% 0.02% postgres [kernel.kallsyms] [k] __wake_up
+ 11.31% 0.14% postgres [kernel.kallsyms] [k] generic_file_aio_read
+ 8.83% 8.83% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 6.25% 1.77% postgres postgres [.] ExecProject
+ 5.87% 2.12% postgres postgres [.] ExecProject
+ 3.96% 2.63% postgres postgres [.] hash_search_with_hash_value
+ 3.88% 2.61% postgres postgres [.] advance_transition_function
+ 3.53% 3.51% postgres postgres [.] ExecClearTuple
+ 2.81% 2.79% postgres postgres [.] ExecInterpExpr
+ 2.79% 0.69% postgres postgres [.] ExecEvalExprSwitchContext
+ 2.24% 0.02% postgres [kernel.kallsyms] [k] finish_wait
+ 2.16% 0.22% postgres postgres [.] LWLockAttemptLock
+ 2.07% 2.06% postgres postgres [.] MemoryContextReset
+ 2.02% 0.63% postgres postgres [.] ExecEvalExprSwitchContext
+ 1.59% 0.80% postgres [kernel.kallsyms] [k] __wake_up_common
+ 1.55% 1.54% postgres postgres [.] pg_atomic_read_u32_impl
+ 1.52% 0.04% postgres postgres [.] pg_atomic_read_u32
+ 1.51% 0.02% postgres postgres [.] BufTableLookup
+ 1.45% 0.07% postgres postgres [.] LockBufHdr
Attachment: parallel_nextpage_atomics.patch
