On 2017-11-05 01:05:59 +0100, Robert Haas wrote: > skip-gather-project-v1.patch does what it says on the tin. I still > don't have a test case for this, and I didn't find that it helped very > much, but it would probably help more in a test case with more > columns, and you said this looked like a big bottleneck in your > testing, so here you go.
The query where that showed a big benefit was SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 1000000000 LIMIT 1; (i.e a not very selective filter, and then just throwing the results away) still shows quite massive benefits: before: set parallel_setup_cost=0;set parallel_tuple_cost=0;set min_parallel_table_scan_size=0;set max_parallel_workers_per_gather=8; tpch_5[17938][1]=# explain analyze SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 1000000000 LIMIT 1; ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │ QUERY PLAN ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │ Limit (cost=635802.67..635802.69 rows=1 width=127) (actual time=8675.097..8675.097 rows=0 loops=1) │ -> Gather (cost=0.00..635802.67 rows=27003243 width=127) (actual time=0.289..7904.849 rows=26989780 loops=1) │ Workers Planned: 8 │ Workers Launched: 7 │ -> Parallel Seq Scan on lineitem (cost=0.00..635802.67 rows=3375405 width=127) (actual time=0.124..528.667 rows=3373722 loops=8) │ Filter: (l_suppkey > 5012) │ Rows Removed by Filter: 376252 │ Planning time: 0.098 ms │ Execution time: 8676.125 ms └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── (9 rows) after: tpch_5[19754][1]=# EXPLAIN ANALYZE SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 1000000000 LIMIT 1; ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │ QUERY PLAN ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │ Limit (cost=635802.67..635802.69 rows=1 width=127) (actual time=5984.916..5984.916 rows=0 loops=1) │ -> Gather (cost=0.00..635802.67 rows=27003243 width=127) (actual time=0.214..5123.238 rows=26989780 loops=1) │ Workers Planned: 8 │ Workers Launched: 7 │ -> Parallel Seq Scan on lineitem (cost=0.00..635802.67 rows=3375405 width=127) (actual time=0.025..649.887 rows=3373722 loops=8) │ Filter: (l_suppkey > 5012) │ Rows Removed by Filter: 376252 │ Planning time: 0.076 ms │ Execution time: 5986.171 ms └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── (9 rows) so there clearly is still benefit (this is scale 100, but that shouldn't make much of a difference). Did not review the code. > shm-mq-reduce-receiver-latch-set-v1.patch causes the receiver to only > consume input from the shared queue when the amount of unconsumed > input exceeds 1/4 of the queue size. This caused a large performance > improvement in my testing because it causes the number of times the > latch gets set to drop dramatically. I experimented a bit with > thresholds of 1/8 and 1/2 before setting on 1/4; 1/4 seems to be > enough to capture most of the benefit. Hm. Is consuming the relevant part, or notifying the sender about it? I suspect most of the benefit can be captured by updating bytes read (and similarly on the other side w/ bytes written), but not setting the latch unless thresholds are reached. The advantage of updating the value, even without notifying the other side, is that in the common case that the other side gets around to checking the queue without having blocked, it'll see the updated value. If that works, that'd address the issue that we might wait unnecessarily in a number of common cases. Did not review the code. > remove-memory-leak-protection-v1.patch removes the memory leak > protection that Tom installed upon discovering that the original > version of tqueue.c leaked memory like crazy. I think that it > shouldn't do that any more, courtesy of > 6b65a7fe62e129d5c2b85cd74d6a91d8f7564608. Assuming that's correct, we > can avoid a whole lot of tuple copying in Gather Merge and a much more > modest amount of overhead in Gather. Yup, that conceptually makes sense. Did not review the code. > Even with all of these patches applied, there's clearly still room for > more optimization, but MacOS's "sample" profiler seems to show that > the bottlenecks are largely shifting elsewhere: > > Sort by top of stack, same collapsed (when >= 5): > slot_getattr (in postgres) 706 > slot_deform_tuple (in postgres) 560 > ExecAgg (in postgres) 378 > ExecInterpExpr (in postgres) 372 > AllocSetAlloc (in postgres) 319 > _platform_memmove$VARIANT$Haswell (in > libsystem_platform.dylib) 314 > read (in libsystem_kernel.dylib) 303 > heap_compare_slots (in postgres) 296 > combine_aggregates (in postgres) 273 > shm_mq_receive_bytes (in postgres) 272 Interesting. Here it's + 8.79% postgres postgres [.] ExecAgg + 6.52% postgres postgres [.] slot_deform_tuple + 5.65% postgres postgres [.] slot_getattr + 4.59% postgres postgres [.] shm_mq_send_bytes + 3.66% postgres postgres [.] ExecInterpExpr + 3.44% postgres postgres [.] AllocSetAlloc + 3.08% postgres postgres [.] heap_fill_tuple + 2.34% postgres postgres [.] heap_getnext + 2.25% postgres postgres [.] finalize_aggregates + 2.08% postgres libc-2.24.so [.] __memmove_avx_unaligned_erms + 2.05% postgres postgres [.] heap_compare_slots + 1.99% postgres postgres [.] execTuplesMatch + 1.83% postgres postgres [.] ExecStoreTuple + 1.83% postgres postgres [.] shm_mq_receive + 1.81% postgres postgres [.] ExecScan > I'm probably not super-excited about spending too much more time > trying to make the _platform_memmove time (only 20% or so of which > seems to be due to the shm_mq stuff) or the shm_mq_receive_bytes time > go down until, say, somebody JIT's slot_getattr and slot_deform_tuple. > :-) Hm, let's say somebody were working on something like that. In that case the benefits for this precise plan wouldn't yet be that big because a good chunk of slot_getattr calls come from execTuplesMatch() which doesn't really provide enough context to do JITing (when used for hashaggs, there is more so it's JITed). Similarly gather merge's heap_compare_slots() doesn't provide such context. It's about ~9% currently, largely due to the faster aggregate invocation. But the big benefit here would be all the deforming and the comparisons... - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers