As part of making tuplestores faster [1], I noticed that in WindowAgg, when we end one partition we call tuplestore_end() and then we do tuplestore_begin_heap() again for the next partition in begin_partition() and then go on to set up the tuplestore read pointers according to what's required for the given frameOptions of the WindowAgg. This might make sense if the frameOptions could change between partitions, but they can't, so I don't see any reason why we can't just do tuplestore_clear() at the end of a partition. That resets the read pointer positions back to the start again ready for the next partition.
I wrote the attached patch and checked how it affects performance. It helps quite a bit when there are lots of partitions. CREATE TABLE a (a INT NOT NULL); INSERT INTO a SELECT x FROM generate_series(1,1000000)x; VACUUM FREEZE ANALYZE a; bench.sql: SELECT a,count(*) OVER (PARTITION BY a) FROM a OFFSET 1000000; master: $ pgbench -n -f bench.sql -T 60 -M prepared postgres | grep latency latency average = 293.488 ms latency average = 295.509 ms latency average = 297.772 ms patched: $ pgbench -n -f bench.sql -T 60 -M prepared postgres | grep latency latency average = 203.234 ms latency average = 204.538 ms latency average = 203.877 ms About 45% faster. Here's the top of perf top of each: master: 8.61% libc.so.6 [.] _int_malloc 6.71% libc.so.6 [.] _int_free 6.42% postgres [.] heap_form_minimal_tuple 6.40% postgres [.] tuplestore_alloc_read_pointer 5.87% libc.so.6 [.] unlink_chunk.constprop.0 3.86% libc.so.6 [.] __memmove_avx512_unaligned_erms 3.80% postgres [.] AllocSetFree 3.51% postgres [.] spool_tuples 3.45% postgres [.] ExecWindowAgg 2.09% postgres [.] tuplesort_puttuple_common 1.92% postgres [.] comparetup_datum 1.88% postgres [.] AllocSetAlloc 1.85% postgres [.] tuplesort_heap_replace_top 1.84% postgres [.] ExecStoreBufferHeapTuple 1.69% postgres [.] window_gettupleslot patched: 8.95% postgres [.] ExecWindowAgg 8.10% postgres [.] heap_form_minimal_tuple 5.16% postgres [.] spool_tuples 4.03% libc.so.6 [.] __memmove_avx512_unaligned_erms 4.02% postgres [.] begin_partition 3.11% postgres [.] tuplesort_puttuple_common 2.97% postgres [.] AllocSetAlloc 2.96% postgres [.] comparetup_datum 2.83% postgres [.] tuplesort_heap_replace_top 2.79% postgres [.] window_gettupleslot 2.22% postgres [.] AllocSetFree 2.11% postgres [.] MemoryContextReset 1.99% postgres [.] LogicalTapeWrite 1.98% postgres [.] advance_windowaggregate Lots less malloc/free work going on. I'm also tempted to do a cleanup of the state machine inside nodeWindowAgg.c as I had to add another bool flag to WindowAggState to replace the previous test that checked if the tuplestore was empty (i.e just finished a partition) with if (winstate->buffer == NULL). I couldn't do that anymore due to no longer nullifying that field. I think such a cleanup would be a separate patch. Annoyingly the extra bool is the 9th bool flag and widens that struct by 8 bytes and leaves a 7 byte hole. David [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=590b045c37aad44915f7f472343f24c2bafbe5d8
v1-0001-Optimize-WindowAgg-s-use-of-tuplestores.patch
Description: Binary data