On 2026-03-30 Mo 6:51 PM, David Rowley wrote:
On Tue, 31 Mar 2026 at 03:17, Tom Lane <[email protected]> wrote:Andrew Dunstan <[email protected]> writes:While investigating a performance issue, I found that it was extremely difficult to get a parallel plan in some cases due to the fixed parallel_tuple_cost. But this cost is not really fixed - it's going to be larger for larger tuples. So this proposal adjusts the cost used according to how large we expect the results to be.Unfortunately, I'm afraid that this is going to produce mostly "garbage in, garbage out" estimates, because our opinion of how wide tuples-in-flight are is pretty shaky. (See get_expr_width and particularly get_typavgwidth, and note that we only have good statistics-based numbers for plain Vars not function outputs.) I agree that it could be useful to have some kind of adjustment here, but I fear that linear scaling is putting way too much faith in the quality of the data.(I suspect you're saying this because of the "Benchmark 2" in the text file, which contains aggregates which return a varlena type, of which we won't estimate the width very well for...)Sure, it's certainly true that there are cases where we don't get the width estimate right, but that's not stopped us using them before. So why is this case so much more critical? We now also have GROUP BY before join abilities in the planner, which I suspect must also be putting trust into the very same thing. Also, varlena-returning Aggrefs aren't going to be the Gather/GatherMerge targetlist, so why avoid making improvements in this area because we're not great at one of the things that could be in the targetlist? For the patch and the analysis: This reminds me of [1], where some reverse-engineering of costs from query run-times was done, which ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To get that for this case would require various tests with different tuple widths and ensuring that the costs scale linearly with the run-time of the query with the patched version. Of course, the test query would have to have perfect width estimates, but that could be easy enough to do by populating a text typed GROUP BY column and populating that with all the same width of text for each of the tests before increasing the width for the next test, using a fixed-width aggregate each time, e.g count(*). The "#define PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round number. It would be good to know how close this is to reality. Ideally, it would be good to see results from an Apple M<something> chip and recent x86. In my experience, these produce very different performance results, so it might be nice to find a value that is somewhere in the middle of what we get from those machines. I suspect having the GROUP BY column with text widths from 8 to 1024, increasing in powers of two would be enough data points.
I followed your suggested methodology to measure how Gather IPC cost actually scales with tuple width. Setup: 10M rows, 100K distinct text values per table, text column padded to a fixed width with lpad(). Query: SELECT txt, count(*) FROM ptc_bench_W GROUP BY txt. This produces Partial HashAggregate in workers, then Gather passes ~240K partial-aggregate tuples whose width is dominated by the text column. 2 workers, work_mem=256MB, cassert=off, debugoptimized build, aarch64 Linux. I tested widths from 8 to 1024 bytes (10 data points). For each width, I ran 5 iterations of both parallel and serial execution, and computed the Gather overhead as: overhead = T_parallel - T_serial / 3 This isolates the IPC cost: the serial time captures pure scan + aggregate work, and dividing by 3 gives the ideal parallel time (2 workers + leader). The excess is Gather overhead. Results (microseconds per tuple through Gather, median of 5 runs): Width(B) us/tuple Implied ptc (if ptc=0.1 at w=100) -------- -------- ---------------------------------- 8 0.30 0.032 16 0.24 0.025 32 0.77 0.083 64 0.72 0.078 128 1.03 0.111 256 1.62 0.175 384 2.90 0.313 512 3.21 0.346 768 4.12 0.445 1024 5.56 0.600 The best-fit models: Linear: cost(w) = 0.42 + 0.0051 * w R² = 0.983 Power law: cost(w) = 0.061 * w^0.63 R² = 0.966 Linear fits best. One notable finding: at the proposed reference width of 100 bytes, the total predicted cost is 0.42 + 0.51 = 0.93 us/tuple, of which 0.42 is fixed. The original patch used PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10, which substantially underestimates the width-independent component. A higher fixed fraction would dampen the width adjustment, which also partly addresses Tom's concern about sensitivity to width estimate errors: with ~45% of the cost being fixed, even a 2x error in width only translates to a ~1.5x error in total cost. The script used to get the timings is attached. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
ptc_calibrate.sh
Description: application/shellscript
