While investigating a performance issue, I found that it was extremely difficult to get a parallel plan in some cases due to the fixed parallel_tuple_cost. But this cost is not really fixed - it's going to be larger for larger tuples. So this proposal adjusts the cost used according to how large we expect the results to be. The result is that in the common case where, say, you're getting a group id and some aggregates, a parallel plan is more likely to be chosen. By contrast, queries that generate very wide results will be less likely to choose parallel plans. The formula chosen does have a fixed cost piece built into it, which accounts for the shm_mq_sendv() and shm_mq_receive() synchronization that occurs regardless of width.

The patch itself is pretty simple.

Also attached is a benchmark report that I had claude create. Its main result shows a speedup of about 2.7x.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
From 4c944d4e7f4bbae7bdccd1949074e414a7b56b2e Mon Sep 17 00:00:00 2001
From: Andrew Dunstan <[email protected]>
Date: Mon, 30 Mar 2026 08:00:23 -0400
Subject: Scale parallel_tuple_cost by tuple width at Gather nodes

The parallel_tuple_cost GUC applies a flat per-tuple penalty to all
Gather and Gather Merge nodes, regardless of how wide or narrow the
tuples passing through the shared-memory queue actually are.  This
overcharges for narrow tuples (such as partial aggregate results with
a few integer columns) and undercharges for wide tuples.

The physical cost of the tuple queue is dominated by memcpy, which is
proportional to tuple width.  Introduce a width-based scaling factor
so that parallel_tuple_cost represents the cost at a reference width
of 100 bytes, with a 10% fixed floor for irreducible per-tuple queue
synchronization overhead.

For a Gather passing 12-byte partial aggregate tuples, the effective
per-tuple cost drops from 0.1 to ~0.02, which lets the planner choose
parallel plans for aggregation-heavy queries.

Tuples at the reference width (100 bytes) cost the same as before.
---
 src/backend/optimizer/path/costsize.c | 20 ++++++++++++--------
 src/backend/optimizer/plan/planner.c  |  4 +++-
 src/include/optimizer/cost.h          | 24 ++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1c575e56ff6..695cded910a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,7 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
- *	parallel_tuple_cost Cost of CPU time to pass a tuple from worker to leader backend
+ *	parallel_tuple_cost Cost of CPU time to pass a tuple from worker to leader
+ *						backend.  Scaled by tuple width relative to a reference
+ *						width (see width_adjusted_parallel_tuple_cost).
  *	parallel_setup_cost Cost of setting up shared memory for parallelism
  *
  * We expect that the kernel will typically do some amount of read-ahead
@@ -446,9 +448,10 @@ cost_gather(GatherPath *path, PlannerInfo *root,

 	run_cost = path->subpath->total_cost - path->subpath->startup_cost;

-	/* Parallel setup and communication cost. */
+	/* Parallel setup and communication cost, scaled by tuple width. */
 	startup_cost += parallel_setup_cost;
-	run_cost += parallel_tuple_cost * path->path.rows;
+	run_cost += width_adjusted_parallel_tuple_cost(path->path.pathtarget->width) *
+		path->path.rows;

 	path->path.disabled_nodes = path->subpath->disabled_nodes
 		+ ((rel->pgs_mask & PGS_GATHER) != 0 ? 0 : 1);
@@ -509,13 +512,14 @@ cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
 	run_cost += cpu_operator_cost * path->path.rows;

 	/*
-	 * Parallel setup and communication cost.  Since Gather Merge, unlike
-	 * Gather, requires us to block until a tuple is available from every
-	 * worker, we bump the IPC cost up a little bit as compared with Gather.
-	 * For lack of a better idea, charge an extra 5%.
+	 * Parallel setup and communication cost, scaled by tuple width.  Since
+	 * Gather Merge, unlike Gather, requires us to block until a tuple is
+	 * available from every worker, we bump the IPC cost up a little bit as
+	 * compared with Gather.  For lack of a better idea, charge an extra 5%.
 	 */
 	startup_cost += parallel_setup_cost;
-	run_cost += parallel_tuple_cost * path->path.rows * 1.05;
+	run_cost += width_adjusted_parallel_tuple_cost(path->path.pathtarget->width) *
+		path->path.rows * 1.05;

 	path->path.disabled_nodes = path->subpath->disabled_nodes
 		+ ((rel->pgs_mask & PGS_GATHER_MERGE) != 0 ? 0 : 1);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d19800ad6a5..88d03ecfb4d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -580,7 +580,9 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 		gather->plan.startup_cost = top_plan->startup_cost +
 			parallel_setup_cost;
 		gather->plan.total_cost = top_plan->total_cost +
-			parallel_setup_cost + parallel_tuple_cost * top_plan->plan_rows;
+			parallel_setup_cost +
+			width_adjusted_parallel_tuple_cost(top_plan->plan_width) *
+			top_plan->plan_rows;
 		gather->plan.plan_rows = top_plan->plan_rows;
 		gather->plan.plan_width = top_plan->plan_width;
 		gather->plan.parallel_aware = false;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index f2fd5d31507..d7997779b3e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -175,6 +175,30 @@ extern void initial_cost_hashjoin(PlannerInfo *root,
 extern void final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 								JoinCostWorkspace *workspace,
 								JoinPathExtraData *extra);
+
+/*
+ * Width-adjusted parallel tuple cost.
+ *
+ * The cost of passing a tuple through the shared-memory tuple queue has a
+ * fixed component (queue synchronization, slot operations) and a variable
+ * component proportional to tuple width (memcpy into/out of the ring buffer).
+ * parallel_tuple_cost is calibrated for PARALLEL_TUPLE_COST_REF_WIDTH bytes;
+ * we scale proportionally so narrow tuples (e.g. partial aggregate results)
+ * are cheaper and wide tuples are more expensive.
+ *
+ * PARALLEL_TUPLE_COST_FIXED_FRAC is the irreducible per-tuple overhead
+ * (queue synchronization) as a fraction of the total cost at the reference
+ * width.
+ */
+#define PARALLEL_TUPLE_COST_REF_WIDTH	100 /* bytes */
+#define PARALLEL_TUPLE_COST_FIXED_FRAC	0.10	/* fixed overhead fraction */
+
+#define width_adjusted_parallel_tuple_cost(width) \
+	(parallel_tuple_cost * \
+	 (PARALLEL_TUPLE_COST_FIXED_FRAC + \
+	  (1.0 - PARALLEL_TUPLE_COST_FIXED_FRAC) * \
+	  (double) Max((width), 1) / PARALLEL_TUPLE_COST_REF_WIDTH))
+
 extern void cost_gather(GatherPath *path, PlannerInfo *root,
 						RelOptInfo *rel, ParamPathInfo *param_info, double *rows);
 extern void cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
--
2.43.0
Benchmark: Width-Adjusted parallel_tuple_cost
==============================================
Timestamp: 2026-03-30T12:19:00Z
Patch: 0001-Scale-parallel_tuple_cost-by-tuple-width-at-Gather-n.patch
Base commit: 01d58d7e3ff (PostgreSQL 19devel)

Hardware
--------
Architecture: aarch64 (ARM64)
CPUs:         6
RAM:          11 GB
Disk:         SSD

PostgreSQL Configuration
------------------------
shared_buffers:                  2GB
work_mem:                        32MB
max_parallel_workers_per_gather: 4
max_parallel_workers:            8
parallel_tuple_cost:             0.1 (default)
io_method:                       sync
Build flags:                     -Dcassert=false -Db_ndebug=true 
-Dbuildtype=debugoptimized


Overview
--------
The parallel_tuple_cost GUC applies a flat per-tuple penalty to Gather
and Gather Merge nodes regardless of how wide the tuples are.  For
queries where partial aggregate results pass through the tuple queue,
these tuples are typically very narrow (8-52 bytes), but are charged the
same 0.1/tuple as wide rows.  This overcharges narrow-tuple Gathers and
can cause the planner to reject parallel plans that are 2-3x faster.

The patch scales parallel_tuple_cost by tuple width relative to a
100-byte reference, with a 10% fixed floor for irreducible queue
synchronization overhead:

    effective_cost = parallel_tuple_cost *
        (0.10 + 0.90 * max(width, 1) / 100)

Width 12 bytes  -> factor 0.208  -> effective cost 0.021/tuple
Width 52 bytes  -> factor 0.568  -> effective cost 0.057/tuple
Width 100 bytes -> factor 1.000  -> effective cost 0.100/tuple (unchanged)
Width 148 bytes -> factor 1.432  -> effective cost 0.143/tuple


Benchmark 1: Narrow-Output Aggregate (Plan Flip: Serial -> Parallel)
--------------------------------------------------------------------

Table setup:

    CREATE TABLE bench_wide AS
    SELECT
      i AS id,
      (i % 5000000) AS group_id,
      random() * 1000 AS val1,
      random() * 1000 AS val2,
      repeat('x', 200) AS padding
    FROM generate_series(1, 50000000) i;
    VACUUM ANALYZE bench_wide;

  50M rows, 12 GB on disk.
  5 columns: id int4 (4 bytes), group_id int4 (4 bytes),
             val1 float8 (8 bytes), val2 float8 (8 bytes),
             padding text (avg 204 bytes).
  5M distinct group_id values (10 rows per group).
  Source rows are wide (avg ~228 bytes) but the aggregate output is
  narrow: group_id + 3 aggregate accumulators = width 52 at Gather.

Query:

    SELECT group_id, count(*), sum(val1), avg(val2)
    FROM bench_wide
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 5M groups, this produces ~22.5M partial aggregate
rows (width 52) through Gather Merge.  The Gather Merge cost
contribution is the decisive factor:

  Unpatched: 0.1 * 22.5M         = 2,250,000
  Patched:   0.1 * 0.568 * 22.5M = 1,278,000  (43% less)

Results:

  UNPATCHED — planner chooses serial despite parallel being available:

    Limit  (cost=6734041..6734041 rows=10 width=28)
      ->  Sort  (cost=6734041..6748078 rows=5614666 width=28)
            ->  HashAggregate  (cost=5956600..6612710 rows=5614666 width=28)
                  Group Key: group_id
                  Planned Partitions: 32
                  ->  Seq Scan on bench_wide  (cost=0..2112919 rows=49999100 
width=20)

    Execution times: 30783ms, 27386ms, 24555ms, 24826ms
    Median: ~26s

  PATCHED — planner now correctly chooses parallel:

    Limit  (cost=5753518..5753518 rows=10 width=28)
      ->  Sort  (cost=5753518..5767555 rows=5614666 width=28)
            ->  Finalize GroupAggregate  (cost=3468705..5632187 rows=5614666 
width=28)
                  ->  Gather Merge  (cost=3468705..5337417 rows=22458664 
width=52)
                        Workers Planned: 4
                        ->  Partial GroupAggregate  (cost=3467705..3680099 
rows=5614666 width=52)
                              ->  Sort  (cost=3467705..3498955 rows=12499775 
width=20)
                                    ->  Parallel Seq Scan on bench_wide  
(cost=0..1737926 rows=12499775 width=20)

    Execution times: 9197ms, 9675ms, 9056ms, 11078ms
    Median: ~9.4s

  Verification — unpatched binary, parallel forced with 
parallel_tuple_cost=0.001:

    Same parallel plan structure as patched.
    Execution times: 9023ms, 9646ms, 9548ms, 11196ms
    Median: ~9.6s

  Summary:
    Unpatched (serial, planner's choice):   ~26s
    Patched   (parallel, planner's choice):  ~9.4s
    Speedup: 2.7x

    The parallel plan is genuinely faster.  The unpatched planner refused
    to pick it because the flat 0.1/tuple * 22.5M rows = 2.25M Gather
    cost made the parallel total (5.75M) appear close to the serial total
    (6.73M), and the serial plan avoided the Finalize GroupAggregate
    overhead.  With width adjustment, the Gather cost drops to 1.28M,
    making the parallel plan clearly cheaper.


Benchmark 2: Wide-Output Aggregate (No Regression)
---------------------------------------------------

Table setup:

    CREATE TABLE bench_narrow AS
    SELECT
      i AS id,
      (i % 500000) AS group_id,
      (random() * 1000)::numeric(10,2) AS val1,
      (random() * 1000)::numeric(10,2) AS val2,
      (random() * 1000)::numeric(10,2) AS val3,
      (random() * 1000)::numeric(10,2) AS val4,
      (random() * 1000)::numeric(10,2) AS val5,
      (random() * 1000)::numeric(10,2) AS val6,
      (random() * 1000)::numeric(10,2) AS val7,
      (random() * 1000)::numeric(10,2) AS val8
    FROM generate_series(1, 20000000) i;
    VACUUM ANALYZE bench_narrow;

  20M rows, 1776 MB on disk.
  10 columns: id int4 (4 bytes), group_id int4 (4 bytes),
              val1..val8 numeric(10,2) (avg 6 bytes each).
  500K distinct group_id values (40 rows per group).
  Source rows are narrow (avg ~52 bytes) but the aggregate output is
  wide: group_id + count + 4 sums + 4 avgs (each avg expands to
  sum + count internally) = width 268 at Gather Merge.

Query:

    SELECT group_id,
           count(*), sum(val1), sum(val2), sum(val3), sum(val4),
           avg(val5), avg(val6), avg(val7), avg(val8)
    FROM bench_narrow
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 500K groups, this produces ~2M partial aggregate
rows (width 268) through Gather Merge.  The width-adjusted cost
correctly charges MORE for these wide tuples:

  Unpatched Gather Merge: cost 1,372,038  (0.1 * 2.08M = 208K contribution)
  Patched   Gather Merge: cost 1,702,787  (0.1 * 2.412 * 2.08M = 502K 
contribution)

Both patched and unpatched choose the same parallel plan.

Results:

  UNPATCHED:

    Limit  (cost=1492668..1492668 rows=10 width=268)
      ->  Sort  (cost=1492668..1493970 rows=520832 width=268)
            ->  Finalize GroupAggregate  (cost=1122591..1481413 rows=520832 
width=268)
                  ->  Gather Merge  (cost=1122591..1372038 rows=2083328 
width=268)
                        Workers Planned: 4
                        ->  Sort  (cost=1121591..1122893 rows=520832 width=268)
                              ->  Partial HashAggregate  (cost=892982..1006267 
rows=520832 width=268)
                                    ->  Parallel Seq Scan on bench_narrow  
(cost=0..277330 rows=5000216 width=52)

    Execution times: 6525ms, 6575ms, 6447ms
    Median: ~6.5s

  PATCHED:

    Limit  (cost=1823417..1823417 rows=10 width=268)
      ->  Sort  (cost=1823417..1824719 rows=520832 width=268)
            ->  Finalize GroupAggregate  (cost=1122591..1812162 rows=520832 
width=268)
                  ->  Gather Merge  (cost=1122591..1702787 rows=2083328 
width=268)
                        Workers Planned: 4
                        ->  Sort  (cost=1121591..1122893 rows=520832 width=268)
                              ->  Partial HashAggregate  (cost=892982..1006267 
rows=520832 width=268)
                                    ->  Parallel Seq Scan on bench_narrow  
(cost=0..277330 rows=5000216 width=52)

    Execution times: 6784ms, 6869ms, 7047ms
    Median: ~6.9s

  Summary:
    Same plan on both.  Patched estimated cost is 22% higher (1.82M vs
    1.49M) because it correctly charges 2.41x the base rate for width-268
    tuples.  Execution times are within noise — the higher cost estimate
    does not cause a regression to serial.

Reply via email to