incremental sort vs. gather paths
This code introduced in ba3e76cc571eba3dea19c9465ff15ac3ac186576 looks
wrong to me:

    total_groups = cheapest_partial_path->rows *
        cheapest_partial_path->parallel_workers;
    path = (Path *) create_gather_merge_path(root, ordered_rel,
                                             path,
                                             path->pathtarget,
                                             root->sort_pathkeys, NULL,
                                             &total_groups);

This too:

    total_groups = input_path->rows * input_path->parallel_workers;

This came to my attention because Dave Page sent me a query plan that
looks like this:

 Gather Merge  (cost=22617.94..22703.35 rows=732 width=97)
               (actual time=2561.476..2561.856 rows=879 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=21617.92..21618.83 rows=366 width=97)
             (actual time=928.329..928.378 rows=293 loops=3)
         Sort Key: aid
         Sort Method: quicksort  Memory: 155kB
         Worker 0:  Sort Method: quicksort  Memory: 25kB
         Worker 1:  Sort Method: quicksort  Memory: 25kB
         ->  Parallel Seq Scan on accounts_1m
                   (cost=0.00..21602.33 rows=366 width=97)
                   (actual time=74.391..74.518 rows=293 loops=3)
               Filter: (aid < 1000)
               Rows Removed by Filter: 333040

If you look at the actual row counts, you see that the Gather Merge
produces 3 times the number of rows that the Parallel Seq Scan
produces, which is completely correct, because the raw number is 879
in both cases, but EXPLAIN unhelpfully divides the displayed row count
by the loop count, which in this case is 3. If you look at the
estimated row counts, you see that the Gather Merge is estimated to
produce exactly 2 times the number of rows that the nodes under it
would produce. That's not a very good estimate, unless
parallel_leader_participation = off, which in this case it isn't.

What's happening here is that the number of rows produced by
accounts_1m is actually 879, and it is estimated as 879.
get_parallel_divisor() decides to divide that number by 2.4, and gets
366, because it thinks the leader will do less work than the other
workers.
Then the code above fires and, instead of letting the original row
count estimate for accounts_1m apply to the Gather Merge path, it
overrides it with 2 * 366 = 732. This cannot be right. Really, I don't
think it should be overriding the row count estimate at all. But if it
is, multiplying by the number of workers can't be right, because the
leader can also do stuff.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: incremental sort vs. gather paths
On 12/16/21 17:48, Robert Haas wrote:
> This code introduced in ba3e76cc571eba3dea19c9465ff15ac3ac186576 looks
> wrong to me:
>
>     total_groups = cheapest_partial_path->rows *
>         cheapest_partial_path->parallel_workers;
>     path = (Path *) create_gather_merge_path(root, ordered_rel,
>                                              path,
>                                              path->pathtarget,
>                                              root->sort_pathkeys, NULL,
>                                              &total_groups);
>
> This too:
>
>     total_groups = input_path->rows * input_path->parallel_workers;
>
> This came to my attention because Dave Page sent me a query plan that
> looks like this:
>
>  Gather Merge  (cost=22617.94..22703.35 rows=732 width=97)
>                (actual time=2561.476..2561.856 rows=879 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Sort  (cost=21617.92..21618.83 rows=366 width=97)
>              (actual time=928.329..928.378 rows=293 loops=3)
>          Sort Key: aid
>          Sort Method: quicksort  Memory: 155kB
>          Worker 0:  Sort Method: quicksort  Memory: 25kB
>          Worker 1:  Sort Method: quicksort  Memory: 25kB
>          ->  Parallel Seq Scan on accounts_1m
>                    (cost=0.00..21602.33 rows=366 width=97)
>                    (actual time=74.391..74.518 rows=293 loops=3)
>                Filter: (aid < 1000)
>                Rows Removed by Filter: 333040
>
> If you look at the actual row counts, you see that the Gather Merge
> produces 3 times the number of rows that the Parallel Seq Scan
> produces, which is completely correct, because the raw number is 879
> in both cases, but EXPLAIN unhelpfully divides the displayed row
> count by the loop count, which in this case is 3. If you look at the
> estimated row counts, you see that the Gather Merge is estimated to
> produce exactly 2 times the number of rows that the nodes under it
> would produce. That's not a very good estimate, unless
> parallel_leader_participation = off, which in this case it isn't.
>
> What's happening here is that the number of rows produced by
> accounts_1m is actually 879, and it is estimated as 879.
> get_parallel_divisor() decides to divide that number by 2.4, and gets
> 366, because it thinks the leader will do less work than the other
> workers.
> Then the code above fires and, instead of letting the original row
> count estimate for accounts_1m apply to the Gather Merge path, it
> overrides it with 2 * 366 = 732. This cannot be right. Really, I
> don't think it should be overriding the row count estimate at all.
> But if it is, multiplying by the number of workers can't be right,
> because the leader can also do stuff.

Maybe, but other places (predating incremental sort) creating Gather
Merge do the same thing, and commit ba3e76cc57 merely copied this. For
example generate_gather_paths() does this:

    foreach(lc, rel->partial_pathlist)
    {
        Path       *subpath = (Path *) lfirst(lc);
        GatherMergePath *path;

        if (subpath->pathkeys == NIL)
            continue;

        rows = subpath->rows * subpath->parallel_workers;
        path = create_gather_merge_path(root, rel, subpath,
                                        rel->reltarget,
                                        subpath->pathkeys, NULL,
                                        rowsp);
        add_path(rel, &path->path);
    }

i.e. it's doing the same (rows * parallel_workers) calculation. It may
not be the right way to estimate this, of course. But I'd say if
ba3e76cc57 is doing it wrong, so are the older places.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: incremental sort vs. gather paths
On Thu, Dec 16, 2021 at 12:16 PM Tomas Vondra wrote:
> Maybe, but other places (predating incremental sort) creating Gather
> Merge do the same thing, and commit ba3e76cc57 merely copied this. For
> example generate_gather_paths() does this:
>
>     foreach(lc, rel->partial_pathlist)
>     {
>         Path       *subpath = (Path *) lfirst(lc);
>         GatherMergePath *path;
>
>         if (subpath->pathkeys == NIL)
>             continue;
>
>         rows = subpath->rows * subpath->parallel_workers;
>         path = create_gather_merge_path(root, rel, subpath,
>                                         rel->reltarget,
>                                         subpath->pathkeys, NULL,
>                                         rowsp);
>         add_path(rel, &path->path);
>     }
>
> i.e. it's doing the same (rows * parallel_workers) calculation.

Ugh. I was hoping this mess wasn't my fault, but it seems that it is. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com