HI > This is a very interesting observation. In fact, with the original v5 > patch, you can produce both plans by setting enable_hashagg on and > off.
> set enable_hashagg to on; > Incremental Sort (cost=91.95..277.59 rows=2500 width=16) > (actual time=31.960..147.040 rows=90000.00 loops=1) > > set enable_hashagg to off; > Merge Join (cost=70.14..294.34 rows=2500 width=16) > (actual time=4.303..71.891 rows=90000.00 loops=1) > > This seems to be another case where the planner chooses a suboptimal > plan due to inaccurate cost estimates. Agree ,I increased some rows , set enable_hashagg to on and off ,There's no difference in execution time. The execution plan is the same. > #define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \ > ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI && \ > bms_equal((sjinfo)->syn_righthand, (rel)->relids)) > >... and then the check in final_cost_hashjoin() becomes: > > if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo, > path->jpath.jointype)) > { > innerbucketsize = 1.0 / virtualbuckets; > innermcvfreq = 0.0; > } > >Would this be a better approach? Any thoughts? This approach does indeed more accurately capture the fact that the relation has been unique-ified, especially in cases where a semi join has been transformed into an inner join. Compared to the current heuristic checks in costsize.c that rely on inner_path->rows, this method is more semantically meaningful and robust. On Thu, Jul 31, 2025 at 4:58 PM Richard Guo <guofengli...@gmail.com> wrote: > On Thu, Jul 31, 2025 at 1:08 PM Richard Guo <guofengli...@gmail.com> > wrote: > > On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang > > <alexandra.wang....@gmail.com> wrote: > > > While looking at the code, I also had a question about the following > > > changes in costsize.c: > > > > > > --- a/src/backend/optimizer/path/costsize.c > > > +++ b/src/backend/optimizer/path/costsize.c > > > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, > MergePath *path, > > > * The whole issue is moot if we are working from a unique-ified outer > > > * input, or if we know we don't need to mark/restore at all. > > > */ > > > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore) > > > + if (IsA(outer_path, UniquePath) || > > > + IsA(outer_path, AggPath) || > > > + path->skip_mark_restore) > > > > > > and > > > > > > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath > *path, > > > * because we avoid contaminating the cache with a value that's wrong > for > > > * non-unique-ified paths. > > > */ > > > - if (IsA(inner_path, UniquePath)) > > > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath)) > > > > > > I'm curious why AggPath was added in these two cases. > > > Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have > > some special cases when the inner or outer relation has been > > unique-ified. Previously, it was sufficient to check whether the path > > was a UniquePath, since both hash-based and sort-based implementations > > were represented that way. However, with this patch, UniquePath now > > only represents the sort-based implementation, so we also need to > > check for AggPath to account for the hash-based case. > > BTW, maybe a better way to determine whether a relation has been > unique-ified is to check that the nominal jointype is JOIN_INNER and > sjinfo->jointype is JOIN_SEMI, and the relation is exactly the RHS of > the semijoin. This approach is mentioned in a comment in joinpath.c: > > * Path cost estimation code may need to recognize that it's > * dealing with such a case --- the combination of nominal jointype INNER > * with sjinfo->jointype == JOIN_SEMI indicates that. > > ... but it seems we don't currently apply it in costsize.c. > > To be concrete, I'm imagining a check like the following: > > #define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \ > ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI > && \ > bms_equal((sjinfo)->syn_righthand, (rel)->relids)) > > ... and then the check in final_cost_hashjoin() becomes: > > if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo, > path->jpath.jointype)) > { > innerbucketsize = 1.0 / virtualbuckets; > innermcvfreq = 0.0; > } > > Would this be a better approach? Any thoughts? > > Thanks > Richard >