On 3/11/2023 23:43, Tomas Vondra wrote:
> On 9/11/23 10:04, Lepikhov Andrei wrote:
>>   * Determine bucketsize fraction and MCV frequency for the inner
>>   * relation. We use the smallest bucketsize or MCV frequency estimated
>>   * for any individual hashclause; this is undoubtedly conservative.
>
> I'm sure this may lead to inflated cost for "good" cases (where the
> actual bucket size really is a product), which may push the optimizer to
> use the less efficient/slower join method.

Yes, it was a contradictory idea, though.

> IMHO the only principled way forward is to get a better ndistinct
> estimate (which this implicitly does), perhaps by using extended
> statistics. I haven't tried, but I guess it'd need to extract the
> clauses for the inner side, and call estimate_num_groups() on it.

And I've done it. Sorry for the long response time. This patch employs
extended statistics to estimate the HashJoin bucket size. In addition, I
describe the idea in a more convenient form here [1]. Note that it needs
only the ndistinct statistics kind to make a prediction, which keeps the
computational cost of this statistics low.

[1] https://open.substack.com/pub/danolivo/p/why-postgresql-prefers-mergejoin?r=34q1yy&utm_campaign=post&utm_medium=web
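
To make the effect concrete, here is a toy, standalone C sketch (illustrative
only - the ndistinct values nd_a, nd_b and nd_ab below are made-up
assumptions, not taken from any real table) of how the current per-clause
minimum can overestimate the bucket size fraction compared to a multivariate
ndistinct:

/* toy_bucketsize.c - illustration only, not PostgreSQL code */
#include <stdio.h>

int
main(void)
{
	double		nd_a = 100.0;	/* assumed ndistinct of column a */
	double		nd_b = 10.0;	/* assumed ndistinct of column b */
	double		nd_ab = 500.0;	/* assumed multivariate ndistinct on (a, b) */

	/* Current logic: smallest bucket size among the individual hash clauses */
	double		bucket_per_clause = 1.0 / (nd_a > nd_b ? nd_a : nd_b);

	/* With an extended ndistinct statistic on (a, b) */
	double		bucket_multivariate = 1.0 / nd_ab;

	printf("per-clause minimum: %.4f\n", bucket_per_clause);	/* 0.0100 */
	printf("multivariate:       %.4f\n", bucket_multivariate);	/* 0.0020 */
	return 0;
}

With correlated columns the true fraction lies somewhere between
1/(nd_a * nd_b) and 1/Max(nd_a, nd_b); an extended ndistinct statistic lets
the estimator land near the real value instead of the conservative upper
bound.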

--
regards, Andrei Lepikhov
From b4b007970b1d9b99602b8422f2122c8e5738828e Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepi...@gmail.com>
Date: Mon, 13 May 2024 16:15:02 +0700
Subject: [PATCH] Use extended statistics for precise estimation of bucket size
 in hash join.

In real life, columns in a table often have functional dependencies, so
PostgreSQL's estimate of the number of distinct values over a set of columns
can be underestimated (or, much more rarely, overestimated) when dealing with
a multi-clause JOIN.
In the case of hash join, this can end up with a small number of predicted
hash buckets and, as a result, in picking a non-optimal merge join.

To improve the situation, introduce one additional stage of bucket size
estimation: with two or more join clauses, the estimator looks up extended
statistics and uses them for a multicolumn estimate.
Clauses are grouped into lists, each containing expressions referencing the
same relation. The result of the multicolumn estimation made over such a list
is combined with the others according to the caller's logic.
Clauses that are not estimated are returned to the caller for further
estimation.
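
For illustration, here is a tiny standalone sketch of that combination rule
(the numbers are assumptions; in the planner the values come from the
extended statistics and from the existing per-clause path):

/*
 * combination_sketch.c - illustration only, not planner code.  Extended
 * statistics give one bucket size fraction for the clauses they cover;
 * clauses returned as "not estimated" keep the old per-clause treatment,
 * and the caller keeps the smallest fraction.
 */
#include <stdio.h>

int
main(void)
{
	/* multicolumn estimate for clauses on (a, b): assumed ndistinct = 5000 */
	double		innerbucketsize = 1.0 / 5000.0;

	/* a clause on column c was returned unestimated: assumed ndistinct = 50 */
	double		thisbucketsize = 1.0 / 50.0;

	/* keep the smallest bucket size fraction, as the caller already does */
	if (innerbucketsize > thisbucketsize)
		innerbucketsize = thisbucketsize;

	printf("final innerbucketsize = %g\n", innerbucketsize);	/* 0.0002 */
	return 0;
}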
---
 src/backend/optimizer/path/costsize.c |  12 +-
 src/backend/utils/adt/selfuncs.c      | 171 ++++++++++++++++++++++++++
 src/include/utils/selfuncs.h          |   4 +
 3 files changed, 186 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ee23ed7835..eafde90d4c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4250,9 +4250,19 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
        }
        else
        {
+               List *otherclauses;
+
                innerbucketsize = 1.0;
                innermcvfreq = 1.0;
-               foreach(hcl, hashclauses)
+
+               /* First, try to estimate the bucket size using extended statistics. */
+               otherclauses = estimate_multivariate_bucketsize(root,
+                                                               inner_path->parent,
+                                                               hashclauses,
+                                                               &innerbucketsize);
+
+               /* Pass through the remaining clauses */
+               foreach(hcl, otherclauses)
                {
                        RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
                        Selectivity thisbucketsize;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..89ed2d136a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3751,6 +3751,177 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
        return numdistinct;
 }
 
+/*
+ * Try to estimate the bucket size of the hash join inner side when the join
+ * condition contains two or more clauses, by employing extended statistics.
+ *
+ * The main idea of this approach is that the ndistinct value computed by
+ * multivariate estimation over two or more columns yields a smaller bucket
+ * size than an estimate based on any single column.
+ *
+ * IMPORTANT: the way different estimates are combined here must stay in sync
+ * with the caller's logic.
+ * Returns the list of clauses for which no extended statistics were applied.
+ */
+List *
+estimate_multivariate_bucketsize(PlannerInfo *root, RelOptInfo *inner,
+                                 List *hashclauses,
+                                 Selectivity *innerbucketsize)
+{
+       List   *clauses = list_copy(hashclauses);
+       List   *otherclauses = NIL;
+       double  ndistinct = 1.0;
+
+       if (list_length(hashclauses) <= 1)
+               /*
+                * Nothing to do for a single clause. Could we employ univariate
+                * extended stat here?
+                */
+               return hashclauses;
+
+       while (clauses != NIL)
+       {
+               ListCell   *lc;
+               int                     relid = -1;
+               List       *varinfos = NIL;
+               List       *saveList = NIL;
+               double          mvndistinct;
+               List       *origin_varinfos;
+               int                     group_relid = -1;
+               RelOptInfo *group_rel = NULL;
+               ListCell   *lc1, *lc2;
+
+               /*
+                * Find clauses referencing the same single base relation and try to
+                * estimate such a group with extended statistics.  Create a varinfo
+                * for each approved clause, push the clause to otherclauses if it
+                * can't be estimated here, or leave it for the next iteration.
+                */
+               foreach(lc, clauses)
+               {
+                       RestrictInfo   *rinfo = lfirst_node(RestrictInfo, lc);
+                       Node               *expr;
+                       Relids                  relids;
+                       GroupVarInfo   *varinfo;
+
+                       /*
+                        * Find the inner side of the join, for which we need to estimate
+                        * the number of buckets.  Use outer_is_left because
+                        * clause_sides_match_join has already been called on the hash clauses.
+                        */
+                       relids = rinfo->outer_is_left ?
+                               rinfo->right_relids : rinfo->left_relids;
+                       expr = rinfo->outer_is_left ?
+                               get_rightop(rinfo->clause) : get_leftop(rinfo->clause);
+
+                       if (bms_get_singleton_member(relids, &relid) &&
+                               root->simple_rel_array[relid]->statlist != NIL)
+                       {
+                               /*
+                                * This inner-side expression references only one relation.
+                                * Extended statistics for this clause may exist.
+                                */
+
+                               if (group_relid < 0)
+                               {
+                                       RangeTblEntry *rte = root->simple_rte_array[relid];
+
+                                       if (!rte || (rte->relkind != RELKIND_RELATION &&
+                                               rte->relkind != RELKIND_MATVIEW &&
+                                               rte->relkind != RELKIND_FOREIGN_TABLE &&
+                                               rte->relkind != RELKIND_PARTITIONED_TABLE))
+                                       {
+                                               /* Extended statistics can't exist in principle */
+                                               otherclauses = lappend(otherclauses, rinfo);
+                                               clauses = foreach_delete_current(clauses, lc);
+                                               continue;
+                                       }
+
+                                       group_relid = relid;
+                                       group_rel = root->simple_rel_array[relid];
+                                       Assert(group_rel != NULL);
+                               }
+                               else if (group_relid != relid)
+                                       /*
+                                        * While forming a group, we do not need
+                                        * clauses from other relations.
+                                        */
+                                       continue;
+
+                               varinfo = (GroupVarInfo *) palloc(sizeof(GroupVarInfo));
+                               varinfo->var = expr;
+                               varinfo->rel = root->simple_rel_array[relid];
+                               varinfo->ndistinct = 0.0;
+                               varinfo->isdefault = false;
+                               varinfos = lappend(varinfos, varinfo);
+
+                               /*
+                                * Remember the RestrictInfo in case the clause
+                                * cannot be estimated here.
+                                */
+                               saveList = lappend(saveList, rinfo);
+                       }
+                       else
+                               /* This clause can't be estimated with extended statistics */
+                               otherclauses = lappend(otherclauses, rinfo);
+
+                       clauses = foreach_delete_current(clauses, lc);
+               }
+
+               if (list_length(varinfos) < 2)
+               {
+                       /*
+                        * Multivariate statistics don't apply to a single column,
+                        * except for expressions, but that is not implemented yet.
+                        */
+                       otherclauses = list_concat(otherclauses, saveList);
+                       list_free_deep(varinfos);
+                       list_free(saveList);
+                       continue;
+               }
+
+               /* Let's employ extended statistics. */
+               origin_varinfos = varinfos;
+               for (;;)
+               {
+                       bool    estimated = estimate_multivariate_ndistinct(root,
+                                                                            group_rel,
+                                                                            &varinfos,
+                                                                            &mvndistinct);
+                       if (!estimated)
+                               break;
+
+                       /*
+                        * We have an estimate.  Use the ndistinct value in a way that is
+                        * consistent with the caller's logic (see final_cost_hashjoin).
+                        */
+                       if (ndistinct < mvndistinct)
+                               ndistinct = mvndistinct;
+                       Assert(ndistinct >= 1.0);
+               }
+
+               Assert(list_length(origin_varinfos) == list_length(saveList));
+
+               /*
+                * Scan the saveList and return the clauses that were not estimated.
+                */
+               forboth(lc1, origin_varinfos, lc2, saveList)
+               {
+                       GroupVarInfo *vinfo = lfirst(lc1);
+
+                       if (!list_member_ptr(varinfos, vinfo))
+                               /* Already estimated */
+                               continue;
+
+                       /* Can't be estimated here - push it to the returned list */
+                       otherclauses = lappend(otherclauses, lfirst(lc2));
+               }
+       }
+
+       *innerbucketsize = 1.0 / ndistinct;
+       return otherclauses;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..ac62d2fa0e 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -217,6 +217,10 @@ extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
                                                                  double input_rows, List **pgset,
                                                                  EstimationInfo *estinfo);
 
+extern List *estimate_multivariate_bucketsize(PlannerInfo *root,
+                                              RelOptInfo *inner,
+                                              List *hashclauses,
+                                              Selectivity *innerbucketsize);
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
                                                                           Node *hashkey, double nbuckets,
                                                                           Selectivity *mcv_freq,
-- 
2.45.2
