Re: MergeJoin beats HashJoin in the case of multiple hash clauses
On 9/11/23 10:04, Lepikhov Andrei wrote:
>
> On Mon, Sep 11, 2023, at 11:51 AM, Andy Fan wrote:
>> Hi,
>>
>> On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov wrote:
>>> Hi, all.
>>>
>>> Some of my clients use JOINs with three or four clauses. Quite
>>> frequently, I see complaints about an unreasonable switch of the JOIN
>>> algorithm to Merge Join instead of Hash Join. Quick research has shown
>>> one weak place - the estimation of the average bucket size in
>>> final_cost_hashjoin (see q2.sql in the attachment), which uses a very
>>> conservative strategy.
>>> Unlike the estimation of groups, here we use the smallest ndistinct
>>> value across all clauses instead of multiplying them (or trying to do
>>> multivariate analysis).
>>> It works fine for the case of one clause. But if we have many clauses,
>>> and each has a high ndistinct value, we will overestimate the average
>>> size of a bucket and, as a result, prefer Merge Join. As the example
>>> in the attachment shows, this leads to a worse plan than possible,
>>> sometimes drastically worse.
>>> I assume this is done out of fear of functional dependencies between
>>> hash clause components. But in my opinion, we should go the same way
>>> here as in the estimation of groups.

Yes, this analysis is correct - final_cost_hashjoin assumes the clauses
may be correlated (not necessarily by functional dependencies, just that
the overall ndistinct is not a simple product of per-column ndistincts).

And it even says so in the comment before calculating the bucket size:

     * Determine bucketsize fraction and MCV frequency for the inner
     * relation.  We use the smallest bucketsize or MCV frequency estimated
     * for any individual hashclause; this is undoubtedly conservative.

I'm sure this may lead to inflated costs for "good" cases (where the
actual bucket size really is a product), which may push the optimizer to
use the less efficient/slower join method.

Unfortunately, AFAICS the patch simply assumes the extreme in the
opposite direction - it assumes each clause splits the bucket for each
distinct value in the column. Which works great when it's true, but
surely it has issues when the columns are correlated?

I think this deserves more discussion, i.e. what happens if the
assumptions do not hold? We know what happens for the conservative
approach, but what's the worst thing that would happen for the
optimistic one?

I doubt we can simply switch from the conservative approach to the
optimistic one. Yes, it'll make some queries faster, but for other
queries it will likely cause problems and slowdowns.

IMHO the only principled way forward is to get a better ndistinct
estimate (which this implicitly does), perhaps by using extended
statistics. I haven't tried, but I guess it'd need to extract the
clauses for the inner side and call estimate_num_groups() on them.

This however reminds me that we don't use extended statistics for join
clauses at all. Which means that even with accurate extended statistics,
we can still get stuff like this for multiple join clauses:

  Hash Join  (cost=1317.00..2386.00 rows=200 width=24)
             (actual time=85.781..8574.784 rows=800 loops=1)

This is unrelated to the issue discussed here, of course, as it won't
affect join method selection for that join. But it certainly will affect
all estimates/costs above that join, which can be pretty disastrous.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
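To make the two extremes concrete, here is a small Python sketch (illustrative arithmetic only, not PostgreSQL internals) comparing the conservative combination of per-clause bucket-size fractions with the optimistic, independence-assuming one:

```python
# Illustrative arithmetic only (not PostgreSQL code): compare the
# conservative combination of per-clause bucket-size fractions with the
# optimistic one that assumes the clauses are independent.

def conservative_bucketsize(fractions):
    # current final_cost_hashjoin behaviour: use the largest fraction
    # (i.e. the smallest per-clause ndistinct) across all hash clauses
    return max(fractions)

def optimistic_bucketsize(fractions, nbuckets):
    # independence assumption: the fractions multiply; clamp from below,
    # since a bucket still receives its 1/nbuckets share on average
    product = 1.0
    for f in fractions:
        product *= f
    return max(product, 1.0 / nbuckets)

# three clauses, each on a column with ndistinct = 100
fractions = [0.01, 0.01, 0.01]
print(conservative_bucketsize(fractions))        # 0.01
print(optimistic_bucketsize(fractions, 131072))  # ~7.6e-06
```

If the columns really are independent, the conservative estimate inflates the expected bucket size by three orders of magnitude here; if they are fully dependent, the optimistic estimate is off by the same factor in the other direction, which is exactly the trade-off under discussion.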
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Does anyone else have an opinion on this patch?  It looks promising.

---

On Wed, Jun 28, 2023 at 04:53:06PM +0300, Alena Rybakina wrote:
> Hi!
>
> On 15.06.2023 11:30, Andrey Lepikhov wrote:
>> Hi, all.
>>
>> Some of my clients use JOINs with three or four clauses. Quite
>> frequently, I see complaints about an unreasonable switch of the JOIN
>> algorithm to Merge Join instead of Hash Join. Quick research has shown
>> one weak place - the estimation of the average bucket size in
>> final_cost_hashjoin (see q2.sql in the attachment), which uses a very
>> conservative strategy.
>> Unlike the estimation of groups, here we use the smallest ndistinct
>> value across all clauses instead of multiplying them (or trying to do
>> multivariate analysis).
>> It works fine for the case of one clause. But if we have many clauses,
>> and each has a high ndistinct value, we will overestimate the average
>> size of a bucket and, as a result, prefer Merge Join. As the example
>> in the attachment shows, this leads to a worse plan than possible,
>> sometimes drastically worse.
>> I assume this is done out of fear of functional dependencies between
>> hash clause components. But in my opinion, we should go the same way
>> here as in the estimation of groups.
>> The attached patch shows a sketch of the solution.
>
> This problem is very important.
>
> Honestly, I'm still learning your code and looking for cases where your
> patch can change things for the worse or for the better. But I have
> already found something that seemed interesting to me: several cases
> where your patch helps to choose a more correct plan, with a focus on
> memory consumption.
>
> To make it easier to evaluate, I added a hook to your patch that makes
> it easy to switch between your way and the original way of estimating
> the size of buckets (diff_estimate.diff).
>
> Here are other cases where your fix improves the query plan.
>
> First of all, I changed the way the tables are created to look at the
> behavior of the query plan in terms of planning and execution time:
>
> DROP TABLE IF EXISTS a,b CASCADE;
> CREATE TABLE a AS
>   SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
>   FROM generate_series(1,1e5) AS gs;
> CREATE TABLE b AS
>   SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
>   FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
>
> SET enable_cost_size = 'on';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
>                               QUERY PLAN
> ---
>  Hash Join (actual time=200.872..200.879 rows=0 loops=1)
>    Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z))
>    ->  Seq Scan on b (actual time=0.029..15.946 rows=10 loops=1)
>    ->  Hash (actual time=97.645..97.649 rows=10 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 5612kB
>          ->  Seq Scan on a (actual time=0.024..17.153 rows=10 loops=1)
>  Planning Time: 2.910 ms
>  Execution Time: 201.949 ms
> (8 rows)
>
> SET
>                               QUERY PLAN
> ---
>  Merge Join (actual time=687.415..687.416 rows=0 loops=1)
>    Merge Cond: ((b.y = a.y) AND (b.x = a.x) AND (b.z = a.z))
>    ->  Sort (actual time=462.022..536.716 rows=10 loops=1)
>          Sort Key: b.y, b.x, b.z
>          Sort Method: external merge  Disk: 3328kB
>          ->  Seq Scan on b (actual time=0.017..12.326 rows=10 loops=1)
>    ->  Sort (actual time=111.295..113.196 rows=16001 loops=1)
>          Sort Key: a.y, a.x, a.z
>          Sort Method: external sort  Disk: 2840kB
>          ->  Seq Scan on a (actual time=0.020..10.129 rows=10 loops=1)
>  Planning Time: 0.752 ms
>  Execution Time: 688.829 ms
> (12 rows)
>
> Secondly, I found another case that is not related to the planner
> preferring merge join over hash join, but where we can see that the
> plan has become better due to lower memory consumption, and it also
> takes less planning time.
>
> Here, with the same query, the planning time was reduced by 5 times and
> the number of buckets by 128 times, so memory consumption also
> decreased:
>
> DROP TABLE IF EXISTS a,b CASCADE;
>
> CREATE TABLE a AS
>   SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
>   FROM generate_series(1,600) AS gs;
> CREATE TABLE b AS
>   SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
>   FROM generate_series(1,1e5) AS gs;
> ANALYZ
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Hi,

On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov wrote:
> Hi, all.
>
> Some of my clients use JOINs with three or four clauses. Quite
> frequently, I see complaints about an unreasonable switch of the JOIN
> algorithm to Merge Join instead of Hash Join. Quick research has shown
> one weak place - the estimation of the average bucket size in
> final_cost_hashjoin (see q2.sql in the attachment), which uses a very
> conservative strategy.
> Unlike the estimation of groups, here we use the smallest ndistinct
> value across all clauses instead of multiplying them (or trying to do
> multivariate analysis).
> It works fine for the case of one clause. But if we have many clauses,
> and each has a high ndistinct value, we will overestimate the average
> size of a bucket and, as a result, prefer Merge Join. As the example
> in the attachment shows, this leads to a worse plan than possible,
> sometimes drastically worse.
> I assume this is done out of fear of functional dependencies between
> hash clause components. But in my opinion, we should go the same way
> here as in the estimation of groups.

I can reproduce the situation you want to improve and verify that the
patch handles it as expected. I think this is the right thing to do.

> The attached patch shows a sketch of the solution.

I understand that this is a sketch of the solution, but the changes
below still confuse me:

+ if (innerbucketsize > virtualbuckets)
+     innerbucketsize = 1.0 / virtualbuckets;

innerbucketsize is a fraction of rows among all the rows, so it is
between 0.0 and 1.0, while virtualbuckets is the total number of buckets
(taking multiple batches into account). How is it possible for
'innerbucketsize > virtualbuckets' to be true? Am I missing something?

--
Best Regards
Andy Fan
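For what it's worth, the guard presumably intended in that hunk is a lower bound on the fraction rather than the comparison as written. A minimal sketch of the assumed semantics (not the patch's actual code):

```python
# Sketch of the guard presumably intended in the patch (assumed semantics,
# not the patch's code). innerbucketsize is a fraction in (0.0, 1.0], while
# virtualbuckets is a count of buckets, so 'innerbucketsize > virtualbuckets'
# can never be true; a meaningful guard clamps the fraction from below.

def clamp_innerbucketsize(innerbucketsize, virtualbuckets):
    # no bucket can be expected to receive less than 1/virtualbuckets
    # of the inner rows on average
    return max(innerbucketsize, 1.0 / virtualbuckets)

print(clamp_innerbucketsize(1e-7, 131072))  # clamped up to 1/131072
print(clamp_innerbucketsize(0.25, 131072))  # left unchanged: 0.25
```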
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
On Mon, Sep 11, 2023, at 11:51 AM, Andy Fan wrote:
> Hi,
>
> On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov wrote:
>> Hi, all.
>>
>> Some of my clients use JOINs with three or four clauses. Quite
>> frequently, I see complaints about an unreasonable switch of the JOIN
>> algorithm to Merge Join instead of Hash Join. Quick research has shown
>> one weak place - the estimation of the average bucket size in
>> final_cost_hashjoin (see q2.sql in the attachment), which uses a very
>> conservative strategy.
>> Unlike the estimation of groups, here we use the smallest ndistinct
>> value across all clauses instead of multiplying them (or trying to do
>> multivariate analysis).
>> It works fine for the case of one clause. But if we have many clauses,
>> and each has a high ndistinct value, we will overestimate the average
>> size of a bucket and, as a result, prefer Merge Join. As the example
>> in the attachment shows, this leads to a worse plan than possible,
>> sometimes drastically worse.
>> I assume this is done out of fear of functional dependencies between
>> hash clause components. But in my opinion, we should go the same way
>> here as in the estimation of groups.
>
> I can reproduce the situation you want to improve and verify that the
> patch handles it as expected. I think this is the right thing to do.
>
>> The attached patch shows a sketch of the solution.
>
> I understand that this is a sketch of the solution, but the changes
> below still confuse me:
>
> + if (innerbucketsize > virtualbuckets)
> +     innerbucketsize = 1.0 / virtualbuckets;
>
> innerbucketsize is a fraction of rows among all the rows, so it is
> between 0.0 and 1.0, while virtualbuckets is the total number of
> buckets (taking multiple batches into account). How is it possible for
> 'innerbucketsize > virtualbuckets' to be true? Am I missing something?

You are right. I've made a mistake here. The changed diff is attached.

--
Regards,
Andrei Lepikhov

fix_bucketsize_v2.diff
Description: Binary data
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Hi!

On 15.06.2023 11:30, Andrey Lepikhov wrote:
> Hi, all.
>
> Some of my clients use JOINs with three or four clauses. Quite
> frequently, I see complaints about an unreasonable switch of the JOIN
> algorithm to Merge Join instead of Hash Join. Quick research has shown
> one weak place - the estimation of the average bucket size in
> final_cost_hashjoin (see q2.sql in the attachment), which uses a very
> conservative strategy.
> Unlike the estimation of groups, here we use the smallest ndistinct
> value across all clauses instead of multiplying them (or trying to do
> multivariate analysis).
> It works fine for the case of one clause. But if we have many clauses,
> and each has a high ndistinct value, we will overestimate the average
> size of a bucket and, as a result, prefer Merge Join. As the example
> in the attachment shows, this leads to a worse plan than possible,
> sometimes drastically worse.
> I assume this is done out of fear of functional dependencies between
> hash clause components. But in my opinion, we should go the same way
> here as in the estimation of groups.
> The attached patch shows a sketch of the solution.

This problem is very important.

Honestly, I'm still learning your code and looking for cases where your
patch can change things for the worse or for the better. But I have
already found something that seemed interesting to me: several cases
where your patch helps to choose a more correct plan, with a focus on
memory consumption.

To make it easier to evaluate, I added a hook to your patch that makes
it easy to switch between your way and the original way of estimating
the size of buckets (diff_estimate.diff).

Here are other cases where your fix improves the query plan.
First of all, I changed the way the tables are created to look at the
behavior of the query plan in terms of planning and execution time:

DROP TABLE IF EXISTS a,b CASCADE;
CREATE TABLE a AS
  SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
  FROM generate_series(1,1e5) AS gs;
CREATE TABLE b AS
  SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
  FROM generate_series(1,1e5) AS gs;
ANALYZE a,b;

SET enable_cost_size = 'on';
EXPLAIN ANALYZE
SELECT * FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;

SET enable_cost_size = 'off';
EXPLAIN ANALYZE
SELECT * FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;

                              QUERY PLAN
---
 Hash Join (actual time=200.872..200.879 rows=0 loops=1)
   Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z))
   ->  Seq Scan on b (actual time=0.029..15.946 rows=10 loops=1)
   ->  Hash (actual time=97.645..97.649 rows=10 loops=1)
         Buckets: 131072  Batches: 1  Memory Usage: 5612kB
         ->  Seq Scan on a (actual time=0.024..17.153 rows=10 loops=1)
 Planning Time: 2.910 ms
 Execution Time: 201.949 ms
(8 rows)

SET
                              QUERY PLAN
---
 Merge Join (actual time=687.415..687.416 rows=0 loops=1)
   Merge Cond: ((b.y = a.y) AND (b.x = a.x) AND (b.z = a.z))
   ->  Sort (actual time=462.022..536.716 rows=10 loops=1)
         Sort Key: b.y, b.x, b.z
         Sort Method: external merge  Disk: 3328kB
         ->  Seq Scan on b (actual time=0.017..12.326 rows=10 loops=1)
   ->  Sort (actual time=111.295..113.196 rows=16001 loops=1)
         Sort Key: a.y, a.x, a.z
         Sort Method: external sort  Disk: 2840kB
         ->  Seq Scan on a (actual time=0.020..10.129 rows=10 loops=1)
 Planning Time: 0.752 ms
 Execution Time: 688.829 ms
(12 rows)

Secondly, I found another case that is not related to the planner
preferring merge join over hash join, but where we can see that the plan
has become better due to lower memory consumption, and it also takes
less planning time.
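Incidentally, table "a" in the script above is exactly the hard case for per-column statistics: its three columns are functionally dependent. A hypothetical check in Python (mirroring the generate_series expressions, not anything the patch runs) makes that visible:

```python
# Hypothetical check (not part of the patch): the columns of table "a"
# above are functionally dependent, so the number of distinct (x, y, z)
# combinations equals the per-column ndistinct, not their product.

distinct_x = {(3 * gs) % 300 for gs in range(1, 100001)}
distinct_xyz = {((3 * gs) % 300, (3 * gs + 1) % 300, (3 * gs + 2) % 300)
                for gs in range(1, 100001)}

print(len(distinct_x))    # 100
print(len(distinct_xyz))  # 100, far from 100**3
```

A cost model that multiplies the per-column ndistincts would predict a million combinations here, while the data only ever produces a hundred, which is why the correlated case needs care.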
Here, with the same query, the planning time was reduced by 5 times and
the number of buckets by 128 times, so memory consumption also
decreased:

DROP TABLE IF EXISTS a,b CASCADE;

CREATE TABLE a AS
  SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
  FROM generate_series(1,600) AS gs;
CREATE TABLE b AS
  SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
  FROM generate_series(1,1e5) AS gs;
ANALYZE a,b;

SET enable_cost_size = 'on';
EXPLAIN ANALYZE
SELECT * FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;

SET enable_cost_size = 'off';
EXPLAIN ANALYZE
SELECT * FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;

                              QUERY PLAN
 Hash Join (cost=20.50..3157.58 rows=8 width=32)
           (actual time=95.648..95.651 rows=0 loops=1)
   Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z = (a.z)::numeric))
   ->  Seq Scan on b (cost=0.00
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
On 3/11/2023 23:43, Tomas Vondra wrote:
> On 9/11/23 10:04, Lepikhov Andrei wrote:
>
>      * Determine bucketsize fraction and MCV frequency for the inner
>      * relation.  We use the smallest bucketsize or MCV frequency estimated
>      * for any individual hashclause; this is undoubtedly conservative.
>
> I'm sure this may lead to inflated cost for "good" cases (where the
> actual bucket size really is a product), which may push the optimizer
> to use the less efficient/slower join method.

Yes, it was a contradictory idea, though.

> IMHO the only principled way forward is to get a better ndistinct
> estimate (which this implicitly does), perhaps by using extended
> statistics. I haven't tried, but I guess it'd need to extract the
> clauses for the inner side, and call estimate_num_groups() on it.

And I've done it. Sorry for such a long delay in responding. This patch
employs extended statistics for the estimation of the HashJoin
bucket_size. In addition, I describe the idea in a more convenient form
here [1]. Note that it needs only the ndistinct statistics to make a
prediction, which reduces the computational cost of using this
statistic.

[1] https://open.substack.com/pub/danolivo/p/why-postgresql-prefers-mergejoin?r=34q1yy&utm_campaign=post&utm_medium=web

--
regards, Andrei Lepikhov

From b4b007970b1d9b99602b8422f2122c8e5738828e Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov"
Date: Mon, 13 May 2024 16:15:02 +0700
Subject: [PATCH] Use extended statistics for precise estimation of bucket size
 in hash join.

Recognizing the real-life complexity where columns in the table often
have functional dependencies, PostgreSQL's estimation of the number of
distinct values over a set of columns can be underestimated (or, much
more rarely, overestimated) when dealing with multi-clause JOINs. In
the case of hash join, it can end up with a small number of predicted
hash buckets and, as a result, picking a non-optimal merge join.
To improve the situation, we introduce one additional stage of bucket
size estimation: having two or more join clauses, the estimator looks
up extended statistics and uses them for multicolumn estimation.
Clauses are grouped into lists, each containing expressions referencing
the same relation. The result of the multicolumn estimation made over
such a list is combined with the others according to the caller's
logic. Clauses that are not estimated are returned to the caller for
further estimation.
---
 src/backend/optimizer/path/costsize.c |  12 +-
 src/backend/utils/adt/selfuncs.c      | 171 ++
 src/include/utils/selfuncs.h          |   4 +
 3 files changed, 186 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ee23ed7835..eafde90d4c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4250,9 +4250,19 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	}
 	else
 	{
+		List	   *otherclauses;
+
 		innerbucketsize = 1.0;
 		innermcvfreq = 1.0;
-		foreach(hcl, hashclauses)
+
+		/* At first, try to estimate bucket size using extended statistics. */
+		otherclauses = estimate_multivariate_bucketsize(root,
+														inner_path->parent,
+														hashclauses,
+														&innerbucketsize);
+
+		/* Pass through the remaining clauses */
+		foreach(hcl, otherclauses)
 		{
 			RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
 			Selectivity thisbucketsize;

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..89ed2d136a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3751,6 +3751,177 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * Try to estimate the bucket size of the hash join inner side when the join
+ * condition contains two or more clauses by employing extended statistics.
+ *
+ * The main idea of this approach is that the ndistinct value produced by
+ * multivariate estimation on two or more columns would provide a smaller
+ * bucket size than estimation on each separate column.
+ *
+ * IMPORTANT: It is crucial to synchronise the approach of combining
+ * different estimations with the caller's method.
+ *
+ * Return a list of clauses that didn't fetch any extended statistics.
+ */
+List *
+estimate_multivariate_bucketsize(PlannerInfo *root, RelOptInfo *inner,
+								 List *hashclauses,
+
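The arithmetic behind the patch's idea can be sketched roughly as follows (assumed formulas for illustration; the real code works on RestrictInfo lists and extended-statistics objects, not plain numbers):

```python
# Illustrative sketch (assumed formulas, not the patch's actual code):
# replace the product of per-column ndistincts with a multivariate
# ndistinct when extended statistics cover the join columns.

def bucketsize_fraction(ndistinct, nbuckets):
    # expected fraction of inner rows falling into one bucket,
    # bounded below by the 1/nbuckets share every bucket receives
    return max(1.0 / ndistinct, 1.0 / nbuckets)

nbuckets = 131072

# three join columns, each with 100 distinct values, but functionally
# dependent, so the true combined ndistinct is also 100
naive_product = 100 ** 3          # what the independence assumption predicts
multivariate = 100                # what ndistinct extended statistics report

print(bucketsize_fraction(naive_product, nbuckets))  # ~7.6e-06: too optimistic
print(bucketsize_fraction(multivariate, nbuckets))   # 0.01: matches the data
```

The point of needing only the ndistinct kind of extended statistics is visible here: the whole estimate reduces to one combined distinct count, with no MCV lists or histograms involved.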