[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

Haisheng Yuan (JIRA) Fri, 19 Apr 2019 10:31:19 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822079#comment-16822079
 ]


Haisheng Yuan commented on CALCITE-2973:
----------------------------------------

EnumerableJoin and EnumerableThetaJoin are not good name for physical 
operators, we should get rid of them. As physical operators, they failed to 
indicate how the join is performed. I don't know if there is any other 
commercial databases showing Join or ThetaJoin in the execution plan. Do we 
really care about equijoin or non-equijoin? I think what we really care about 
is whether we can create hash join alternative for it, no matter it is equijoin 
or non-equijoin. So for non-correlated join we only need 3 physical joins, no 
more, no less:
EnumerableNestedLoopJoin, EnumerableHashJoin, EnumerableMergeJoin.

Current EnumerableThetaJoin should be rename to EnumerableNestedLoopJoin, 
EnumerableJoin should be renamed to EnumerableHashJoin. Both EnumerableHashJoin 
and EnumerableMergeJoin should be extended to be able to deal with 
non-equijoin, as long as there is a join condition with equality operator and 
both sides of the operator uses columns from each single relation. Moreover, it 
will be nice to enable merge join to deal with join condition with range 
comparison, as Julian mentioned above.

Postgres already gives us a very good example:

{code:sql}
h.yuan=# explain select * from foo left join bar on foo.a = bar.b and foo.c < 0;
                           QUERY PLAN
-----------------------------------------------------------------
 Hash Right Join  (cost=1.23..40.55 rows=10 width=24)
   Hash Cond: (bar.b = foo.a)
   Join Filter: (foo.c < 0)
   ->  Seq Scan on bar  (cost=0.00..30.40 rows=2040 width=12)
   ->  Hash  (cost=1.10..1.10 rows=10 width=12)
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=12)
(6 rows)

h.yuan=# set enable_hashjoin=false;
h.yuan=# explain select * from foo left join bar on foo.a = bar.b and foo.c < 0;
                             QUERY PLAN
--------------------------------------------------------------------
 Merge Left Join  (cost=143.81..155.33 rows=10 width=24)
   Merge Cond: (foo.a = bar.b)
   Join Filter: (foo.c < 0)
   ->  Sort  (cost=1.27..1.29 rows=10 width=12)
         Sort Key: foo.a
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=12)
   ->  Sort  (cost=142.54..147.64 rows=2040 width=12)
         Sort Key: bar.b
         ->  Seq Scan on bar  (cost=0.00..30.40 rows=2040 width=12)
(9 rows)

h.yuan=# set enable_mergejoin=false;
SET
h.yuan=# explain select * from foo left join bar on foo.a = bar.b and foo.c < 0;
                             QUERY PLAN
--------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..393.60 rows=10 width=24)
   Join Filter: ((foo.c < 0) AND (foo.a = bar.b))
   ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=12)
   ->  Materialize  (cost=0.00..40.60 rows=2040 width=12)
         ->  Seq Scan on bar  (cost=0.00..30.40 rows=2040 width=12)
(5 rows)

h.yuan=# set enable_hashjoin=true;
SET
h.yuan=# explain select * from foo left join bar on foo.a+foo.c = bar.b+bar.a;
                           QUERY PLAN
-----------------------------------------------------------------
 Hash Right Join  (cost=1.23..45.40 rows=102 width=24)
   Hash Cond: ((bar.b + bar.a) = (foo.a + foo.c))
   ->  Seq Scan on bar  (cost=0.00..30.40 rows=2040 width=12)
   ->  Hash  (cost=1.10..1.10 rows=10 width=12)
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=12)
(5 rows)
{code}

I don't expect all this happen in a single patch, but hope we can get to the 
right direction. Just my 2 cents.


> Allow theta joins that have equi conditions to be executed using a hash join 
> algorithm
> --------------------------------------------------------------------------------------
>
>                 Key: CALCITE-2973
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2973
>             Project: Calcite
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 1.19.0
>            Reporter: Lai Zhou
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now the EnumerableMergeJoinRule only supports an inner and equi join.
> If users make a theta-join query  for a large dataset (such as 10000*10000), 
> the nested-loop join process will take dozens of time than the sort-merge 
> join process .
> So if we can apply merge-join or hash-join rule for a theta join, it will 
> improve the performance greatly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

Reply via email to