[ https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822079#comment-16822079 ]
Haisheng Yuan commented on CALCITE-2973: ---------------------------------------- EnumerableJoin and EnumerableThetaJoin are not good name for physical operators, we should get rid of them. As physical operators, they failed to indicate how the join is performed. I don't know if there is any other commercial databases showing Join or ThetaJoin in the execution plan. Do we really care about equijoin or non-equijoin? I think what we really care about is whether we can create hash join alternative for it, no matter it is equijoin or non-equijoin. So for non-correlated join we only need 3 physical joins, no more, no less: EnumerableNestedLoopJoin, EnumerableHashJoin, EnumerableMergeJoin. Current EnumerableThetaJoin should be rename to EnumerableNestedLoopJoin, EnumerableJoin should be renamed to EnumerableHashJoin. Both EnumerableHashJoin and EnumerableMergeJoin should be extended to be able to deal with non-equijoin, as long as there is a join condition with equality operator and both sides of the operator uses columns from each single relation. Moreover, it will be nice to enable merge join to deal with join condition with range comparison, as Julian mentioned above. Postgres already gives us a very good example: {code:sql} h.yuan=# explain select * from foo left join bar on foo.a = bar.b and foo.c < 0; QUERY PLAN ----------------------------------------------------------------- Hash Right Join (cost=1.23..40.55 rows=10 width=24) Hash Cond: (bar.b = foo.a) Join Filter: (foo.c < 0) -> Seq Scan on bar (cost=0.00..30.40 rows=2040 width=12) -> Hash (cost=1.10..1.10 rows=10 width=12) -> Seq Scan on foo (cost=0.00..1.10 rows=10 width=12) (6 rows) h.yuan=# set enable_hashjoin=false; h.yuan=# explain select * from foo left join bar on foo.a = bar.b and foo.c < 0; QUERY PLAN -------------------------------------------------------------------- Merge Left Join (cost=143.81..155.33 rows=10 width=24) Merge Cond: (foo.a = bar.b) Join Filter: (foo.c < 0) -> Sort (cost=1.27..1.29 rows=10 width=12) Sort Key: foo.a -> Seq Scan on foo (cost=0.00..1.10 rows=10 width=12) -> Sort (cost=142.54..147.64 rows=2040 width=12) Sort Key: bar.b -> Seq Scan on bar (cost=0.00..30.40 rows=2040 width=12) (9 rows) h.yuan=# set enable_mergejoin=false; SET h.yuan=# explain select * from foo left join bar on foo.a = bar.b and foo.c < 0; QUERY PLAN -------------------------------------------------------------------- Nested Loop Left Join (cost=0.00..393.60 rows=10 width=24) Join Filter: ((foo.c < 0) AND (foo.a = bar.b)) -> Seq Scan on foo (cost=0.00..1.10 rows=10 width=12) -> Materialize (cost=0.00..40.60 rows=2040 width=12) -> Seq Scan on bar (cost=0.00..30.40 rows=2040 width=12) (5 rows) h.yuan=# set enable_hashjoin=true; SET h.yuan=# explain select * from foo left join bar on foo.a+foo.c = bar.b+bar.a; QUERY PLAN ----------------------------------------------------------------- Hash Right Join (cost=1.23..45.40 rows=102 width=24) Hash Cond: ((bar.b + bar.a) = (foo.a + foo.c)) -> Seq Scan on bar (cost=0.00..30.40 rows=2040 width=12) -> Hash (cost=1.10..1.10 rows=10 width=12) -> Seq Scan on foo (cost=0.00..1.10 rows=10 width=12) (5 rows) {code} I don't expect all this happen in a single patch, but hope we can get to the right direction. Just my 2 cents. > Allow theta joins that have equi conditions to be executed using a hash join > algorithm > -------------------------------------------------------------------------------------- > > Key: CALCITE-2973 > URL: https://issues.apache.org/jira/browse/CALCITE-2973 > Project: Calcite > Issue Type: New Feature > Components: core > Affects Versions: 1.19.0 > Reporter: Lai Zhou > Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Now the EnumerableMergeJoinRule only supports an inner and equi join. > If users make a theta-join query for a large dataset (such as 10000*10000), > the nested-loop join process will take dozens of time than the sort-merge > join process . > So if we can apply merge-join or hash-join rule for a theta join, it will > improve the performance greatly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)