[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-11 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815943#comment-16815943
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~julianhyde],[~zabetak],[~hyuan]

I have opened a PR to improve EnumerableJoin.

Since EnumerableMergeJoin is never chosen, I changed the summary to "Allow theta 
joins that have equi conditions to be executed using a hash join algorithm."

A join rel node is now converted to an EnumerableJoin if it has mixed 
equi and non-equi conditions.

See 
[https://github.com/apache/calcite/blob/2251c82f209612d8ae31e2e7a42acdb2bcb15d55/core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableJoinRule.java#L62|https://github.com/apache/calcite/blob/2251c82f209612d8ae31e2e7a42acdb2bcb15d55/core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableJoinRule.java#L62]

EnumerableJoin can now handle a per-row condition: I introduced a 
remainCondition from which the join predicate is generated.

See

[https://github.com/apache/calcite/blob/2251c82f209612d8ae31e2e7a42acdb2bcb15d55/core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableJoin.java#L250|https://github.com/apache/calcite/blob/2251c82f209612d8ae31e2e7a42acdb2bcb15d55/core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableJoin.java#L250]

I also introduced a new method that supports a join with a predicate; it does 
not affect the existing join method.

See

[https://github.com/apache/calcite/blob/2251c82f209612d8ae31e2e7a42acdb2bcb15d55/linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java#L1061|https://github.com/apache/calcite/blob/2251c82f209612d8ae31e2e7a42acdb2bcb15d55/linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java#L1061]

 

 

 

> Allow theta joins that have equi conditions to be executed using a hash join 
> algorithm
> --
>
> Key: CALCITE-2973
> URL: https://issues.apache.org/jira/browse/CALCITE-2973
> Project: Calcite
> Issue Type: New Feature
> Components: core
> Affects Versions: 1.19.0
> Reporter: Lai Zhou
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently the EnumerableMergeJoinRule only supports inner equi joins.
> If users run a theta-join query on a large dataset (such as 1*1), 
> the nested-loop join takes dozens of times longer than a sort-merge 
> join.
> So if we can apply a merge-join or hash-join rule to a theta join, it will 
> improve performance greatly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-12 Thread Danny Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816069#comment-16816069
 ] 

Danny Chan commented on CALCITE-2973:
-

Hi [~hhlai1990], I think this issue is closely related to CALCITE-2969, 
so I have added a link here. In the patch, I deprecate EnumerableThetaJoin but do 
not change the implementation of the algorithm; I plan to implement it in 
another patch.


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-16 Thread Stamatis Zampetakis (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818940#comment-16818940
 ] 

Stamatis Zampetakis commented on CALCITE-2973:
--

In terms of code re-use, it would seem more natural to handle only the equality 
part of the condition in the join and leave the remaining condition to be handled 
afterwards. As Julian mentioned, when outer joins are involved the filter 
cannot be applied after the join, but I have the impression that a projection 
could achieve the same result (i.e., nullify the left/right part when a certain 
condition holds). The additional benefit is that if we could break a theta join 
into an equijoin plus filter/projection (using a rule), this could be exploited 
by more users. 

In terms of semantics, having the join operator do all the work is more 
intuitive and the plan is easier to understand, so in the end I haven't made up 
my mind about which approach is best.  



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-16 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819701#comment-16819701
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~zabetak], many thanks for your suggestion. I'll take it into account and 
get back to you later.


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-18 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820902#comment-16820902
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~zabetak], I can't find a good way to break a theta join into an equi-join + 
filter/projection, and I think it would also make the rules hard to understand.

But I found another simple and clear way; please see the latest commit: 
[https://github.com/apache/calcite/pull/1156/files]

We still keep EquiJoin as a pure equi join without a remaining condition.

For a theta join, as Calcite defines it in EnumerableJoinRule,
{code:java}
!info.isEqui() && join.getJoinType() != JoinRelType.INNER{code}
 

if it has equi keys, we can use a hash join or merge join instead of a 
nested-loop join to improve performance.

So I introduced a new join rel named `EnumerableThetaHashJoin`. In addition, 
I found that the algorithm for a pure hash join differs from the one for a 
hash join with a remaining condition:

When we implement a pure hash join, we only need to compare the hash join keys, 
but when we implement a hash join with a remaining condition, we also need to 
compare some other columns to find the unmatched records.

So I introduced a new method named `thetaHashJoin` in EnumerableDefaults.
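
As an illustration of the difference described above (this is not the code from the PR), here is a minimal, self-contained sketch of a left outer hash join that accepts an extra per-row predicate. The class name `ThetaHashJoinSketch`, the method `leftOuterJoin`, and the toy row layout are all hypothetical; null join keys are not given SQL semantics here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Function;

/** Hypothetical sketch; not Calcite's EnumerableDefaults. */
final class ThetaHashJoinSketch {
  private ThetaHashJoinSketch() {}

  /**
   * Left outer hash join. Two rows pair up only if their keys are equal AND
   * the residual predicate holds. A left row with no surviving partner is
   * emitted once, padded with a null right side.
   */
  static <L, R, K> List<List<Object>> leftOuterJoin(
      List<L> left, List<R> right,
      Function<L, K> leftKey, Function<R, K> rightKey,
      BiPredicate<L, R> residual) {
    // Build phase: hash the right input on its equi key.
    Map<K, List<R>> table = new HashMap<>();
    for (R r : right) {
      table.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
    }
    // Probe phase: a key match is only a candidate; the residual decides.
    List<List<Object>> out = new ArrayList<>();
    for (L l : left) {
      boolean matched = false;
      for (R r : table.getOrDefault(leftKey.apply(l), Collections.emptyList())) {
        if (residual.test(l, r)) {
          out.add(Arrays.asList(l, r));
          matched = true;
        }
      }
      if (!matched) {
        // "Unmatched" now also covers rows whose key matched but whose
        // residual condition failed for every candidate.
        out.add(Arrays.asList(l, null));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Toy rows: emp = (empno, deptno, sal), dept = (deptno).
    List<List<Integer>> emp = Arrays.asList(
        Arrays.asList(1, 10, 500),
        Arrays.asList(2, 10, 2000),
        Arrays.asList(3, 30, 100));
    List<List<Integer>> dept = Arrays.asList(
        Arrays.asList(10), Arrays.asList(20));
    // ON emp.deptno = dept.deptno AND emp.sal < 1000
    System.out.println(leftOuterJoin(emp, dept,
        e -> e.get(1), d -> d.get(0), (e, d) -> e.get(2) < 1000));
  }
}
```

In the toy data, employee 2 has a matching key but fails the residual condition, so it is emitted with a null right side, just like employee 3, which has no matching key at all.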

 

 

 


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-18 Thread Haisheng Yuan (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821642#comment-16821642
 ] 

Haisheng Yuan commented on CALCITE-2973:


I don't like the idea of breaking theta join into equi-join and filter/project. 


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-18 Thread Haisheng Yuan (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821654#comment-16821654
 ] 

Haisheng Yuan commented on CALCITE-2973:


{quote}
For the record, here is a non-equi join where you can use a hash join plus a 
per-row condition, where the 'equi' part of the condition is fairly selective 
and therefore hash join makes sense. It has to be done as a per-row condition, 
rather than a filter after the join, because of the left outer.
{quote}
[~julianhyde] We have different definitions of equi join. Oracle's 
definition of equi join makes more sense:
https://docs.oracle.com/cd/E11882_01/server.112/e41084/queries006.htm#SQLRF52350
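
The quoted point about the left outer join can be checked on toy data. The sketch below is only an illustration with made-up tables `l(k, v)` and `r(k)` and a hypothetical class name; it uses plain nested loops as a semantic reference and compares evaluating `l.v < 0` inside the join condition with applying it as a filter after the join.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy comparison; table and column names are made up for illustration. */
final class OuterJoinFilterSketch {
  private OuterJoinFilterSketch() {}

  /** l LEFT JOIN r ON l.k = r.k AND l.v < 0, condition checked per pair. */
  static List<String> conditionInsideJoin(int[][] l, int[] r) {
    List<String> out = new ArrayList<>();
    for (int[] row : l) {
      boolean matched = false;
      for (int rk : r) {
        if (row[0] == rk && row[1] < 0) {
          out.add(row[0] + "," + row[1] + "," + rk);
          matched = true;
        }
      }
      if (!matched) {
        // Left outer semantics: keep the left row, null-pad the right side.
        out.add(row[0] + "," + row[1] + ",null");
      }
    }
    return out;
  }

  /** (l LEFT JOIN r ON l.k = r.k) followed by a separate filter l.v < 0. */
  static List<String> filterAfterJoin(int[][] l, int[] r) {
    // Step 1: left outer join on the key only.
    List<Object[]> joined = new ArrayList<>();
    for (int[] row : l) {
      boolean matched = false;
      for (int rk : r) {
        if (row[0] == rk) {
          joined.add(new Object[] {row[0], row[1], rk});
          matched = true;
        }
      }
      if (!matched) {
        joined.add(new Object[] {row[0], row[1], null});
      }
    }
    // Step 2: filter the joined rows; failing rows disappear entirely.
    List<String> out = new ArrayList<>();
    for (Object[] j : joined) {
      if ((Integer) j[1] < 0) {
        out.add(j[0] + "," + j[1] + "," + j[2]);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    int[][] l = {{1, -5}, {2, 7}};
    int[] r = {1, 2};
    System.out.println(conditionInsideJoin(l, r));
    System.out.println(filterAfterJoin(l, r));
  }
}
```

With these inputs the first variant keeps the left row `(2, 7)` with a null right side, while the second variant drops it, which is why the condition has to be evaluated per row inside the outer join.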


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-19 Thread Haisheng Yuan (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822079#comment-16822079
 ] 

Haisheng Yuan commented on CALCITE-2973:


EnumerableJoin and EnumerableThetaJoin are not good names for physical 
operators; we should get rid of them. As physical operators, they fail to 
indicate how the join is performed. I don't know of any 
commercial database that shows Join or ThetaJoin in the execution plan. Do we 
really care about equijoin or non-equijoin? I think what we really care about 
is whether we can create a hash join alternative for it, whether it is an equijoin 
or a non-equijoin. So for non-correlated joins we only need 3 physical joins, no 
more, no less:
EnumerableNestedLoopJoin, EnumerableHashJoin, EnumerableMergeJoin.

The current EnumerableThetaJoin should be renamed to EnumerableNestedLoopJoin, and 
EnumerableJoin should be renamed to EnumerableHashJoin. Both EnumerableHashJoin 
and EnumerableMergeJoin should be extended to be able to deal with 
non-equijoins, as long as there is a join condition with an equality operator and 
each side of the operator uses columns from a single relation. Moreover, it 
would be nice to enable merge join to deal with join conditions with range 
comparisons, as Julian mentioned above.

Postgres already gives us a very good example:

{code:sql}
h.yuan=# explain select * from foo left join bar on foo.a = bar.b and foo.c < 0;
                           QUERY PLAN
-----------------------------------------------------------------
 Hash Right Join  (cost=1.23..40.55 rows=10 width=24)
   Hash Cond: (bar.b = foo.a)
   Join Filter: (foo.c < 0)
   ->  Seq Scan on bar  (cost=0.00..30.40 rows=2040 width=12)
   ->  Hash  (cost=1.10..1.10 rows=10 width=12)
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=12)
(6 rows)

h.yuan=# set enable_hashjoin=false;
h.yuan=# explain select * from foo left join bar on foo.a = bar.b and foo.c < 0;
                             QUERY PLAN
--------------------------------------------------------------------
 Merge Left Join  (cost=143.81..155.33 rows=10 width=24)
   Merge Cond: (foo.a = bar.b)
   Join Filter: (foo.c < 0)
   ->  Sort  (cost=1.27..1.29 rows=10 width=12)
         Sort Key: foo.a
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=12)
   ->  Sort  (cost=142.54..147.64 rows=2040 width=12)
         Sort Key: bar.b
         ->  Seq Scan on bar  (cost=0.00..30.40 rows=2040 width=12)
(9 rows)

h.yuan=# set enable_mergejoin=false;
SET
h.yuan=# explain select * from foo left join bar on foo.a = bar.b and foo.c < 0;
                             QUERY PLAN
--------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..393.60 rows=10 width=24)
   Join Filter: ((foo.c < 0) AND (foo.a = bar.b))
   ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=12)
   ->  Materialize  (cost=0.00..40.60 rows=2040 width=12)
         ->  Seq Scan on bar  (cost=0.00..30.40 rows=2040 width=12)
(5 rows)

h.yuan=# set enable_hashjoin=true;
SET
h.yuan=# explain select * from foo left join bar on foo.a+foo.c = bar.b+bar.a;
                           QUERY PLAN
-----------------------------------------------------------------
 Hash Right Join  (cost=1.23..45.40 rows=102 width=24)
   Hash Cond: ((bar.b + bar.a) = (foo.a + foo.c))
   ->  Seq Scan on bar  (cost=0.00..30.40 rows=2040 width=12)
   ->  Hash  (cost=1.10..1.10 rows=10 width=12)
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=12)
(5 rows)
{code}

I don't expect all of this to happen in a single patch, but I hope we can move in the 
right direction. Just my 2 cents.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-19 Thread Haisheng Yuan (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822133#comment-16822133
 ] 

Haisheng Yuan commented on CALCITE-2973:


Every physical join, whether inner or outer, should have both 
equi-conditions and non-equi-conditions, with an empty list when there are no such 
conditions. NestedLoopJoin is a special case: because it doesn't care whether a 
condition is equi or not, we can combine the 2 parts into 1 for NLJ. It is not 
worthwhile to have join methods specifically for equi-only conditions, which would 
be hard to maintain.
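
One way to picture this proposal is a small structure that splits an AND-ed join condition into equi key pairs and a (possibly empty) list of remaining conjuncts. The sketch below is a toy model under stated assumptions: `Conjunct`, `Split`, and `split` are hypothetical names, conditions are reduced to "column op column" shapes, and nothing here is Calcite's actual RexNode or JoinInfo API. Merge join is left out of the selection for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a join condition; not Calcite's RexNode or JoinInfo. */
final class JoinConditionSplitSketch {
  private JoinConditionSplitSketch() {}

  /** One conjunct of an AND-ed join condition. */
  static final class Conjunct {
    final String op;         // comparison operator, e.g. "=" or "<"
    final Integer leftCol;   // left-input column index, or null if not a plain column
    final Integer rightCol;  // right-input column index, or null if not a plain column
    final String text;       // human-readable form, for display only

    Conjunct(String op, Integer leftCol, Integer rightCol, String text) {
      this.op = op;
      this.leftCol = leftCol;
      this.rightCol = rightCol;
      this.text = text;
    }

    /** An equi key is "=" between one left column and one right column. */
    boolean isEquiKey() {
      return "=".equals(op) && leftCol != null && rightCol != null;
    }
  }

  /** Result of the split: key pairs plus whatever is left over. */
  static final class Split {
    final List<int[]> equiKeys = new ArrayList<>();    // {leftCol, rightCol}
    final List<Conjunct> nonEqui = new ArrayList<>();  // empty for a pure equi join

    /** Hash join needs at least one equi key; otherwise fall back to nested loop. */
    String algorithm() {
      return equiKeys.isEmpty() ? "EnumerableNestedLoopJoin" : "EnumerableHashJoin";
    }
  }

  static Split split(List<Conjunct> conjuncts) {
    Split s = new Split();
    for (Conjunct c : conjuncts) {
      if (c.isEquiKey()) {
        s.equiKeys.add(new int[] {c.leftCol, c.rightCol});
      } else {
        s.nonEqui.add(c);
      }
    }
    return s;
  }
}
```

For a condition such as `foo.a = bar.b AND foo.c < 0`, this yields one key pair and one remaining conjunct, so the hash join alternative is available; a condition with no equality between the two inputs yields no keys and falls back to the nested-loop join.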


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-20 Thread Stamatis Zampetakis (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822387#comment-16822387
 ] 

Stamatis Zampetakis commented on CALCITE-2973:
--

bq. Both EnumerableHashJoin and EnumerableMergeJoin should be extended to be 
able to deal with non-equijoin, as long as there is a join condition with 
equality operator and both sides of the operator uses columns from each single 
relation

I think this JIRA/PR is doing the job for EnumerableHashJoin, so I guess we 
could merge it quite easily. 

[~hhlai1990], as [~hyuan] said, probably the only thing that needs to change 
is that, instead of creating a new class EnumerableThetaHashJoin, you should 
modify EnumerableJoin.

The various renamings of joins are in progress in CALCITE-2969, so I don't 
think we should worry about that here. 



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-22 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823090#comment-16823090
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~zabetak], [~hyuan], should we keep EnumerableJoin as an `EquiJoin` or change 
it to extend `Join`?

I tried changing it to extend `Join`, but then the FilterJoinRule does not work: 
it cannot correctly push the remaining condition down into a filter after an 
inner join.

see 
[https://github.com/apache/calcite/blob/ee83efd360793ef4201f4cdfc2af8d837b76ca69/core/src/main/java/org/apache/calcite/rel/rules/FilterJoinRule.java#L165|https://github.com/apache/calcite/blob/ee83efd360793ef4201f4cdfc2af8d837b76ca69/core/src/main/java/org/apache/calcite/rel/rules/FilterJoinRule.java#L165]

If we keep EnumerableJoin as an `EquiJoin`, we need to introduce a field in 
EnumerableJoin that references the join condition, because we need to extract 
the remaining part of the condition from it. So which way is better?

 


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-04-22 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823659#comment-16823659
 ] 

Lai Zhou commented on CALCITE-2973:
---

I modified EnumerableJoin so that it can deal with a non-equi join that has equi 
conditions.

I didn't rename EnumerableJoin this time; we can rename it to 
`EnumerableHashJoin` in a later patch.

The `join_` method in EnumerableDefaults now implements the hash join algorithm 
for a join whether or not it has a non-equi condition.

 

 


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-05 Thread Stamatis Zampetakis (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833468#comment-16833468
 ] 

Stamatis Zampetakis commented on CALCITE-2973:
--

I had a quick look at the PR and it seems to be in good shape. Let's try to get it 
into 1.20.0. I will try to do a proper review when I find some time.

> Allow theta joins that have equi conditions to be executed using a hash join 
> algorithm
> --
>
> Key: CALCITE-2973
> URL: https://issues.apache.org/jira/browse/CALCITE-2973
> Project: Calcite
> Issue Type: New Feature
> Components: core
> Affects Versions: 1.19.0
> Reporter: Lai Zhou
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.20.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently the EnumerableMergeJoinRule only supports inner equi joins.
> If users run a theta-join query on a large dataset (such as 1*1), 
> the nested-loop join takes dozens of times longer than a sort-merge 
> join.
> So if we can apply a merge-join or hash-join rule to a theta join, it will 
> improve performance greatly.





[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-10 Thread Stamatis Zampetakis (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837476#comment-16837476
 ] 

Stamatis Zampetakis commented on CALCITE-2973:
--

Today I was looking again into another case (CALCITE-2898) where there is a 
need to perform a hash join with a join condition that is not strictly an 
equijoin.

*Example from CALCITE-2898*
{code:sql}
SELECT e.name
FROM emp e
INNER JOIN department d 
  ON e.address.zipcode = d.zipcode
{code}
As you can observe the condition incorporates RexFieldAccess so it does not 
satisfy the requirement RexInputRef = RexInputRef. I have the impression that 
the PR in this issue does not handle this case. We could transform this theta 
join to an equijoin by introducing an additional projection below the join.

*Example from CALCITE-2973*
{code:sql}
SELECT e.ename, d.name
FROM emp e
LEFT JOIN dept d
  ON e.deptno = d.deptno  AND e.sal < 1
{code}
The query above has the following logical plan.
{noformat}
LogicalProject(ENAME=[$1], NAME=[$11])
  LogicalJoin(condition=[AND(=($7, $10), $9)], joinType=[left])
    LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8], $f9=[<($5, 1)])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
    LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
{noformat}
I think it is equivalent to the plans below.
{noformat}
LogicalProject(ENAME=[$1], NAME=[$11])
  LogicalProject($0..$9, EX$10=[CASE($9,$10,null)], EX$11=[CASE($9,$11,null)], ...)
    LogicalJoin(condition=[=($7, $10)], joinType=[left])
      LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8], $f9=[<($5, 1)])
        LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
{noformat}
(after merging the projections)
{noformat}
LogicalProject(ENAME=[$1], NAME=[CASE($9,$11,null)])
  LogicalJoin(condition=[=($7, $10)], joinType=[left])
    LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8], $f9=[<($5, 1)])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
    LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
{noformat}
Observe that both cases mentioned above can be solved by adding projections 
above/below the join operator without touching the join itself at all. 
Because of this, and given that most join algorithms in the literature cannot handle 
theta joins, I am back again at the question below:

Should we modify the implementation of our join algorithms, or rather 
introduce new rule(s) (e.g., ThetaJoinToEquiJoinRule) that can perform 
transformations like those demonstrated above?

I think the rule-based approach could be useful for a wider 
audience, so I would again like your input on this (in particular from [~hyuan], 
who seemed to be rather against this approach).
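
The CASE-based rewrite in the plans above can be tried on toy data. The following is a minimal sketch, not a proposed rule implementation: the class name and row types are hypothetical, the salary bound is passed in as a parameter, and it assumes that `dept.deptno` is unique (as a primary key would be) and that the non-equi condition refers only to `emp`. Under those assumptions the two forms return the same rows in this sketch; with duplicate `deptno` values in `dept`, the rewrite emits one null-padded row per key match, so a rule along these lines would probably need a guard.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy check of the CASE-projection rewrite; all names are hypothetical. */
final class CaseProjectionRewriteSketch {
  private CaseProjectionRewriteSketch() {}

  static final class Emp {
    final String ename;
    final int deptno;
    final int sal;

    Emp(String ename, int deptno, int sal) {
      this.ename = ename;
      this.deptno = deptno;
      this.sal = sal;
    }
  }

  static final class Dept {
    final int deptno;
    final String name;

    Dept(int deptno, String name) {
      this.deptno = deptno;
      this.name = name;
    }
  }

  /** emp LEFT JOIN dept ON e.deptno = d.deptno AND e.sal < limit. */
  static List<String> thetaLeftJoin(List<Emp> emps, List<Dept> depts, int limit) {
    List<String> out = new ArrayList<>();
    for (Emp e : emps) {
      boolean matched = false;
      for (Dept d : depts) {
        if (e.deptno == d.deptno && e.sal < limit) {
          out.add(e.ename + "," + d.name);
          matched = true;
        }
      }
      if (!matched) {
        out.add(e.ename + ",null");
      }
    }
    return out;
  }

  /** Equi left join, then NAME = CASE WHEN e.sal < limit THEN d.name ELSE NULL END. */
  static List<String> equiJoinThenCase(List<Emp> emps, List<Dept> depts, int limit) {
    List<String> out = new ArrayList<>();
    for (Emp e : emps) {
      boolean matched = false;
      for (Dept d : depts) {
        if (e.deptno == d.deptno) {
          String name = e.sal < limit ? d.name : null;  // the CASE projection
          out.add(e.ename + "," + name);
          matched = true;
        }
      }
      if (!matched) {
        out.add(e.ename + ",null");
      }
    }
    return out;
  }
}
```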


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-12 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838246#comment-16838246
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~zabetak], regarding the query you mentioned,
{code:sql}
SELECT e.name FROM emp e INNER JOIN department d ON e.address.zipcode = d.zipcode
{code}
I added a test for it and found that the RexFieldAccess `e.address.zipcode` is 
converted to a new RexInputRef by JoinPushExpressionsRule;

see 
[https://github.com/apache/calcite/blob/6afa38bae794462e6e250237a1b60cc4220b2885/core/src/main/java/org/apache/calcite/plan/RelOptUtil.java#L3290].

Please see the latest commit; there is a test named 
`leftOuterJoinWithPredicateContainsRexFieldAccess` in EnumerableJoinTest.

I admit the rule-based approach you proposed would also work for this issue. But I 
still think it's a little complicated, and introducing a new projection seems to 
add computational overhead.

 


[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-13 Thread Ruben Quesada Lopez (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838592#comment-16838592
 ] 

Ruben Quesada Lopez commented on CALCITE-2973:
--

[~hhlai1990], I have just checked the PR, and it looks very promising.
I have a small question that I would like to share. Regarding "partial equi" 
joins, there are two possibilities:
- 1. Partial equi non-inner join: Enumerable(Hash)Join with remaining condition
- 2. Partial equi inner join: Enumerable(Hash)Join (remaining condition null) 
+ EnumerableFilter with remaining condition

For the sake of consistency and code simplicity, I wonder what the advantage 
of 2 over 1 is (if any), and whether we should remove option 2 and handle the 
inner join case using approach 1 as well.
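
One way to see why the "join + filter" shape is limited to inner joins is to compare the two shapes on a LEFT join. The following is a minimal sketch under stated assumptions: rows are modeled as `int[]{key, value}`, nested loops stand in for the hash lookup for brevity, and the class and method names (`JoinVsFilterSketch`, `leftJoinWithPredicate`, `leftJoinThenFilter`) are hypothetical and not part of Calcite or of the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Illustrative only; rows are int[]{key, value} and none of these names come from Calcite. */
final class JoinVsFilterSketch {

  /** Option 1 shape: LEFT join on equal keys, extra predicate evaluated inside the join. */
  static List<String> leftJoinWithPredicate(
      List<int[]> left, List<int[]> right, BiPredicate<int[], int[]> extra) {
    List<String> out = new ArrayList<>();
    for (int[] l : left) {
      boolean matched = false;
      for (int[] r : right) {
        if (l[0] == r[0] && extra.test(l, r)) {
          matched = true;
          out.add(l[1] + "-" + r[1]);
        }
      }
      if (!matched) {
        out.add(l[1] + "-null"); // the left row survives, padded with null
      }
    }
    return out;
  }

  /** Option 2 shape: LEFT join on equal keys only, then a filter on the joined rows. */
  static List<String> leftJoinThenFilter(
      List<int[]> left, List<int[]> right, BiPredicate<int[], int[]> extra) {
    List<int[][]> joined = new ArrayList<>();
    for (int[] l : left) {
      boolean matched = false;
      for (int[] r : right) {
        if (l[0] == r[0]) {
          matched = true;
          joined.add(new int[][] {l, r});
        }
      }
      if (!matched) {
        joined.add(new int[][] {l, null});
      }
    }
    List<String> out = new ArrayList<>();
    for (int[][] pair : joined) {
      // A comparison against a null-padded side is not TRUE, so the filter drops the row.
      if (pair[1] != null && extra.test(pair[0], pair[1])) {
        out.add(pair[0][1] + "-" + pair[1][1]);
      }
    }
    return out;
  }
}
```

With a left row `{1, 10}`, a right row `{1, 20}` and the extra predicate `left.value > right.value`, the first method returns `[10-null]` while the second returns an empty list. For an INNER join both shapes drop the row, which is why the filter variant is only a valid option there.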





[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-13 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838611#comment-16838611
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~rubenql], I agree with you. It's a good idea to use approach 1 to handle 
the inner join case.

 



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-14 Thread Stamatis Zampetakis (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839561#comment-16839561
 ] 

Stamatis Zampetakis commented on CALCITE-2973:
--

It seems that the majority ([~hhlai1990], [~hyuan], [~julianhyde], [~rubenql]) 
believes that changing the operator is better (or at least less complex) than 
adding a new rule. If that's the case I am willing to follow. 

[~rubenql] from your comments it seems that you have done a rather exhaustive 
review. Don't hesitate to merge the PR if you think it is done. You can mark it 
as LGTM-will-merge-soon and if nobody complains over the next few days you can 
proceed with the merge.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-14 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839965#comment-16839965
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~rubenql], now an inner join with a remainCondition is no longer converted to an 
inner join plus a filter; the Enumerable(Hash)Join can handle it.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-15 Thread Ruben Quesada Lopez (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840140#comment-16840140
 ] 

Ruben Quesada Lopez commented on CALCITE-2973:
--

The problem seems to be in {{SemiJoinRule.java}}, which "creates a SemiJoin 
from a Join on top of an Aggregate". The PR generates a Join with 
condition=true and remainCondition={{>($2, $0)}}, but the SemiJoinRule (and, I 
would tend to say, any existing rule involving a Join as operator) is not aware 
of this new "remainCondition" attribute. It just takes the join condition 
(i.e. 'true') to create the SemiJoin, and the remainCondition is lost.
This specific issue might be solved by modifying {{Join#analyzeCondition}} (and 
maybe subsequently {{JoinInfo#of}}), so that any join having a non-null (and 
non-always-true) remainCondition yields a NonEquiJoinInfo.
In any case, looking into this, I'm starting to have some doubts about the 
solution proposed in this PR, because it can potentially break any rule 
involving a Join: from now on, such rules (and any new ones to be created) 
would have to consider the remainCondition predicate when processing their 
operators and generating their output, and I fear that could easily be missed.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-15 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840190#comment-16840190
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~rubenql], thanks, I understand now.

When a SemiJoin is created from an EnumerableJoin, the remainCondition is 
lost. This brings us back to my previous question:

Should we define EnumerableJoin as an EquiJoin or as a pure Join? If it is an 
EquiJoin, the condition contains only the equi part.

If we change EnumerableJoin to a pure Join, it will cause other 
problems; for example, the FilterJoinRule will no longer work.

My initial solution was to introduce an EnumerableThetaHashJoin to handle 
non-inner joins that contain a remainCondition.

This EnumerableThetaHashJoin is more like an EnumerableThetaJoin, which is a 
Join rather than an EquiJoin.

EnumerableThetaHashJoin and Enumerable(Hash)Join can share the same hash 
join algorithm.

I think this solution is clearer and does no harm to the current rules.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-15 Thread Ruben Quesada Lopez (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840194#comment-16840194
 ] 

Ruben Quesada Lopez commented on CALCITE-2973:
--

Digging into my previous comment, I believe that there is an alternative that 
may do the job without breaking things (I think):
- EnumerableJoin will no longer extend EquiJoin.
- It will continue having only the original "condition" field (no need to add 
remainCondition as a new field).
- The "condition" will now be any type of condition (equi / non-equi)
- EnumerableJoinRule will generate EnumerableThetaJoin (i.e. NestedLoopJoin) 
for "pure non equi-joins"
- EnumerableJoinRule will generate EnumerableJoin (i.e. HashJoin, with possibly 
extra predicate) for "pure and partial equi-joins"
- Inside the EnumerableJoin#implement method, the "remainCondition" will be 
calculated on the fly, using the {{Join#analyzeCondition}} (or {{JoinInfo#of}}) 
method. If the condition is pure equi, the remainCondition will be null; 
otherwise, the remaining condition will be taken from 
{{NonEquiJoinInfo.remaining}} and passed to the new 
{{EnumerableJoin#generatePredicate}} method to create the extra predicate for 
{{BuiltInMethod.HASH_JOIN.method}}, which can remain as it is right 
now in the PR.
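
The "calculate the remaining condition on the fly" step can be sketched roughly as follows. This is a simplified model, not Calcite code: the `Conjunct` and `Split` types and the `split` method are hypothetical stand-ins for what the comment attributes to {{Join#analyzeCondition}} / {{JoinInfo#of}}, and the condition is assumed to be a plain conjunction of binary comparisons over field ordinals of the joined row.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; these types are hypothetical and do not exist in Calcite. */
final class ConditionSplitSketch {

  /** One conjunct "op($a, $b)" over the joined row (left fields first, then right fields). */
  static final class Conjunct {
    final String op;
    final int a;
    final int b;

    Conjunct(String op, int a, int b) {
      this.op = op;
      this.a = a;
      this.b = b;
    }

    @Override public String toString() {
      return op + "($" + a + ", $" + b + ")";
    }
  }

  /** Result of the split: hash keys for each side plus whatever could not become a key. */
  static final class Split {
    final List<Integer> leftKeys = new ArrayList<>();
    final List<Integer> rightKeys = new ArrayList<>();
    final List<Conjunct> remaining = new ArrayList<>();

    boolean isEqui() {
      return remaining.isEmpty();
    }
  }

  /** Splits a conjunction into equi key pairs and a remaining (non-equi) part. */
  static Split split(List<Conjunct> conjuncts, int leftFieldCount) {
    Split s = new Split();
    for (Conjunct c : conjuncts) {
      boolean aLeft = c.a < leftFieldCount;
      boolean bLeft = c.b < leftFieldCount;
      if (c.op.equals("=") && aLeft != bLeft) {
        // An equality between one left field and one right field becomes a hash key pair.
        s.leftKeys.add(aLeft ? c.a : c.b);
        s.rightKeys.add((aLeft ? c.b : c.a) - leftFieldCount);
      } else {
        // Everything else is kept for the extra per-row predicate.
        s.remaining.add(c);
      }
    }
    return s;
  }
}
```

For a condition `=($0, $2) AND >($3, $1)` with two left fields, this yields left key 0, right key 0 and a remaining part `>($3, $1)`; a pure equi condition leaves `remaining` empty, which corresponds to the "remainCondition will be null" case above.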

[~hhlai1990], I'm not sure if my explanation above is clear; let me know if you 
have any questions or see any issues with the logic.




[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-15 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840251#comment-16840251
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~rubenql], good analysis. I tested this solution, but some tests still 
fail. The report:
{code:java}
[ERROR] Tests run: 5018, Failures: 47, Errors: 7, Skipped: 115
[ERROR] Errors:
[ERROR] LatticeSuggesterTest.testEmpDept:76 » IndexOutOfBounds index (8) must be less ...
[ERROR] LatticeSuggesterTest.testExpressionInAggregate:272 » IndexOutOfBounds index (3...
[ERROR] LatticeSuggesterTest.testFoodMartAll:389->checkFoodMartAll:301 » IndexOutOfBounds
[ERROR] LatticeSuggesterTest.testFoodMartAllEvolve:393->checkFoodMartAll:301 » IndexOutOfBounds
[ERROR] LatticeSuggesterTest.testFoodmart:153 » IndexOutOfBounds index (17) must be le...
[ERROR] LatticeSuggesterTest.testSharedSnowflake:264 » IndexOutOfBounds index (31) mus...
[ERROR] MaterializationTest.testJoinMaterialization9:1825->checkMaterialize:202->checkMaterialize:210 » SQL
{code}
Looking at LatticeSuggesterTest.testSharedSnowflake, I found that the check 
{{!join.analyzeCondition().isEqui()}} breaks this query.

If I keep the line as 

 
{code:java}
!(join instanceof EquiJoin)
{code}
almost all of the reported failing tests pass, except 
MaterializationTest.testJoinMaterialization9. You can change this line to see 
more details. I think this modification is not safe.

 

 



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-15 Thread Ruben Quesada Lopez (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840269#comment-16840269
 ] 

Ruben Quesada Lopez commented on CALCITE-2973:
--

[~hhlai1990], I was just running the tests locally and I reached the same 
conclusion as you. I'll try to come up with a solution, but it seems tricky due 
to the "EquiJoin-oriented" design.
Otherwise, I think that your proposed solution of having a new 
EnumerableThetaHashJoin, and keeping the existing EnumerableThetaJoin (to be 
renamed as EnumerableNestedLoopJoin) and EnumerableJoin (to be renamed as 
EnumerableHashJoin), is the most straightforward and least harmful solution. But 
I'm not sure if others will agree.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-30 Thread Michael Mior (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851820#comment-16851820
 ] 

Michael Mior commented on CALCITE-2973:
---

That solution seems fine to me. [~hhlai1990] do you think this can be resolved 
in the next few days to make it into 1.20.0 or can we push this to the next 
version?



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-30 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852674#comment-16852674
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~rubenql], [~michaelmior], the patch is now good enough to be merged.

I adopted my initial solution to support non-inner joins with mixed 
conditions (equi and non-equi conditions): 
introducing an EnumerablePredicativeHashJoin (which I previously called 
EnumerableThetaHashJoin).

EnumerablePredicativeHashJoin and EnumerableHashJoin share the same hash 
join algorithm, but EnumerablePredicativeHashJoin extends Join rather than 
EquiJoin.

I believe this solution does no harm to the current rules, but in the long 
term we had better change EnumerableHashJoin to extend Join.

[~hyuan] created an issue to work on this, see 
https://issues.apache.org/jira/browse/CALCITE-3089.

So I think we can resolve this issue first.

 

 



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-30 Thread Haisheng Yuan (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852684#comment-16852684
 ] 

Haisheng Yuan commented on CALCITE-2973:


Agree, I think this patch can be done after 3089 is merged. Instead of creating 
a new class, I prefer modifying EnumerableHashJoin.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-31 Thread Ruben Quesada Lopez (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852748#comment-16852748
 ] 

Ruben Quesada Lopez commented on CALCITE-2973:
--

Since we are going towards the deprecation of EquiJoin (CALCITE-3089), I think 
it is not worth it to rush and include this ticket (and the new 
EnumerablePredicativeHashJoin class) in 1.20.
Hopefully, CALCITE-3089 could be included in 1.21, which means that the new 
functionality introduced by this patch will be absorbed by EnumerableHashJoin, 
and we would probably have to deprecate EnumerablePredicativeHashJoin.
IMHO it is better to wait for CALCITE-3089 before moving on this one.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-31 Thread Danny Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852772#comment-16852772
 ] 

Danny Chan commented on CALCITE-2973:
-

+1 to skip this patch for 1.20 version.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-05-31 Thread Michael Mior (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852961#comment-16852961
 ] 

Michael Mior commented on CALCITE-2973:
---

For now I've removed this from 1.20.0. If anyone has strong feelings about 
pushing it out in this release, feel free to speak up.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-08-21 Thread Lai Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912868#comment-16912868
 ] 

Lai Zhou commented on CALCITE-2973:
---

[~rubenql], [~hyuan], [~julianhyde], [~danny0405], the PR is ready. Could 
someone help review it?



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-08-21 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912893#comment-16912893
 ] 

Haisheng Yuan commented on CALCITE-2973:


Sure, will do.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-08-22 Thread Ruben Quesada Lopez (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913391#comment-16913391
 ] 

Ruben Quesada Lopez commented on CALCITE-2973:
--

[~hhlai1990], thanks for this PR, I think it is generally in good shape. I 
have one small concern though regarding the new implementation of 
{{EnumerableDefaults#hashJoin_}}.
TL;DR: I think this change may impact the performance of RIGHT / FULL hash equi 
joins.

The addition of the new predicate to support all types of join conditions (and 
not just equi-joins) requires changing the {{Set<TKey> unmatchedKeys}} into a 
{{List<TInner> innersUnmatched}}. If I understand correctly, this is required 
because with this change we may have two inner results with the same TKey 
(which is based on the equi-condition), one being a match and the other not 
being an actual match due to the new extra (non-equi) predicate. This 
{{innersUnmatched}} list is used in RIGHT / FULL joins to keep the results 
from the right input which had no match, in order to finally generate the 
results with null on the left. In the case of INNER / LEFT joins, this unmatched 
List (or Set in the previous version) is not used, so what follows is 
not really relevant for them.

The problem with the new {{List<TInner> innersUnmatched}} for RIGHT / FULL 
joins is that it is pre-filled with ALL the right input (inner) results, 
and the process removes the ones which actually had a match. Previously, 
this removal was implemented via {{HashSet#remove(TKey)}}, which performs 
better than the new {{ArrayList#removeAll(List<TInner>)}}, especially for 
RIGHT / FULL joins on very large tables. This happens even if the 
new predicate is null (i.e. we are dealing with a pure equi-join). In other 
words, with this change we can expect a drop in the 
performance of RIGHT / FULL hash equi joins compared to previous Calcite 
versions, especially on large data volumes.
I have not made any measurements, so I'm not sure whether the performance impact 
will be relevant; maybe I am making a big deal out of nothing.

A possible solution (although I'm not 100% convinced) could be keeping the 
"old" EnumerableDefaults method for hash joins (maybe renamed as 
{{hashEquiJoin_}}) and using it for pure equi joins. The new method {{hashJoin_}} 
in the PR with the extra predicate would implement any type of join condition 
(and redirect to the old one if the predicate is null, i.e. if this is 
actually an equi join). Pros: the same performance for hash equi joins is 
guaranteed. Cons: duplicated code, harder maintenance, etc.
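
The reason key-level tracking stops being sufficient can be shown with a small RIGHT hash join that takes an optional extra predicate. This is a sketch under stated assumptions, not the code in the PR: rows are `int[]{key, value}`, the class and method names are hypothetical, and unmatched inner rows are tracked with a per-row boolean flag, which is just one illustrative choice rather than the list-plus-{{removeAll}} approach described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

/** Illustrative only; this is not EnumerableDefaults and the names are hypothetical. */
final class RightHashJoinSketch {

  /**
   * RIGHT join on equal keys plus an optional extra predicate.
   * Emits "outerValue-innerValue" for matches and "null-innerValue" for unmatched inner rows.
   */
  static List<String> rightJoin(
      List<int[]> outer, List<int[]> inner, BiPredicate<int[], int[]> extra) {
    // Build phase: index the positions of the inner rows by their equi key.
    Map<Integer, List<Integer>> lookup = new HashMap<>();
    for (int i = 0; i < inner.size(); i++) {
      lookup.computeIfAbsent(inner.get(i)[0], k -> new ArrayList<>()).add(i);
    }

    // Matches are tracked per inner ROW, not per key: two inner rows can share a key
    // while only one of them passes the extra predicate.
    boolean[] matched = new boolean[inner.size()];
    List<String> out = new ArrayList<>();

    // Probe phase.
    for (int[] o : outer) {
      List<Integer> candidates = lookup.getOrDefault(o[0], Collections.emptyList());
      for (int i : candidates) {
        int[] in = inner.get(i);
        if (extra == null || extra.test(o, in)) {
          matched[i] = true;
          out.add(o[1] + "-" + in[1]);
        }
      }
    }

    // Emit the inner rows that never matched, padded with null on the left.
    for (int i = 0; i < inner.size(); i++) {
      if (!matched[i]) {
        out.add("null-" + inner.get(i)[1]);
      }
    }
    return out;
  }
}
```

With an outer row `{1, 10}`, inner rows `{1, 5}` and `{1, 20}`, and the predicate `outer.value > inner.value`, the result is `[10-5, null-20]`. A set of unmatched keys would have removed key 1 on the first match and silently lost the `null-20` row. With a null predicate the same input gives `[10-5, 10-20]`, where key-level tracking is enough.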




[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-08-22 Thread Lai Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913882#comment-16913882
 ] 

Lai Zhou commented on CALCITE-2973:
---

I noticed this performance issue before. I'll try to find a better way to 
handle it.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-08-26 Thread Ruben Quesada Lopez (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915599#comment-16915599
 ] 

Ruben Quesada Lopez commented on CALCITE-2973:
--

Thanks for taking care of it [~hhlai1990]. Now that we have two 
implementations, the "old" one for equi joins and the "new" one (with the extra 
predicate) for non-equi joins, equi-join performance should match that of 
previous releases. The PR LGTM.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-08-27 Thread Ruben Quesada Lopez (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916457#comment-16916457
 ] 

Ruben Quesada Lopez commented on CALCITE-2973:
--

[~hyuan], [~danny0405], [~julianhyde], I think we could try to push this into 
1.21, what do you think?



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-08-27 Thread Danny Chan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916465#comment-16916465
 ] 

Danny Chan commented on CALCITE-2973:
-

CALCITE-2302 and CALCITE-1581 are almost ready to be merged. I think it's okay 
to push this patch into release 1.21 if you think it is in good enough shape.



[jira] [Commented] (CALCITE-2973) Allow theta joins that have equi conditions to be executed using a hash join algorithm

2019-08-27 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916597#comment-16916597
 ] 

Stamatis Zampetakis commented on CALCITE-2973:
--

I think one +1 is enough for this PR; go ahead and merge it [~rubenql]!

> Allow theta joins that have equi conditions to be executed using a hash join 
> algorithm
> --
>
> Key: CALCITE-2973
> URL: https://issues.apache.org/jira/browse/CALCITE-2973
> Project: Calcite
> Issue Type: New Feature
> Components: core
> Affects Versions: 1.19.0
> Reporter: Lai Zhou
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.21.0
>
> Time Spent: 8h 20m
> Remaining Estimate: 0h
>
> Currently the EnumerableMergeJoinRule supports only inner equi joins.
> If users run a theta-join query on a large dataset (such as 1*1), 
> the nested-loop join takes dozens of times longer than the sort-merge 
> join.
> So if we can apply a merge-join or hash-join rule to a theta join, it will 
> improve performance greatly.


