[ 
https://issues.apache.org/jira/browse/CALCITE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839358#comment-16839358
 ] 

Ruben Quesada Lopez edited comment on CALCITE-2979 at 5/14/19 12:25 PM:
------------------------------------------------------------------------

[~zabetak], I have a general question regarding the design of this new 
block-based nested loop join algorithm. Let us say that the plan that you just 
wrote in your last comment implements an inner join:
{code}
Join(A.id > B.id; type=INNER)
  Scan(A)
  Scan(B)
==>
NestedLoop(blockSize=3)
  Scan(A)
  ElasticScan(table=B, query="OR(>(cor0[0],B.id), >(cor0[1],B.id), 
>(cor0[2],B.id)")
{code}

The result of the NestedLoop operator must contain a tupleA concatenated with a 
tupleB that match. If we have a block of correlated variables, and we translate 
the join condition as a filter with an OR condition, we would get the 
appropriate tupleB, but how does the NestedLoop know with which tuple(s) from A 
it must be concatenated? We know for sure that the tupleB match with cor0[0] or 
cor0[1] or cor0[2], but which one(s) exactly? Maybe I'm missing something, but 
I think this information will not be available from NestedLoop perspective.

The only scenario that I can think of where this strategy could work is a 
correlated SEMI (or ANTI) join, because only the left part is returned after 
the first match (or no match at all) with the right part, and we do not care 
which exact tuple from B matched. But in order to implement such strategy, we 
would need to inverse the correlated variable logic, and set it from right to 
left (instead of left to right):
{code}
-- fetch the departments with at least one employee
Join (dept.id=emp.deptId, type=SEMI)
  Scan (dept)
  Scan (emp)
==>
NestedLoop(blockSize=3, $cor0=emp)
  Filter( OR( =(dept.id, cor0[0].deptId), =(dept.id, cor0[1].deptId), 
=(dept.id, cor0[2].deptId) )
    Scan(dept)
  Scan(emp)
==>
NestedLoop(blockSize=3, $cor0=emp)
  ElasticScan(table=dept, query="OR( =(dept.id, cor0[0].deptId), =(dept.id, 
cor0[1].deptId), =(dept.id, cor0[2].deptId)" )
  Scan(emp)
{code}


was (Author: rubenql):
[~zabetak], I have a general question regarding the design of this new 
block-based nested loop join algorithm. Let use say that the plan that you just 
wrote in your last comment implements an inner join:
{code}
Join(A.id > B.id; type=INNER)
  Scan(A)
  Scan(B)
==>
NestedLoop(blockSize=3)
  Scan(A)
  ElasticScan(table=B, query="OR(>(cor0[0],B.id), >(cor0[1],B.id), 
>(cor0[2],B.id)")
{code}

The result of the NestedLoop operator must contain a tupleA concatenated with a 
tupleB that match. If we have a block of correlated variables, and we translate 
the join condition as a filter with an OR conditions, we would get the 
appropriate tupleB, but how does the NestedLoop know with which tuple(s) from A 
it must be concatenated. We know for sure that the tupleB match with cor0[0] or 
cor0[1] or cor0[2], but which one(s) exactly? Maybe I'm missing something, but 
this information will not be available from NestedLoop perspective.

The only scenario that I can think of where this strategy could work is a 
correlated SEMI (or ANTI) join, because only the left part is returned after 
the first match (or no match at all) with the right part, and we do not care 
which exact tuple from B matched. But in order to implement such strategy, we 
would to inverse the correlated variable logic, and set it from right to left 
(instead of left to right):
{code}
-- fetch the depts with at least one emp
Join (dept.id=emp.dptoId, type=SEMI)
  Scan (dept)
  Scan (emp)
==>
NestedLoop(blockSize=3, $cor0=emp)
  Filter( OR( =(cor0[0].dptId, dept.id), =(cor0[1].dptId, dept.id), 
=(cor0[2].dptId, dept.id) )
    Scan(dept)
  Scan(emp)
==>
NestedLoop(blockSize=3, $cor0=emp)
  ElasticScan(table=dept, query="OR( =(cor0[0].dptId, dept.id), 
=(cor0[1].dptId, dept.id), =(cor0[2].dptId, dept.id)" )
  Scan(emp)
{code}

> Add a block-based nested loop join algorithm
> --------------------------------------------
>
>                 Key: CALCITE-2979
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2979
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.19.0
>            Reporter: Stamatis Zampetakis
>            Assignee: Khawla Mouhoubi
>            Priority: Major
>              Labels: performance
>
> Currently, Calcite provides a tuple-based nested loop join algorithm 
> implemented through EnumerableCorrelate and EnumerableDefaults.correlateJoin. 
> This means that for each tuple of the outer relation we probe (set variables) 
> in the inner relation.
> The goal of this issue is to add new algorithm (or extend the correlateJoin 
> method) which first gathers blocks (batches) of tuples from the outer 
> relation and then probes the inner relation once per block.
> There are cases (eg., indexes) where the inner relation can be accessed by 
> more than one value which can greatly improve the performance in particular 
> when the outer relation is big.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to