Re: Optimization of join between a small table and a large table

Julian Hyde Thu, 08 Aug 2019 13:27:31 -0700

When you implement an equi-join as a nested loops join, the right-hand side 
always has a filter combining the variable set by the left-hand side and the 
join column on the right-hand side.


You will need to do a full scan of the right-hand table every time, unless your 
table is organized so that you can access a subset that matches the filter.

In your example

 select emp.name
   from emp join dept on emp.deptid = dept.id
       where dept.name = ‘Sales'

The simple plan would be

 Correlate (v := deptno)
   Filter name = ’Sales'
     Scan dept
   Filter deptno = $v
     Scan emp

If “emp” was physically ordered by deptno, the plan could be optimized to

 Correlate (v := deptno)
   Filter name = ’Sales'
     Scan dept
   RangeScan emp deptno = $v



> On Aug 7, 2019, at 11:41 PM, Gabriel Reid <gabriel.r...@gmail.com> wrote:
> 
> Hi Stamatis,
> 
> Thank you so much, this is exactly the kind of info I was looking for! I
> think I can figure out how to go forward based on this, thanks again.
> 
> - Gabriel
> 
> On Thu, Aug 8, 2019 at 8:19 AM Stamatis Zampetakis <zabe...@gmail.com>
> wrote:
> 
>> Hi Gabriel,
>> 
>> What you want indeed seems to be a nested loop join; there has been a
>> relevant discussion in the dev list [1].
>> You may also find relevant the ongoing discussion on CALCITE-2979 [2].
>> 
>> Best,
>> Stamatis
>> 
>> [1]
>> 
>> https://lists.apache.org/thread.html/d9f95683e66009872a53e7e617295158b98746b550d2bf68230b3096@%3Cdev.calcite.apache.org%3E
>> 
>> [2] https://issues.apache.org/jira/browse/CALCITE-2979
>> 
>> On Wed, Aug 7, 2019 at 2:56 PM Gabriel Reid <gabriel.r...@gmail.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> I'm currently working on a custom Calcite adapter, and I've got a
>> situation
>>> where I want to join a small table with a large table, basically just
>> using
>>> the small table to filter the large table. Conceptually, I'm talking
>> about
>>> something that is equivalent to the following query:
>>> 
>>>    select emp.name
>>>    from emp join dept on emp.deptid = dept.id
>>>        where dept.name = 'Sales'
>>> 
>>> I've got converter rules which to push down filtering on all tables,
>> which
>>> make a big difference in performance. The above query current results in
>> an
>>> EnumerableHashJoin over a filtered scan over the 'dept' table, and an
>>> unfiltered scan over the 'emp' table.
>>> 
>>> What I would like to accomplish is that this is converted into a (I
>> think)
>>> a nested loop join between 'dept' and emp, so that the filtered scan is
>>> done once over 'dept', and then a filtered scan is done for each entry of
>>> the 'dept' table using the 'id' value from that entry as a filter value
>> on
>>> the scan on the 'emp' table.
>>> 
>>> I've been able to get a nested loop join to be used, but I haven't
>> managed
>>> to have the 'id' values from the 'dept' table to be used to filter the
>>> 'emp' table. Instead, the full 'emp' table is scanned for each iteration
>> of
>>> the 'dept' table.
>>> 
>>> And now my questions: are my hopes/expectations here realistic, and/or is
>>> there a much better/easier way of accomplishing the same thing? Is there
>>> prior art somewhere within the Calcite code base or elsewhere? Or should
>>> this just be working by default?
>>> 
>>> I would assume that this isn't that unusual of a situation, which is why
>> I
>>> was expecting that there would already be something like this somewhere
>> (or
>>> I'm doing the wrong thing), but I haven't managed to find any clear
>>> pointers in any one direction yet.
>>> 
>>> Thanks in advance for any advice!
>>> 
>>> - Gabriel
>>> 
>>

Re: Optimization of join between a small table and a large table

Reply via email to