Re: Optimization of join between a small table and a large table
When you implement an equi-join as a nested loops join, the right-hand side always has a filter combining the variable set by the left-hand side and the join column on the right-hand side. You will need to do a full scan of the right-hand table every time, unless your table is organized so that you can access a subset that matches the filter. In your example select emp.name from emp join dept on emp.deptid = dept.id where dept.name = ‘Sales' The simple plan would be Correlate (v := deptno) Filter name = ’Sales' Scan dept Filter deptno = $v Scan emp If “emp” was physically ordered by deptno, the plan could be optimized to Correlate (v := deptno) Filter name = ’Sales' Scan dept RangeScan emp deptno = $v > On Aug 7, 2019, at 11:41 PM, Gabriel Reid wrote: > > Hi Stamatis, > > Thank you so much, this is exactly the kind of info I was looking for! I > think I can figure out how to go forward based on this, thanks again. > > - Gabriel > > On Thu, Aug 8, 2019 at 8:19 AM Stamatis Zampetakis > wrote: > >> Hi Gabriel, >> >> What you want indeed seems to be a nested loop join; there has been a >> relevant discussion in the dev list [1]. >> You may also find relevant the ongoing discussion on CALCITE-2979 [2]. >> >> Best, >> Stamatis >> >> [1] >> >> https://lists.apache.org/thread.html/d9f95683e66009872a53e7e617295158b98746b550d2bf68230b3096@%3Cdev.calcite.apache.org%3E >> >> [2] https://issues.apache.org/jira/browse/CALCITE-2979 >> >> On Wed, Aug 7, 2019 at 2:56 PM Gabriel Reid >> wrote: >> >>> Hi, >>> >>> I'm currently working on a custom Calcite adapter, and I've got a >> situation >>> where I want to join a small table with a large table, basically just >> using >>> the small table to filter the large table. Conceptually, I'm talking >> about >>> something that is equivalent to the following query: >>> >>>select emp.name >>>from emp join dept on emp.deptid = dept.id >>>where dept.name = 'Sales' >>> >>> I've got converter rules which to push down filtering on all tables, >> which >>> make a big difference in performance. The above query current results in >> an >>> EnumerableHashJoin over a filtered scan over the 'dept' table, and an >>> unfiltered scan over the 'emp' table. >>> >>> What I would like to accomplish is that this is converted into a (I >> think) >>> a nested loop join between 'dept' and emp, so that the filtered scan is >>> done once over 'dept', and then a filtered scan is done for each entry of >>> the 'dept' table using the 'id' value from that entry as a filter value >> on >>> the scan on the 'emp' table. >>> >>> I've been able to get a nested loop join to be used, but I haven't >> managed >>> to have the 'id' values from the 'dept' table to be used to filter the >>> 'emp' table. Instead, the full 'emp' table is scanned for each iteration >> of >>> the 'dept' table. >>> >>> And now my questions: are my hopes/expectations here realistic, and/or is >>> there a much better/easier way of accomplishing the same thing? Is there >>> prior art somewhere within the Calcite code base or elsewhere? Or should >>> this just be working by default? >>> >>> I would assume that this isn't that unusual of a situation, which is why >> I >>> was expecting that there would already be something like this somewhere >> (or >>> I'm doing the wrong thing), but I haven't managed to find any clear >>> pointers in any one direction yet. >>> >>> Thanks in advance for any advice! >>> >>> - Gabriel >>> >>
Re: Optimization of join between a small table and a large table
Hi Stamatis, Thank you so much, this is exactly the kind of info I was looking for! I think I can figure out how to go forward based on this, thanks again. - Gabriel On Thu, Aug 8, 2019 at 8:19 AM Stamatis Zampetakis wrote: > Hi Gabriel, > > What you want indeed seems to be a nested loop join; there has been a > relevant discussion in the dev list [1]. > You may also find relevant the ongoing discussion on CALCITE-2979 [2]. > > Best, > Stamatis > > [1] > > https://lists.apache.org/thread.html/d9f95683e66009872a53e7e617295158b98746b550d2bf68230b3096@%3Cdev.calcite.apache.org%3E > > [2] https://issues.apache.org/jira/browse/CALCITE-2979 > > On Wed, Aug 7, 2019 at 2:56 PM Gabriel Reid > wrote: > > > Hi, > > > > I'm currently working on a custom Calcite adapter, and I've got a > situation > > where I want to join a small table with a large table, basically just > using > > the small table to filter the large table. Conceptually, I'm talking > about > > something that is equivalent to the following query: > > > > select emp.name > > from emp join dept on emp.deptid = dept.id > > where dept.name = 'Sales' > > > > I've got converter rules which to push down filtering on all tables, > which > > make a big difference in performance. The above query current results in > an > > EnumerableHashJoin over a filtered scan over the 'dept' table, and an > > unfiltered scan over the 'emp' table. > > > > What I would like to accomplish is that this is converted into a (I > think) > > a nested loop join between 'dept' and emp, so that the filtered scan is > > done once over 'dept', and then a filtered scan is done for each entry of > > the 'dept' table using the 'id' value from that entry as a filter value > on > > the scan on the 'emp' table. > > > > I've been able to get a nested loop join to be used, but I haven't > managed > > to have the 'id' values from the 'dept' table to be used to filter the > > 'emp' table. Instead, the full 'emp' table is scanned for each iteration > of > > the 'dept' table. > > > > And now my questions: are my hopes/expectations here realistic, and/or is > > there a much better/easier way of accomplishing the same thing? Is there > > prior art somewhere within the Calcite code base or elsewhere? Or should > > this just be working by default? > > > > I would assume that this isn't that unusual of a situation, which is why > I > > was expecting that there would already be something like this somewhere > (or > > I'm doing the wrong thing), but I haven't managed to find any clear > > pointers in any one direction yet. > > > > Thanks in advance for any advice! > > > > - Gabriel > > >
Re: Optimization of join between a small table and a large table
Hi Gabriel, What you want indeed seems to be a nested loop join; there has been a relevant discussion in the dev list [1]. You may also find relevant the ongoing discussion on CALCITE-2979 [2]. Best, Stamatis [1] https://lists.apache.org/thread.html/d9f95683e66009872a53e7e617295158b98746b550d2bf68230b3096@%3Cdev.calcite.apache.org%3E [2] https://issues.apache.org/jira/browse/CALCITE-2979 On Wed, Aug 7, 2019 at 2:56 PM Gabriel Reid wrote: > Hi, > > I'm currently working on a custom Calcite adapter, and I've got a situation > where I want to join a small table with a large table, basically just using > the small table to filter the large table. Conceptually, I'm talking about > something that is equivalent to the following query: > > select emp.name > from emp join dept on emp.deptid = dept.id > where dept.name = 'Sales' > > I've got converter rules which to push down filtering on all tables, which > make a big difference in performance. The above query current results in an > EnumerableHashJoin over a filtered scan over the 'dept' table, and an > unfiltered scan over the 'emp' table. > > What I would like to accomplish is that this is converted into a (I think) > a nested loop join between 'dept' and emp, so that the filtered scan is > done once over 'dept', and then a filtered scan is done for each entry of > the 'dept' table using the 'id' value from that entry as a filter value on > the scan on the 'emp' table. > > I've been able to get a nested loop join to be used, but I haven't managed > to have the 'id' values from the 'dept' table to be used to filter the > 'emp' table. Instead, the full 'emp' table is scanned for each iteration of > the 'dept' table. > > And now my questions: are my hopes/expectations here realistic, and/or is > there a much better/easier way of accomplishing the same thing? Is there > prior art somewhere within the Calcite code base or elsewhere? Or should > this just be working by default? > > I would assume that this isn't that unusual of a situation, which is why I > was expecting that there would already be something like this somewhere (or > I'm doing the wrong thing), but I haven't managed to find any clear > pointers in any one direction yet. > > Thanks in advance for any advice! > > - Gabriel >
Optimization of join between a small table and a large table
Hi, I'm currently working on a custom Calcite adapter, and I've got a situation where I want to join a small table with a large table, basically just using the small table to filter the large table. Conceptually, I'm talking about something that is equivalent to the following query: select emp.name from emp join dept on emp.deptid = dept.id where dept.name = 'Sales' I've got converter rules which to push down filtering on all tables, which make a big difference in performance. The above query current results in an EnumerableHashJoin over a filtered scan over the 'dept' table, and an unfiltered scan over the 'emp' table. What I would like to accomplish is that this is converted into a (I think) a nested loop join between 'dept' and emp, so that the filtered scan is done once over 'dept', and then a filtered scan is done for each entry of the 'dept' table using the 'id' value from that entry as a filter value on the scan on the 'emp' table. I've been able to get a nested loop join to be used, but I haven't managed to have the 'id' values from the 'dept' table to be used to filter the 'emp' table. Instead, the full 'emp' table is scanned for each iteration of the 'dept' table. And now my questions: are my hopes/expectations here realistic, and/or is there a much better/easier way of accomplishing the same thing? Is there prior art somewhere within the Calcite code base or elsewhere? Or should this just be working by default? I would assume that this isn't that unusual of a situation, which is why I was expecting that there would already be something like this somewhere (or I'm doing the wrong thing), but I haven't managed to find any clear pointers in any one direction yet. Thanks in advance for any advice! - Gabriel