Wait... that transformation works for AND but not for OR... On Thu, Sep 10, 2015 at 1:12 PM, Hsuan Yi Chu <hyi...@maprtech.com> wrote:
> Can we cartesian-join all the values in the in list and rewrite it as a > single in list: > > For example, > Say, the original where-clause is > > "a in (1, 2) or b in (3, 4)" > > Can we implement a rule to let calcite treat it as > > "(a, b) in ((1,3),(1,4),(2,3),(2,4))" > > -------------------------------------------------------------- > I understand the exponential growth in cartesian-join. But I am guessing > maybe in some circumstances, this does help, isn't it??? > > On Thu, Sep 10, 2015 at 11:04 AM, Jinfeng Ni <jinfengn...@gmail.com> > wrote: > >> I think Semi-join is not valid in this case, since the original query >> has 5 in-lists ORed together. If Semi-join is used, then the rows >> that does not qualify for the first 1 in-list filter would be pruned out, >> which is not valid, since they may qualify for the second in-list filter. >> >> That's why left outer join is used, if the planner decide to use join >> approach. The problem is that left outer join does not reduce the >> # of rows; essentially, Drill execution has to scan the input rows >> multiple times, making the join operator a bottleneck for the query. >> >> >> >> On Thu, Sep 10, 2015 at 10:40 AM, Hsuan Yi Chu <hyi...@maprtech.com> >> wrote: >> >> > I believe the usage of Semi-Join had been proposed before. >> > >> > Would that new operator help in this scenario you think? >> > >> > On Wed, Sep 9, 2015 at 8:16 PM, Jinfeng Ni <jinfengn...@gmail.com> >> wrote: >> > >> > > The reason that the in-list join approach is not fast enough : >> > > the query has 5 in-lists ORed together. Each in-list is converted >> > > to a left outer join. After the 5 left outer join, there is a filter. >> > > >> > > Since left outer join does not prune any row from left side, >> > > which is the base table in this case, essentially each join has >> > > to scan the same # of rows as the base table, and copy >> > > to the outgoing batch. That is, although the in-list evaluation >> > > is using hash-based probe, which is faster than the original >> > > filter evaluation, still 5 left out join incurs big overhead >> > > in scanning/copying the data. >> > > >> > > The UDF idea in #2 is essentially doing the same kind of hash-based >> > > probe in filter evaluation. The hash-table will be initialized as >> > > a workspace variable in the doSetup(). Then, the doEval() will >> > > simply probe the hash-table. I feel it would achieve the same >> > > benefit of join approach, while avoid the overhead of re-scanning >> > > the data multiple times. >> > > >> > > However, the current infrastructure seems miss the support >> > > of VarArg in Drill's build-in or UDF, which is required to implement >> > > this idea. >> > > >> > > >> > > >> > > On Wed, Sep 9, 2015 at 5:40 PM, Aman Sinha <amansi...@apache.org> >> wrote: >> > > >> > > > Yes, this would be a good enhancement. Any improvement to the >> > > > efficiency/compactness of the generated code is complimentary to >> other >> > > > optimizations such as parquet filter pushdown. I recall that there >> > was a >> > > > JIRA a while ago with hundreds or thousands of filter conditions >> > > creating a >> > > > really bloated generated code - we should revisit that at some >> point >> > to >> > > > identify scope for improvement. >> > > > I am not so sure about the UDF suggestion in #2. It seems like >> > > > identifying why the large IN-list join approach was slow and fixing >> > that >> > > > would be a general solution. >> > > > >> > > > Aman >> > > > >> > > > On Wed, Sep 9, 2015 at 1:31 PM, Jinfeng Ni <jinfengn...@gmail.com> >> > > wrote: >> > > > >> > > > > Weeks ago there was a message on drill user list, reporting >> > performance >> > > > > issues caused by in list filter [1]. The query has filter: >> > > > > >> > > > > WHERE >> > > > > c0 IN (v_00, v_01, v_02, v_03, ... ) >> > > > > OR >> > > > > c1 IN (v_11, v_11, v_12, v_13, ....) >> > > > > OR >> > > > > c2 IN ... >> > > > > OR >> > > > > c3 IN ... >> > > > > OR >> > > > > .... >> > > > > >> > > > > The profile shows that most of query time is spent on filter >> > > evaluation. >> > > > > One workaround that we recommend was to re-write the query so that >> > the >> > > > > planner would convert in list into join operation. Turns out that >> > > > > converting >> > > > > into join did help improve performance, but not as much as we >> wanted. >> > > > > >> > > > > The original query has parquet as the data source. Therefore, the >> > ideal >> > > > > solution is parquet filter pushdown, which DRILL-1950 would >> address. >> > > > > >> > > > > On the other hand, I noticed that there seems to be room for >> > > improvement >> > > > > in the run-time generated code. In particular, for " c0 in (v_00, >> > v_01, >> > > > > ...)", >> > > > > Drill will evaluate it as : >> > > > > c0 = v_00 OR c0 = v_01 OR ... >> > > > > >> > > > > Each reference of "c0" will lead to initialization of vector and >> > holder >> > > > > assignment in the generated code. There is redundant evaluation >> for >> > > > > the common reference. >> > > > > >> > > > > I put together a patch,which will avoid the redundant evaluation >> for >> > > the >> > > > > common reference. Using TPCH scale factor 10's lineitem table, I >> saw >> > > > > quite surprising improvement. (run on Mac with embedded drillbit) >> > > > > >> > > > > 1) In List uses integer type [2] >> > > > > master branch : 12.53 seconds >> > > > > patch on top of master branch : 7.073 seconds >> > > > > That's almost 45% improvement. >> > > > > >> > > > > 2) In List uses binary type [3] >> > > > > master branch : 198.668 seconds >> > > > > patch on top of master branch: 20.37 seconds >> > > > > >> > > > > Two thoughts: >> > > > > 1. Will code size impact Janino compiler optimization or jvm >> hotspot >> > > > > optimization? Otherwise, it seems hard to explain the performance >> > > > > difference of removing the redundant evaluation. That might imply >> > > > > that the efficiency of run-time generated code may degrade with >> > > > > more expressions in the query (?) >> > > > > >> > > > > 2. For In-List filter, it might make sense to create a Drill UDF. >> The >> > > > > UDF will build a heap-based hashtable in setup, in a similar way >> > > > > as what the join approach will do. >> > > > > >> > > > > I'm going to open a JIRA to submit the patch for review, as I >> feel >> > > > > it will benefit not only the in list filter, but also expressions >> > with >> > > > > common column references. >> > > > > >> > > > > >> > > > > [1] >> > > > > >> > > > > >> > > > >> > > >> > >> https://mail-archives.apache.org/mod_mbox/drill-user/201508.mbox/%3CCAC-7oTym0Yzr2RmXhDPag6k41se-uTkWu0QC%3DMABb7s94DJ0BA%40mail.gmail.com%3E >> > > > > >> > > > > [2] https://gist.github.com/jinfengni/7f6df9ed7d2c761fed33 >> > > > > >> > > > > [3] https://gist.github.com/jinfengni/7460f6d250f0d00009ed >> > > > > >> > > > >> > > >> > >> > >