You are correct Aman. Here is the JIRA issue <https://issues.apache.org/jira/browse/DRILL-5515>
This thread has been very helpful. Thank you all. *---------------------* *Muhammad Gelbana* http://www.linkedin.com/in/mgelbana On Fri, May 12, 2017 at 6:50 AM, Aman Sinha <asi...@mapr.com> wrote: > Muhammad, > The join condition ‘a = b or (a is null && b is null)’ works. > Internally, this is converted to ‘a is not distinct from b’ which is > processed by Drill. > For some reason, if the second form is directly supplied in the user > query, it is not working and ends up with the Cartesian join condition. > Drill leverages Calcite for this (you can see CALCITE-1200 for some > background). > Can you file a JIRA for this ? > > -Aman > > From: "Aman Sinha (asi...@mapr.com)" <asi...@mapr.com> > Date: Thursday, May 11, 2017 at 4:29 PM > To: dev <dev@drill.apache.org>, user <u...@drill.apache.org> > Cc: Shadi Khalifa <khal...@cs.queensu.ca> > Subject: Re: Running cartesian joins on Drill > > > I think Muhammad may be trying to run his original query with IS NOT > DISTINCT FROM. That discussion got side-tracked into Cartesian joins > because his query was not getting planned and the error was about Cartesian > join. > > Muhammad, can you try with the equivalent version below ? You mentioned > the rewrite but did you try the rewritten version ? > > > > SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc > > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON ( > > > > `t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` > IS NULL) ) > > > > On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote: > > > > I’m not sure why it isn’t working for you. Using Drill 1.10, here’s > my output: > > > > 0: jdbc:drill:zk=local> alter session set > `planner.enable_nljoin_for_scalar_only` > = false; > > +-------+-------------------------------------------------+ > > | ok | summary | > > +-------+-------------------------------------------------+ > > | true | planner.enable_nljoin_for_scalar_only updated. | > > +-------+-------------------------------------------------+ > > 1 row selected (0.137 seconds) > > 0: jdbc:drill:zk=local> explain plan for select * from > dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2; > > +------+------+ > > | text | json | > > +------+------+ > > | 00-00 Screen > > 00-01 ProjectAllowDup(*=[$0], *0=[$1]) > > 00-02 NestedLoopJoin(condition=[true], joinType=[inner]) > > 00-04 Project(T2¦¦*=[$0]) > > 00-06 Scan(groupscan=[EasyGroupScan > [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], > files=[file:/Users/zfong/foo.csv]]]) > > 00-03 Project(T3¦¦*=[$0]) > > 00-05 Scan(groupscan=[EasyGroupScan > [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], > files=[file:/Users/zfong/foo.csv]]]) > > > > -- Zelaine > > > > On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m.gelb...@gmail.com> wrote: > > > > But the query I provided failed to be planned because it's a > cartesian > > join, although I've set the option you mentioned to false. Is > there a > > reason why wouldn't Drill rules physically implement the logical > join in my > > query to a nested loop join ? > > > > *---------------------* > > *Muhammad Gelbana* > > http://www.linkedin.com/in/mgelbana > > > > On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> > wrote: > > > > > Provided `planner.enable_nljoin_for_scalar_only` is set to > false, even > > > without an explicit join condition, the query should use the > Cartesian > > > join/nested loop join. > > > > > > -- Zelaine > > > > > > On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiw...@games24x7.com> > wrote: > > > > > > Hi, > > > > > > I have one question here.. so if we have to use Cartesian > join in Drill > > > then do we have to follow some workaround like Shadi mention > : adding a > > > dummy column on the fly that has the value 1 in both tables > and then > > > join > > > on that column leading to having a match of every row of the > first > > > table > > > with every row of the second table, hence do a Cartesian > product? > > > OR > > > If we just don't specify join condition like : > > > select a.*, b.* from tt1 as a, tt2 b; then will it > internally treat > > > this > > > query as Cartesian join. > > > > > > Regards, > > > *Anup Tiwari* > > > > > > On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong < > zf...@mapr.com> wrote: > > > > > > > Cartesian joins in Drill are implemented as nested loop > joins, and I > > > think > > > > you should see that reflected in the resultant query plan > when you > > > run > > > > explain plan on the query. > > > > > > > > Yes, Cartesian joins/nested loop joins are expensive > because you’re > > > > effectively doing an MxN read of your tables. There are > more > > > efficient > > > > ways of processing a nested loop join, e.g., by creating > an index on > > > the > > > > larger table in the join and then using that index to do > lookups > > > into that > > > > table. That way, the nested loop join cost is the cost of > creating > > > the > > > > index + M, where M is the number of rows in the smaller > table and > > > assuming > > > > the lookup cost into the index does minimize the amount of > data read > > > of the > > > > second table. Drill currently doesn’t do this. > > > > > > > > -- Zelaine > > > > > > > > On 5/8/17, 9:09 AM, "Muhammad Gelbana" < > m.gelb...@gmail.com> wrote: > > > > > > > > I believe clhubert is referring to this discussion > > > > <http://drill-user.incubator. > apache.narkive.com/TIXWiTY4/ > > > > cartesian-product-in-apache-drill#post1> > > > > . > > > > > > > > So why Drill doesn't transform this query into a > nested join > > > query ? > > > > Simply > > > > because there is no Calcite rule to transform it into > a nested > > > loop > > > > join ? > > > > Is it not technically possible to write such Rule or > is it > > > feasible so > > > > I > > > > may take on this challenge ? > > > > > > > > Also pardon me for repeating my question but I fail to > find an > > > answer > > > > in > > > > your replies, why doesn't Drill just run a cartesian > join ? > > > Because > > > > it's > > > > expensive regarding resources (i.e. CPU\Network\RAM) ? > > > > > > > > Thanks a lot Shadi for the query, it works for me. > > > > > > > > *---------------------* > > > > *Muhammad Gelbana* > > > > http://www.linkedin.com/in/mgelbana > > > > > > > > On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa < > > > khal...@cs.queensu.ca> > > > > wrote: > > > > > > > > > Hi Muhammad, > > > > > > > > > > I did the following as a workaround to have > Cartesian product. > > > The > > > > basic > > > > > idea is to create a dummy column on the fly that has > the value > > > 1 in > > > > both > > > > > tables and then join on that column leading to > having a match > > > of > > > > every row > > > > > of the first table with every row of the second > table, hence > > > do a > > > > Cartesian > > > > > product. This might not be the most efficient way > but it will > > > do the > > > > job. > > > > > > > > > > *Original Query:* > > > > > SELECT * FROM > > > > > ( SELECT 'ABC' `UserID` FROM > `dfs`.`path_to_parquet_file` tc > > > LIMIT > > > > > 2147483647) `t0` > > > > > INNER JOIN > > > > > ( SELECT 'ABC' `UserID` FROM > `dfs`.`path_to_parquet_file` tc > > > LIMIT > > > > > 2147483647) `t1` > > > > > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`) > > > > > LIMIT 2147483647 > > > > > > > > > > *Workaround (add columns **d1a381f3g73 and > **d1a381f3g74 to > > > tables > > > > one > > > > > and two, respectively. Names don't really matter, > just need to > > > be > > > > unique):* > > > > > SELECT * FROM > > > > > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM > > > > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) > `t0` > > > > > INNER JOIN > > > > > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM > > > > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) > `t1` > > > > > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*) > > > > > WHERE `t0`.`UserID` IS NOT DISTINCT FROM > `t1`.`UserID` > > > > > LIMIT 2147483647 > > > > > > > > > > Regards > > > > > > > > > > > > > > > *Shadi Khalifa, PhD* > > > > > Postdoctoral Fellow > > > > > Cognitive Analytics Development Hub > > > > > Centre for Advanced Computing > > > > > Queen’s University > > > > > (613) 533-6000 x78347 > > > > > http://cac.queensu.ca > > > > > > > > > > I'm just a neuron in the society collective brain > > > > > > > > > > *Join us for HPCS in June 2017! Register at:* * > > > http://2017.hpcs.ca/ > > > > > <http://2017.hpcs.ca/>* > > > > > > > > > > P Please consider your environmental responsibility > before > > > printing > > > > this > > > > > e-mail > > > > > > > > > > *01001001 00100000 01101100 01101111 01110110 > 01100101 00100000 > > > > 01000101 > > > > > 01100111 01111001 01110000 01110100 * > > > > > > > > > > *The information transmitted is intended only for > the person or > > > > entity to > > > > > which it is addressed and may contain confidential > material. > > > Any > > > > review or > > > > > dissemination of this information by persons other > than the > > > intended > > > > > recipient is prohibited. If you received this in > error, please > > > > contact the > > > > > sender and delete the material from any computer. > Thank you.* > > > > > > > > > > > > > > > > > > > > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana < > > > > m.gelb...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > > > > > > Here it is: > > > > > > > > > > SELECT * FROM (SELECT 'ABC' `UserID` FROM > > > > `dfs`.`path_to_parquet_file` tc > > > > > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' > `UserID` FROM > > > > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) > `t1` ON ( > > > > > > > > > > `t0`.`UserID` IS NOT DISTINCT FROM > > > > > > > > > > `t1`.`UserID`) LIMIT 2147483647 > > > > > > > > > > I debugged Drill code and found it decomposes *IS > NOT DISTINCT > > > FROM* > > > > into > > > > > > > > > > *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS > NULL && > > > > `t1`.`UserID` > > > > > IS NULL**)* while checking if the query is a > cartesian join, > > > and > > > > when the > > > > > check returns true, it throws an excetion saying: > *This query > > > cannot > > > > be > > > > > planned possibly due to either a cartesian join or an > > > inequality > > > > join* > > > > > > > > > > > > > > > *---------------------* > > > > > *Muhammad Gelbana* > > > > > http://www.linkedin.com/in/mgelbana > > > > > > > > > > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai < > gpa...@mapr.com> > > > > wrote: > > > > > > > > > > > Can you please specify the query you are trying to > execute? > > > > > > > > > > > > > > > > > > Gautam > > > > > > > > > > > > ________________________________ > > > > > > From: Muhammad Gelbana <m.gelb...@gmail.com> > > > > > > Sent: Saturday, May 6, 2017 7:34:53 AM > > > > > > To: u...@drill.apache.org; dev@drill.apache.org > > > > > > Subject: Running cartesian joins on Drill > > > > > > > > > > > > Is there a reason why Drill would intentionally > reject > > > cartesian > > > > join > > > > > > queries even if > *planner.enable_nljoin_for_scalar_only* > is > > > > disabled ? > > > > > > > > > > > > Any ideas how could a query be rewritten to > overcome this > > > > restriction ? > > > > > > > > > > > > *---------------------* > > > > > > *Muhammad Gelbana* > > > > > > http://www.linkedin.com/in/mgelbana > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >