I think Muhammad may be trying to run his original query with IS NOT DISTINCT FROM. That discussion got side-tracked into Cartesian joins because his query was not getting planned and the error was about Cartesian join.
Muhammad, can you try with the equivalent version below ? You mentioned the rewrite but did you try the rewritten version ? SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON ( `t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL) ) On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote: I’m not sure why it isn’t working for you. Using Drill 1.10, here’s my output: 0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false; +-------+-------------------------------------------------+ | ok | summary | +-------+-------------------------------------------------+ | true | planner.enable_nljoin_for_scalar_only updated. | +-------+-------------------------------------------------+ 1 row selected (0.137 seconds) 0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2; +------+------+ | text | json | +------+------+ | 00-00 Screen 00-01 ProjectAllowDup(*=[$0], *0=[$1]) 00-02 NestedLoopJoin(condition=[true], joinType=[inner]) 00-04 Project(T2¦¦*=[$0]) 00-06 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]]) 00-03 Project(T3¦¦*=[$0]) 00-05 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]]) -- Zelaine On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m.gelb...@gmail.com> wrote: But the query I provided failed to be planned because it's a cartesian join, although I've set the option you mentioned to false. Is there a reason why wouldn't Drill rules physically implement the logical join in my query to a nested loop join ? *---------------------* *Muhammad Gelbana* http://www.linkedin.com/in/mgelbana On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote: > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even > without an explicit join condition, the query should use the Cartesian > join/nested loop join. > > -- Zelaine > > On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiw...@games24x7.com> wrote: > > Hi, > > I have one question here.. so if we have to use Cartesian join in Drill > then do we have to follow some workaround like Shadi mention : adding a > dummy column on the fly that has the value 1 in both tables and then > join > on that column leading to having a match of every row of the first > table > with every row of the second table, hence do a Cartesian product? > OR > If we just don't specify join condition like : > select a.*, b.* from tt1 as a, tt2 b; then will it internally treat > this > query as Cartesian join. > > Regards, > *Anup Tiwari* > > On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote: > > > Cartesian joins in Drill are implemented as nested loop joins, and I > think > > you should see that reflected in the resultant query plan when you > run > > explain plan on the query. > > > > Yes, Cartesian joins/nested loop joins are expensive because you’re > > effectively doing an MxN read of your tables. There are more > efficient > > ways of processing a nested loop join, e.g., by creating an index on > the > > larger table in the join and then using that index to do lookups > into that > > table. That way, the nested loop join cost is the cost of creating > the > > index + M, where M is the number of rows in the smaller table and > assuming > > the lookup cost into the index does minimize the amount of data read > of the > > second table. Drill currently doesn’t do this. > > > > -- Zelaine > > > > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m.gelb...@gmail.com> wrote: > > > > I believe clhubert is referring to this discussion > > <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/ > > cartesian-product-in-apache-drill#post1> > > . > > > > So why Drill doesn't transform this query into a nested join > query ? > > Simply > > because there is no Calcite rule to transform it into a nested > loop > > join ? > > Is it not technically possible to write such Rule or is it > feasible so > > I > > may take on this challenge ? > > > > Also pardon me for repeating my question but I fail to find an > answer > > in > > your replies, why doesn't Drill just run a cartesian join ? > Because > > it's > > expensive regarding resources (i.e. CPU\Network\RAM) ? > > > > Thanks a lot Shadi for the query, it works for me. > > > > *---------------------* > > *Muhammad Gelbana* > > http://www.linkedin.com/in/mgelbana > > > > On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa < > khal...@cs.queensu.ca> > > wrote: > > > > > Hi Muhammad, > > > > > > I did the following as a workaround to have Cartesian product. > The > > basic > > > idea is to create a dummy column on the fly that has the value > 1 in > > both > > > tables and then join on that column leading to having a match > of > > every row > > > of the first table with every row of the second table, hence > do a > > Cartesian > > > product. This might not be the most efficient way but it will > do the > > job. > > > > > > *Original Query:* > > > SELECT * FROM > > > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc > LIMIT > > > 2147483647) `t0` > > > INNER JOIN > > > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc > LIMIT > > > 2147483647) `t1` > > > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`) > > > LIMIT 2147483647 > > > > > > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to > tables > > one > > > and two, respectively. Names don't really matter, just need to > be > > unique):* > > > SELECT * FROM > > > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM > > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0` > > > INNER JOIN > > > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM > > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` > > > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*) > > > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID` > > > LIMIT 2147483647 > > > > > > Regards > > > > > > > > > *Shadi Khalifa, PhD* > > > Postdoctoral Fellow > > > Cognitive Analytics Development Hub > > > Centre for Advanced Computing > > > Queen’s University > > > (613) 533-6000 x78347 > > > http://cac.queensu.ca > > > > > > I'm just a neuron in the society collective brain > > > > > > *Join us for HPCS in June 2017! Register at:* * > http://2017.hpcs.ca/ > > > <http://2017.hpcs.ca/>* > > > > > > P Please consider your environmental responsibility before > printing > > this > > > e-mail > > > > > > *01001001 00100000 01101100 01101111 01110110 01100101 00100000 > > 01000101 > > > 01100111 01111001 01110000 01110100 * > > > > > > *The information transmitted is intended only for the person or > > entity to > > > which it is addressed and may contain confidential material. > Any > > review or > > > dissemination of this information by persons other than the > intended > > > recipient is prohibited. If you received this in error, please > > contact the > > > sender and delete the material from any computer. Thank you.* > > > > > > > > > > > > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana < > > m.gelb...@gmail.com> > > > wrote: > > > > > > > > > > > > Here it is: > > > > > > SELECT * FROM (SELECT 'ABC' `UserID` FROM > > `dfs`.`path_to_parquet_file` tc > > > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM > > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON ( > > > > > > `t0`.`UserID` IS NOT DISTINCT FROM > > > > > > `t1`.`UserID`) LIMIT 2147483647 > > > > > > I debugged Drill code and found it decomposes *IS NOT DISTINCT > FROM* > > into > > > > > > *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && > > `t1`.`UserID` > > > IS NULL**)* while checking if the query is a cartesian join, > and > > when the > > > check returns true, it throws an excetion saying: *This query > cannot > > be > > > planned possibly due to either a cartesian join or an > inequality > > join* > > > > > > > > > *---------------------* > > > *Muhammad Gelbana* > > > http://www.linkedin.com/in/mgelbana > > > > > > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gpa...@mapr.com> > > wrote: > > > > > > > Can you please specify the query you are trying to execute? > > > > > > > > > > > > Gautam > > > > > > > > ________________________________ > > > > From: Muhammad Gelbana <m.gelb...@gmail.com> > > > > Sent: Saturday, May 6, 2017 7:34:53 AM > > > > To: u...@drill.apache.org; dev@drill.apache.org > > > > Subject: Running cartesian joins on Drill > > > > > > > > Is there a reason why Drill would intentionally reject > cartesian > > join > > > > queries even if *planner.enable_nljoin_for_scalar_only* is > > disabled ? > > > > > > > > Any ideas how could a query be rewritten to overcome this > > restriction ? > > > > > > > > *---------------------* > > > > *Muhammad Gelbana* > > > > http://www.linkedin.com/in/mgelbana > > > > > > > > > > > > > > > > > > > > > >