Re: Running cartesian joins on Drill

Aman Sinha Thu, 11 May 2017 21:51:08 -0700

Muhammad,
The join condition  ‘a = b or (a is null && b is null)’ works.  Internally, 
this is converted to  ‘a is not distinct from b’ which is processed by Drill.
For some reason, if the second form is directly supplied in the user query, it 
is not working and ends up with the Cartesian join condition.  Drill leverages 
Calcite for this (you can see CALCITE-1200 for some background).
Can you file a JIRA for this ?


-Aman

From: "Aman Sinha (asi...@mapr.com)" <asi...@mapr.com>
Date: Thursday, May 11, 2017 at 4:29 PM
To: dev <dev@drill.apache.org>, user <u...@drill.apache.org>
Cc: Shadi Khalifa <khal...@cs.queensu.ca>
Subject: Re: Running cartesian joins on Drill


I think Muhammad may be trying to run his original query with IS NOT DISTINCT 
FROM.   That discussion got side-tracked into Cartesian joins because his query 
was not getting planned and the error was about Cartesian join.

Muhammad,  can you try with the equivalent version below ?  You mentioned the 
rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (



`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS 
NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote:



    I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my 
output:



    0: jdbc:drill:zk=local> alter session set 
`planner.enable_nljoin_for_scalar_only` = false;

    +-------+-------------------------------------------------+

    |  ok   |                     summary                     |

    +-------+-------------------------------------------------+

    | true  | planner.enable_nljoin_for_scalar_only updated.  |

    +-------+-------------------------------------------------+

    1 row selected (0.137 seconds)

    0: jdbc:drill:zk=local> explain plan for select * from 
dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

    +------+------+

    | text | json |

    +------+------+

    | 00-00    Screen

    00-01      ProjectAllowDup(*=[$0], *0=[$1])

    00-02        NestedLoopJoin(condition=[true], joinType=[inner])

    00-04          Project(T2¦¦*=[$0])

    00-06            Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])

    00-03          Project(T3¦¦*=[$0])

    00-05            Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])



    -- Zelaine



    On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m.gelb...@gmail.com> wrote:



        But the query I provided failed to be planned because it's a cartesian

        join, although I've set the option you mentioned to false. Is there a

        reason why wouldn't Drill rules physically implement the logical join 
in my

        query to a nested loop join ?



        *---------------------*

        *Muhammad Gelbana*

        http://www.linkedin.com/in/mgelbana



        On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:



        > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

        > without an explicit join condition, the query should use the Cartesian

        > join/nested loop join.

        >

        > -- Zelaine

        >

        > On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiw...@games24x7.com> wrote:

        >

        >     Hi,

        >

        >     I have one question here.. so if we have to use Cartesian join in 
Drill

        >     then do we have to follow some workaround like Shadi mention : 
adding a

        >     dummy column on the fly that has the value 1 in both tables and 
then

        > join

        >     on that column leading to having a match of every row of the first

        > table

        >     with every row of the second table, hence do a Cartesian product?

        >     OR

        >     If we just don't specify join condition like :

        >     select a.*, b.* from tt1 as a, tt2 b; then will it internally 
treat

        > this

        >     query as Cartesian join.

        >

        >     Regards,

        >     *Anup Tiwari*

        >

        >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> 
wrote:

        >

        >     > Cartesian joins in Drill are implemented as nested loop joins, 
and I

        > think

        >     > you should see that reflected in the resultant query plan when 
you

        > run

        >     > explain plan on the query.

        >     >

        >     > Yes, Cartesian joins/nested loop joins are expensive because 
you’re

        >     > effectively doing an MxN read of your tables.  There are more

        > efficient

        >     > ways of processing a nested loop join, e.g., by creating an 
index on

        > the

        >     > larger table in the join and then using that index to do lookups

        > into that

        >     > table.  That way, the nested loop join cost is the cost of 
creating

        > the

        >     > index + M, where M is the number of rows in the smaller table 
and

        > assuming

        >     > the lookup cost into the index does minimize the amount of data 
read

        > of the

        >     > second table.  Drill currently doesn’t do this.

        >     >

        >     > -- Zelaine

        >     >

        >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m.gelb...@gmail.com> 
wrote:

        >     >

        >     >     I believe clhubert is referring to this discussion

        >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/

        >     > cartesian-product-in-apache-drill#post1>

        >     >     .

        >     >

        >     >     So why Drill doesn't transform this query into a nested join

        > query ?

        >     > Simply

        >     >     because there is no Calcite rule to transform it into a 
nested

        > loop

        >     > join ?

        >     >     Is it not technically possible to write such Rule or is it

        > feasible so

        >     > I

        >     >     may take on this challenge ?

        >     >

        >     >     Also pardon me for repeating my question but I fail to find 
an

        > answer

        >     > in

        >     >     your replies, why doesn't Drill just run a cartesian join ?

        > Because

        >     > it's

        >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?

        >     >

        >     >     Thanks a lot Shadi for the query, it works for me.

        >     >

        >     >     *---------------------*

        >     >     *Muhammad Gelbana*

        >     >     http://www.linkedin.com/in/mgelbana

        >     >

        >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <

        > khal...@cs.queensu.ca>

        >     > wrote:

        >     >

        >     >     > Hi Muhammad,

        >     >     >

        >     >     > I did the following as a workaround to have Cartesian 
product.

        > The

        >     > basic

        >     >     > idea is to create a dummy column on the fly that has the 
value

        > 1 in

        >     > both

        >     >     > tables and then join on that column leading to having a 
match

        > of

        >     > every row

        >     >     > of the first table with every row of the second table, 
hence

        > do a

        >     > Cartesian

        >     >     > product. This might not be the most efficient way but it 
will

        > do the

        >     > job.

        >     >     >

        >     >     > *Original Query:*

        >     >     > SELECT * FROM

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` 
tc

        > LIMIT

        >     >     > 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` 
tc

        > LIMIT

        >     >     > 2147483647) `t1`

        >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 
to

        > tables

        >     > one

        >     >     > and two, respectively. Names don't really matter, just 
need to

        > be

        >     > unique):*

        >     >     > SELECT * FROM

        >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`

        >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)

        >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > Regards

        >     >     >

        >     >     >

        >     >     > *Shadi Khalifa, PhD*

        >     >     > Postdoctoral Fellow

        >     >     > Cognitive Analytics Development Hub

        >     >     > Centre for Advanced Computing

        >     >     > Queen’s University

        >     >     > (613) 533-6000 x78347

        >     >     > http://cac.queensu.ca

        >     >     >

        >     >     > I'm just a neuron in the society collective brain

        >     >     >

        >     >     > *Join us for HPCS in June 2017! Register at:*  *

        > http://2017.hpcs.ca/

        >     >     > <http://2017.hpcs.ca/>*

        >     >     >

        >     >     > P Please consider your environmental responsibility before

        > printing

        >     > this

        >     >     > e-mail

        >     >     >

        >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 
00100000

        >     > 01000101

        >     >     > 01100111 01111001 01110000 01110100 *

        >     >     >

        >     >     > *The information transmitted is intended only for the 
person or

        >     > entity to

        >     >     > which it is addressed and may contain confidential 
material.

        > Any

        >     > review or

        >     >     > dissemination of this information by persons other than 
the

        > intended

        >     >     > recipient is prohibited. If you received this in error, 
please

        >     > contact the

        >     >     > sender and delete the material from any computer. Thank 
you.*

        >     >     >

        >     >     >

        >     >     >

        >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <

        >     > m.gelb...@gmail.com>

        >     >     > wrote:

        >     >     >

        >     >     >

        >     >     > 

        >     >     > Here it is:

        >     >     >

        >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM

        >     > `dfs`.`path_to_parquet_file` tc

        >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` 
FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON 
(

        >     >     > 

        >     >     > `t0`.`UserID` IS NOT DISTINCT FROM

        >     >     > 

        >     >     > `t1`.`UserID`) LIMIT 2147483647

        >     >     >

        >     >     > I debugged Drill code and found it decomposes *IS NOT 
DISTINCT

        > FROM*

        >     > into

        >     >     > 

        >     >     > *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL 
&&

        >     > `t1`.`UserID`

        >     >     > IS NULL**)* while checking if the query is a cartesian 
join,

        > and

        >     > when the

        >     >     > check returns true, it throws an excetion saying: *This 
query

        > cannot

        >     > be

        >     >     > planned possibly due to either a cartesian join or an

        > inequality

        >     > join*

        >     >     >

        >     >     >

        >     >     > *---------------------*

        >     >     > *Muhammad Gelbana*

        >     >     > http://www.linkedin.com/in/mgelbana

        >     >     >

        >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai 
<gpa...@mapr.com>

        >     > wrote:

        >     >     >

        >     >     > > Can you please specify the query you are trying to 
execute?

        >     >     > >

        >     >     > >

        >     >     > > Gautam

        >     >     > >

        >     >     > > ________________________________

        >     >     > > From: Muhammad Gelbana <m.gelb...@gmail.com>

        >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM

        >     >     > > To: u...@drill.apache.org; dev@drill.apache.org

        >     >     > > Subject: Running cartesian joins on Drill

        >     >     > >

        >     >     > > Is there a reason why Drill would intentionally reject

        > cartesian

        >     > join

        >     >     > > queries even if *planner.enable_nljoin_for_scalar_only* 
is

        >     > disabled ?

        >     >     > >

        >     >     > > Any ideas how could a query be rewritten to overcome 
this

        >     > restriction ?

        >     >     > >

        >     >     > > *---------------------*

        >     >     > > *Muhammad Gelbana*

        >     >     > > http://www.linkedin.com/in/mgelbana

        >     >     > >

        >     >     >

        >     >     >

        >     >     >

        >     >

        >     >

        >     >

        >

        >

        >

Re: Running cartesian joins on Drill

Reply via email to