Re: Running cartesian joins on Drill

Aman Sinha Thu, 11 May 2017 16:29:55 -0700

I think Muhammad may be trying to run his original query with IS NOT DISTINCT 
FROM.   That discussion got side-tracked into Cartesian joins because his query 
was not getting planned and the error was about Cartesian join.


Muhammad,  can you try with the equivalent version below ?  You mentioned the 
rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (



`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS 
NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong" <[email protected]> wrote:



    I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my 
output:



    0: jdbc:drill:zk=local> alter session set 
`planner.enable_nljoin_for_scalar_only` = false;

    +-------+-------------------------------------------------+

    |  ok   |                     summary                     |

    +-------+-------------------------------------------------+

    | true  | planner.enable_nljoin_for_scalar_only updated.  |

    +-------+-------------------------------------------------+

    1 row selected (0.137 seconds)

    0: jdbc:drill:zk=local> explain plan for select * from 
dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

    +------+------+

    | text | json |

    +------+------+

    | 00-00    Screen

    00-01      ProjectAllowDup(*=[$0], *0=[$1])

    00-02        NestedLoopJoin(condition=[true], joinType=[inner])

    00-04          Project(T2¦¦*=[$0])

    00-06            Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])

    00-03          Project(T3¦¦*=[$0])

    00-05            Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])



    -- Zelaine



    On 5/11/17, 3:17 PM, "Muhammad Gelbana" <[email protected]> wrote:



        But the query I provided failed to be planned because it's a cartesian

        join, although I've set the option you mentioned to false. Is there a

        reason why wouldn't Drill rules physically implement the logical join 
in my

        query to a nested loop join ?



        *---------------------*

        *Muhammad Gelbana*

        http://www.linkedin.com/in/mgelbana



        On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <[email protected]> wrote:



        > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

        > without an explicit join condition, the query should use the Cartesian

        > join/nested loop join.

        >

        > -- Zelaine

        >

        > On 5/11/17, 4:20 AM, "Anup Tiwari" <[email protected]> wrote:

        >

        >     Hi,

        >

        >     I have one question here.. so if we have to use Cartesian join in 
Drill

        >     then do we have to follow some workaround like Shadi mention : 
adding a

        >     dummy column on the fly that has the value 1 in both tables and 
then

        > join

        >     on that column leading to having a match of every row of the first

        > table

        >     with every row of the second table, hence do a Cartesian product?

        >     OR

        >     If we just don't specify join condition like :

        >     select a.*, b.* from tt1 as a, tt2 b; then will it internally 
treat

        > this

        >     query as Cartesian join.

        >

        >     Regards,

        >     *Anup Tiwari*

        >

        >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <[email protected]> 
wrote:

        >

        >     > Cartesian joins in Drill are implemented as nested loop joins, 
and I

        > think

        >     > you should see that reflected in the resultant query plan when 
you

        > run

        >     > explain plan on the query.

        >     >

        >     > Yes, Cartesian joins/nested loop joins are expensive because 
you’re

        >     > effectively doing an MxN read of your tables.  There are more

        > efficient

        >     > ways of processing a nested loop join, e.g., by creating an 
index on

        > the

        >     > larger table in the join and then using that index to do lookups

        > into that

        >     > table.  That way, the nested loop join cost is the cost of 
creating

        > the

        >     > index + M, where M is the number of rows in the smaller table 
and

        > assuming

        >     > the lookup cost into the index does minimize the amount of data 
read

        > of the

        >     > second table.  Drill currently doesn’t do this.

        >     >

        >     > -- Zelaine

        >     >

        >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <[email protected]> 
wrote:

        >     >

        >     >     I believe clhubert is referring to this discussion

        >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/

        >     > cartesian-product-in-apache-drill#post1>

        >     >     .

        >     >

        >     >     So why Drill doesn't transform this query into a nested join

        > query ?

        >     > Simply

        >     >     because there is no Calcite rule to transform it into a 
nested

        > loop

        >     > join ?

        >     >     Is it not technically possible to write such Rule or is it

        > feasible so

        >     > I

        >     >     may take on this challenge ?

        >     >

        >     >     Also pardon me for repeating my question but I fail to find 
an

        > answer

        >     > in

        >     >     your replies, why doesn't Drill just run a cartesian join ?

        > Because

        >     > it's

        >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?

        >     >

        >     >     Thanks a lot Shadi for the query, it works for me.

        >     >

        >     >     *---------------------*

        >     >     *Muhammad Gelbana*

        >     >     http://www.linkedin.com/in/mgelbana

        >     >

        >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <

        > [email protected]>

        >     > wrote:

        >     >

        >     >     > Hi Muhammad,

        >     >     >

        >     >     > I did the following as a workaround to have Cartesian 
product.

        > The

        >     > basic

        >     >     > idea is to create a dummy column on the fly that has the 
value

        > 1 in

        >     > both

        >     >     > tables and then join on that column leading to having a 
match

        > of

        >     > every row

        >     >     > of the first table with every row of the second table, 
hence

        > do a

        >     > Cartesian

        >     >     > product. This might not be the most efficient way but it 
will

        > do the

        >     > job.

        >     >     >

        >     >     > *Original Query:*

        >     >     > SELECT * FROM

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` 
tc

        > LIMIT

        >     >     > 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` 
tc

        > LIMIT

        >     >     > 2147483647) `t1`

        >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 
to

        > tables

        >     > one

        >     >     > and two, respectively. Names don't really matter, just 
need to

        > be

        >     > unique):*

        >     >     > SELECT * FROM

        >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`

        >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)

        >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > Regards

        >     >     >

        >     >     >

        >     >     > *Shadi Khalifa, PhD*

        >     >     > Postdoctoral Fellow

        >     >     > Cognitive Analytics Development Hub

        >     >     > Centre for Advanced Computing

        >     >     > Queen’s University

        >     >     > (613) 533-6000 x78347

        >     >     > http://cac.queensu.ca

        >     >     >

        >     >     > I'm just a neuron in the society collective brain

        >     >     >

        >     >     > *Join us for HPCS in June 2017! Register at:*  *

        > http://2017.hpcs.ca/

        >     >     > <http://2017.hpcs.ca/>*

        >     >     >

        >     >     > P Please consider your environmental responsibility before

        > printing

        >     > this

        >     >     > e-mail

        >     >     >

        >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 
00100000

        >     > 01000101

        >     >     > 01100111 01111001 01110000 01110100 *

        >     >     >

        >     >     > *The information transmitted is intended only for the 
person or

        >     > entity to

        >     >     > which it is addressed and may contain confidential 
material.

        > Any

        >     > review or

        >     >     > dissemination of this information by persons other than 
the

        > intended

        >     >     > recipient is prohibited. If you received this in error, 
please

        >     > contact the

        >     >     > sender and delete the material from any computer. Thank 
you.*

        >     >     >

        >     >     >

        >     >     >

        >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <

        >     > [email protected]>

        >     >     > wrote:

        >     >     >

        >     >     >

        >     >     > 

        >     >     > Here it is:

        >     >     >

        >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM

        >     > `dfs`.`path_to_parquet_file` tc

        >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` 
FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON 
(

        >     >     > 

        >     >     > `t0`.`UserID` IS NOT DISTINCT FROM

        >     >     > 

        >     >     > `t1`.`UserID`) LIMIT 2147483647

        >     >     >

        >     >     > I debugged Drill code and found it decomposes *IS NOT 
DISTINCT

        > FROM*

        >     > into

        >     >     > 

        >     >     > *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL 
&&

        >     > `t1`.`UserID`

        >     >     > IS NULL**)* while checking if the query is a cartesian 
join,

        > and

        >     > when the

        >     >     > check returns true, it throws an excetion saying: *This 
query

        > cannot

        >     > be

        >     >     > planned possibly due to either a cartesian join or an

        > inequality

        >     > join*

        >     >     >

        >     >     >

        >     >     > *---------------------*

        >     >     > *Muhammad Gelbana*

        >     >     > http://www.linkedin.com/in/mgelbana

        >     >     >

        >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai 
<[email protected]>

        >     > wrote:

        >     >     >

        >     >     > > Can you please specify the query you are trying to 
execute?

        >     >     > >

        >     >     > >

        >     >     > > Gautam

        >     >     > >

        >     >     > > ________________________________

        >     >     > > From: Muhammad Gelbana <[email protected]>

        >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM

        >     >     > > To: [email protected]; [email protected]

        >     >     > > Subject: Running cartesian joins on Drill

        >     >     > >

        >     >     > > Is there a reason why Drill would intentionally reject

        > cartesian

        >     > join

        >     >     > > queries even if *planner.enable_nljoin_for_scalar_only* 
is

        >     > disabled ?

        >     >     > >

        >     >     > > Any ideas how could a query be rewritten to overcome 
this

        >     > restriction ?

        >     >     > >

        >     >     > > *---------------------*

        >     >     > > *Muhammad Gelbana*

        >     >     > > http://www.linkedin.com/in/mgelbana

        >     >     > >

        >     >     >

        >     >     >

        >     >     >

        >     >

        >     >

        >     >

        >

        >

        >

Re: Running cartesian joins on Drill

Reply via email to