Re: Running cartesian joins on Drill

Muhammad Gelbana Tue, 16 May 2017 02:38:08 -0700

You are correct Aman. Here is the JIRA issue
<https://issues.apache.org/jira/browse/DRILL-5515>


This thread has been very helpful. Thank you all.

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Fri, May 12, 2017 at 6:50 AM, Aman Sinha <asi...@mapr.com> wrote:

> Muhammad,
> The join condition  ‘a = b or (a is null && b is null)’ works.
> Internally, this is converted to  ‘a is not distinct from b’ which is
> processed by Drill.
> For some reason, if the second form is directly supplied in the user
> query, it is not working and ends up with the Cartesian join condition.
> Drill leverages Calcite for this (you can see CALCITE-1200 for some
> background).
> Can you file a JIRA for this ?
>
> -Aman
>
> From: "Aman Sinha (asi...@mapr.com)" <asi...@mapr.com>
> Date: Thursday, May 11, 2017 at 4:29 PM
> To: dev <dev@drill.apache.org>, user <u...@drill.apache.org>
> Cc: Shadi Khalifa <khal...@cs.queensu.ca>
> Subject: Re: Running cartesian joins on Drill
>
>
> I think Muhammad may be trying to run his original query with IS NOT
> DISTINCT FROM.   That discussion got side-tracked into Cartesian joins
> because his query was not getting planned and the error was about Cartesian
> join.
>
> Muhammad,  can you try with the equivalent version below ?  You mentioned
> the rewrite but did you try the rewritten version ?
>
>
>
> SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
>
> LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
>
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
>
> 
>
> `t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
> IS NULL) )
>
>
>
> On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote:
>
>
>
>     I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s
> my output:
>
>
>
>     0: jdbc:drill:zk=local> alter session set 
> `planner.enable_nljoin_for_scalar_only`
> = false;
>
>     +-------+-------------------------------------------------+
>
>     |  ok   |                     summary                     |
>
>     +-------+-------------------------------------------------+
>
>     | true  | planner.enable_nljoin_for_scalar_only updated.  |
>
>     +-------+-------------------------------------------------+
>
>     1 row selected (0.137 seconds)
>
>     0: jdbc:drill:zk=local> explain plan for select * from
> dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
>
>     +------+------+
>
>     | text | json |
>
>     +------+------+
>
>     | 00-00    Screen
>
>     00-01      ProjectAllowDup(*=[$0], *0=[$1])
>
>     00-02        NestedLoopJoin(condition=[true], joinType=[inner])
>
>     00-04          Project(T2¦¦*=[$0])
>
>     00-06            Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
> files=[file:/Users/zfong/foo.csv]]])
>
>     00-03          Project(T3¦¦*=[$0])
>
>     00-05            Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
> files=[file:/Users/zfong/foo.csv]]])
>
>
>
>     -- Zelaine
>
>
>
>     On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m.gelb...@gmail.com> wrote:
>
>
>
>         But the query I provided failed to be planned because it's a
> cartesian
>
>         join, although I've set the option you mentioned to false. Is
> there a
>
>         reason why wouldn't Drill rules physically implement the logical
> join in my
>
>         query to a nested loop join ?
>
>
>
>         *---------------------*
>
>         *Muhammad Gelbana*
>
>         http://www.linkedin.com/in/mgelbana
>
>
>
>         On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com>
> wrote:
>
>
>
>         > Provided `planner.enable_nljoin_for_scalar_only` is set to
> false, even
>
>         > without an explicit join condition, the query should use the
> Cartesian
>
>         > join/nested loop join.
>
>         >
>
>         > -- Zelaine
>
>         >
>
>         > On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiw...@games24x7.com>
> wrote:
>
>         >
>
>         >     Hi,
>
>         >
>
>         >     I have one question here.. so if we have to use Cartesian
> join in Drill
>
>         >     then do we have to follow some workaround like Shadi mention
> : adding a
>
>         >     dummy column on the fly that has the value 1 in both tables
> and then
>
>         > join
>
>         >     on that column leading to having a match of every row of the
> first
>
>         > table
>
>         >     with every row of the second table, hence do a Cartesian
> product?
>
>         >     OR
>
>         >     If we just don't specify join condition like :
>
>         >     select a.*, b.* from tt1 as a, tt2 b; then will it
> internally treat
>
>         > this
>
>         >     query as Cartesian join.
>
>         >
>
>         >     Regards,
>
>         >     *Anup Tiwari*
>
>         >
>
>         >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <
> zf...@mapr.com> wrote:
>
>         >
>
>         >     > Cartesian joins in Drill are implemented as nested loop
> joins, and I
>
>         > think
>
>         >     > you should see that reflected in the resultant query plan
> when you
>
>         > run
>
>         >     > explain plan on the query.
>
>         >     >
>
>         >     > Yes, Cartesian joins/nested loop joins are expensive
> because you’re
>
>         >     > effectively doing an MxN read of your tables.  There are
> more
>
>         > efficient
>
>         >     > ways of processing a nested loop join, e.g., by creating
> an index on
>
>         > the
>
>         >     > larger table in the join and then using that index to do
> lookups
>
>         > into that
>
>         >     > table.  That way, the nested loop join cost is the cost of
> creating
>
>         > the
>
>         >     > index + M, where M is the number of rows in the smaller
> table and
>
>         > assuming
>
>         >     > the lookup cost into the index does minimize the amount of
> data read
>
>         > of the
>
>         >     > second table.  Drill currently doesn’t do this.
>
>         >     >
>
>         >     > -- Zelaine
>
>         >     >
>
>         >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <
> m.gelb...@gmail.com> wrote:
>
>         >     >
>
>         >     >     I believe clhubert is referring to this discussion
>
>         >     >     <http://drill-user.incubator.
> apache.narkive.com/TIXWiTY4/
>
>         >     > cartesian-product-in-apache-drill#post1>
>
>         >     >     .
>
>         >     >
>
>         >     >     So why Drill doesn't transform this query into a
> nested join
>
>         > query ?
>
>         >     > Simply
>
>         >     >     because there is no Calcite rule to transform it into
> a nested
>
>         > loop
>
>         >     > join ?
>
>         >     >     Is it not technically possible to write such Rule or
> is it
>
>         > feasible so
>
>         >     > I
>
>         >     >     may take on this challenge ?
>
>         >     >
>
>         >     >     Also pardon me for repeating my question but I fail to
> find an
>
>         > answer
>
>         >     > in
>
>         >     >     your replies, why doesn't Drill just run a cartesian
> join ?
>
>         > Because
>
>         >     > it's
>
>         >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?
>
>         >     >
>
>         >     >     Thanks a lot Shadi for the query, it works for me.
>
>         >     >
>
>         >     >     *---------------------*
>
>         >     >     *Muhammad Gelbana*
>
>         >     >     http://www.linkedin.com/in/mgelbana
>
>         >     >
>
>         >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
>
>         > khal...@cs.queensu.ca>
>
>         >     > wrote:
>
>         >     >
>
>         >     >     > Hi Muhammad,
>
>         >     >     >
>
>         >     >     > I did the following as a workaround to have
> Cartesian product.
>
>         > The
>
>         >     > basic
>
>         >     >     > idea is to create a dummy column on the fly that has
> the value
>
>         > 1 in
>
>         >     > both
>
>         >     >     > tables and then join on that column leading to
> having a match
>
>         > of
>
>         >     > every row
>
>         >     >     > of the first table with every row of the second
> table, hence
>
>         > do a
>
>         >     > Cartesian
>
>         >     >     > product. This might not be the most efficient way
> but it will
>
>         > do the
>
>         >     > job.
>
>         >     >     >
>
>         >     >     > *Original Query:*
>
>         >     >     > SELECT * FROM
>
>         >     >     > ( SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc
>
>         > LIMIT
>
>         >     >     > 2147483647) `t0`
>
>         >     >     > INNER JOIN
>
>         >     >     > ( SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc
>
>         > LIMIT
>
>         >     >     > 2147483647) `t1`
>
>         >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
>
>         >     >     > LIMIT 2147483647
>
>         >     >     >
>
>         >     >     > *Workaround (add columns **d1a381f3g73 and
> **d1a381f3g74 to
>
>         > tables
>
>         >     > one
>
>         >     >     > and two, respectively. Names don't really matter,
> just need to
>
>         > be
>
>         >     > unique):*
>
>         >     >     > SELECT * FROM
>
>         >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
>
>         >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
> `t0`
>
>         >     >     > INNER JOIN
>
>         >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
>
>         >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
> `t1`
>
>         >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
>
>         >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM
> `t1`.`UserID`
>
>         >     >     > LIMIT 2147483647
>
>         >     >     >
>
>         >     >     > Regards
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > *Shadi Khalifa, PhD*
>
>         >     >     > Postdoctoral Fellow
>
>         >     >     > Cognitive Analytics Development Hub
>
>         >     >     > Centre for Advanced Computing
>
>         >     >     > Queen’s University
>
>         >     >     > (613) 533-6000 x78347
>
>         >     >     > http://cac.queensu.ca
>
>         >     >     >
>
>         >     >     > I'm just a neuron in the society collective brain
>
>         >     >     >
>
>         >     >     > *Join us for HPCS in June 2017! Register at:*  *
>
>         > http://2017.hpcs.ca/
>
>         >     >     > <http://2017.hpcs.ca/>*
>
>         >     >     >
>
>         >     >     > P Please consider your environmental responsibility
> before
>
>         > printing
>
>         >     > this
>
>         >     >     > e-mail
>
>         >     >     >
>
>         >     >     > *01001001 00100000 01101100 01101111 01110110
> 01100101 00100000
>
>         >     > 01000101
>
>         >     >     > 01100111 01111001 01110000 01110100 *
>
>         >     >     >
>
>         >     >     > *The information transmitted is intended only for
> the person or
>
>         >     > entity to
>
>         >     >     > which it is addressed and may contain confidential
> material.
>
>         > Any
>
>         >     > review or
>
>         >     >     > dissemination of this information by persons other
> than the
>
>         > intended
>
>         >     >     > recipient is prohibited. If you received this in
> error, please
>
>         >     > contact the
>
>         >     >     > sender and delete the material from any computer.
> Thank you.*
>
>         >     >     >
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
>
>         >     > m.gelb...@gmail.com>
>
>         >     >     > wrote:
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > 
>
>         >     >     > Here it is:
>
>         >     >     >
>
>         >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
>
>         >     > `dfs`.`path_to_parquet_file` tc
>
>         >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC'
> `UserID` FROM
>
>         >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
> `t1` ON (
>
>         >     >     > 
>
>         >     >     > `t0`.`UserID` IS NOT DISTINCT FROM
>
>         >     >     > 
>
>         >     >     > `t1`.`UserID`) LIMIT 2147483647
>
>         >     >     >
>
>         >     >     > I debugged Drill code and found it decomposes *IS
> NOT DISTINCT
>
>         > FROM*
>
>         >     > into
>
>         >     >     > 
>
>         >     >     > *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS
> NULL &&
>
>         >     > `t1`.`UserID`
>
>         >     >     > IS NULL**)* while checking if the query is a
> cartesian join,
>
>         > and
>
>         >     > when the
>
>         >     >     > check returns true, it throws an excetion saying:
> *This query
>
>         > cannot
>
>         >     > be
>
>         >     >     > planned possibly due to either a cartesian join or an
>
>         > inequality
>
>         >     > join*
>
>         >     >     >
>
>         >     >     >
>
>         >     >     > *---------------------*
>
>         >     >     > *Muhammad Gelbana*
>
>         >     >     > http://www.linkedin.com/in/mgelbana
>
>         >     >     >
>
>         >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <
> gpa...@mapr.com>
>
>         >     > wrote:
>
>         >     >     >
>
>         >     >     > > Can you please specify the query you are trying to
> execute?
>
>         >     >     > >
>
>         >     >     > >
>
>         >     >     > > Gautam
>
>         >     >     > >
>
>         >     >     > > ________________________________
>
>         >     >     > > From: Muhammad Gelbana <m.gelb...@gmail.com>
>
>         >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
>
>         >     >     > > To: u...@drill.apache.org; dev@drill.apache.org
>
>         >     >     > > Subject: Running cartesian joins on Drill
>
>         >     >     > >
>
>         >     >     > > Is there a reason why Drill would intentionally
> reject
>
>         > cartesian
>
>         >     > join
>
>         >     >     > > queries even if 
> *planner.enable_nljoin_for_scalar_only*
> is
>
>         >     > disabled ?
>
>         >     >     > >
>
>         >     >     > > Any ideas how could a query be rewritten to
> overcome this
>
>         >     > restriction ?
>
>         >     >     > >
>
>         >     >     > > *---------------------*
>
>         >     >     > > *Muhammad Gelbana*
>
>         >     >     > > http://www.linkedin.com/in/mgelbana
>
>         >     >     > >
>
>         >     >     >
>
>         >     >     >
>
>         >     >     >
>
>         >     >
>
>         >     >
>
>         >     >
>
>         >
>
>         >
>
>         >
>
>
>
>
>
>
>

Re: Running cartesian joins on Drill

Reply via email to