Re: Running cartesian joins on Drill

Anup Tiwari Thu, 11 May 2017 04:21:00 -0700

Hi,

I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mention : adding a
dummy column on the fly that has the value 1 in both tables and then join
on that column leading to having a match of every row of the first table
with every row of the second table, hence do a Cartesian product?
OR
If we just don't specify join condition like :
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
query as Cartesian join.


Regards,
*Anup Tiwari*

On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <[email protected]> wrote:

> Cartesian joins in Drill are implemented as nested loop joins, and I think
> you should see that reflected in the resultant query plan when you run
> explain plan on the query.
>
> Yes, Cartesian joins/nested loop joins are expensive because you’re
> effectively doing an MxN read of your tables.  There are more efficient
> ways of processing a nested loop join, e.g., by creating an index on the
> larger table in the join and then using that index to do lookups into that
> table.  That way, the nested loop join cost is the cost of creating the
> index + M, where M is the number of rows in the smaller table and assuming
> the lookup cost into the index does minimize the amount of data read of the
> second table.  Drill currently doesn’t do this.
>
> -- Zelaine
>
> On 5/8/17, 9:09 AM, "Muhammad Gelbana" <[email protected]> wrote:
>
>     I believe clhubert is referring to this discussion
>     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
> cartesian-product-in-apache-drill#post1>
>     .
>
>     So why Drill doesn't transform this query into a nested join query ?
> Simply
>     because there is no Calcite rule to transform it into a nested loop
> join ?
>     Is it not technically possible to write such Rule or is it feasible so
> I
>     may take on this challenge ?
>
>     Also pardon me for repeating my question but I fail to find an answer
> in
>     your replies, why doesn't Drill just run a cartesian join ? Because
> it's
>     expensive regarding resources (i.e. CPU\Network\RAM) ?
>
>     Thanks a lot Shadi for the query, it works for me.
>
>     *---------------------*
>     *Muhammad Gelbana*
>     http://www.linkedin.com/in/mgelbana
>
>     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <[email protected]>
> wrote:
>
>     > Hi Muhammad,
>     >
>     > I did the following as a workaround to have Cartesian product. The
> basic
>     > idea is to create a dummy column on the fly that has the value 1 in
> both
>     > tables and then join on that column leading to having a match of
> every row
>     > of the first table with every row of the second table, hence do a
> Cartesian
>     > product. This might not be the most efficient way but it will do the
> job.
>     >
>     > *Original Query:*
>     > SELECT * FROM
>     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
>     > 2147483647) `t0`
>     > INNER JOIN
>     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
>     > 2147483647) `t1`
>     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
>     > LIMIT 2147483647
>     >
>     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
> one
>     > and two, respectively. Names don't really matter, just need to be
> unique):*
>     > SELECT * FROM
>     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
>     > INNER JOIN
>     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
>     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
>     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
>     > LIMIT 2147483647
>     >
>     > Regards
>     >
>     >
>     > *Shadi Khalifa, PhD*
>     > Postdoctoral Fellow
>     > Cognitive Analytics Development Hub
>     > Centre for Advanced Computing
>     > Queen’s University
>     > (613) 533-6000 x78347
>     > http://cac.queensu.ca
>     >
>     > I'm just a neuron in the society collective brain
>     >
>     > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
>     > <http://2017.hpcs.ca/>*
>     >
>     > P Please consider your environmental responsibility before printing
> this
>     > e-mail
>     >
>     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
> 01000101
>     > 01100111 01111001 01110000 01110100 *
>     >
>     > *The information transmitted is intended only for the person or
> entity to
>     > which it is addressed and may contain confidential material. Any
> review or
>     > dissemination of this information by persons other than the intended
>     > recipient is prohibited. If you received this in error, please
> contact the
>     > sender and delete the material from any computer. Thank you.*
>     >
>     >
>     >
>     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
> [email protected]>
>     > wrote:
>     >
>     >
>     > 
>     > Here it is:
>     >
>     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc
>     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
>     > 
>     > `t0`.`UserID` IS NOT DISTINCT FROM
>     > 
>     > `t1`.`UserID`) LIMIT 2147483647
>     >
>     > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM*
> into
>     > 
>     > *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
> `t1`.`UserID`
>     > IS NULL**)* while checking if the query is a cartesian join, and
> when the
>     > check returns true, it throws an excetion saying: *This query cannot
> be
>     > planned possibly due to either a cartesian join or an inequality
> join*
>     >
>     >
>     > *---------------------*
>     > *Muhammad Gelbana*
>     > http://www.linkedin.com/in/mgelbana
>     >
>     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <[email protected]>
> wrote:
>     >
>     > > Can you please specify the query you are trying to execute?
>     > >
>     > >
>     > > Gautam
>     > >
>     > > ________________________________
>     > > From: Muhammad Gelbana <[email protected]>
>     > > Sent: Saturday, May 6, 2017 7:34:53 AM
>     > > To: [email protected]; [email protected]
>     > > Subject: Running cartesian joins on Drill
>     > >
>     > > Is there a reason why Drill would intentionally reject cartesian
> join
>     > > queries even if *planner.enable_nljoin_for_scalar_only* is
> disabled ?
>     > >
>     > > Any ideas how could a query be rewritten to overcome this
> restriction ?
>     > >
>     > > *---------------------*
>     > > *Muhammad Gelbana*
>     > > http://www.linkedin.com/in/mgelbana
>     > >
>     >
>     >
>     >
>
>
>

Re: Running cartesian joins on Drill

Reply via email to