Antonio Piccolboni created SPARK-13169:
------------------------------------------
Summary: CROSS JOIN slow or fails on tiny table
Key: SPARK-13169
URL: https://issues.apache.org/jira/browse/SPARK-13169
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.0
Reporter: Antonio Piccolboni
I am running a cross join with a SELECT DISTINCT on both sides. The table is
tiny (32 X 16). I am running the query through the Thrift server on a single
machine. The data is here
(https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv). The
query never completes in under 200 s and usually fails with a
TTransportException, while all cores are in use.
The query is:
{code}
SELECT `gwtlyrpywf`.`gear`, `gwtlyrpywf`.`cyl`, `vs` FROM (
  SELECT DISTINCT * FROM (
    SELECT `gear` AS `gear`, `cyl` AS `cyl` FROM `mtcars`)
  AS `zzz1`)
AS `gwtlyrpywf`
CROSS JOIN (
  SELECT DISTINCT * FROM (
    SELECT `vs` AS `vs` FROM `mtcars`)
  AS `zzz3`)
AS `arytvfispy`
{code}
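A minimal sketch of a flatter equivalent of the query above, checked for identical results. It runs against SQLite rather than Spark, purely so the example is self-contained (SQLite accepts the same backtick quoting), so it says nothing about Spark's behaviour. The five rows are made-up stand-ins for mtcars, and the aliases `a` and `b` are hypothetical.

```python
import sqlite3

# The generated query, as reported.
GENERATED = """
SELECT `gwtlyrpywf`.`gear`, `gwtlyrpywf`.`cyl`, `vs` FROM (
  SELECT DISTINCT * FROM (
    SELECT `gear` AS `gear`, `cyl` AS `cyl` FROM `mtcars`)
  AS `zzz1`)
AS `gwtlyrpywf`
CROSS JOIN (
  SELECT DISTINCT * FROM (
    SELECT `vs` AS `vs` FROM `mtcars`)
  AS `zzz3`)
AS `arytvfispy`
"""

# A hand-flattened form (hypothetical aliases a and b).
SIMPLIFIED = """
SELECT a.gear, a.cyl, b.vs
FROM (SELECT DISTINCT gear, cyl FROM mtcars) AS a
CROSS JOIN (SELECT DISTINCT vs FROM mtcars) AS b
"""


def run_both():
    """Run both queries on a tiny in-memory table and return sorted results."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mtcars (gear REAL, cyl REAL, vs REAL)")
    # Made-up rows (gear, cyl, vs); NOT the real mtcars data.
    rows = [(4, 6, 0), (4, 6, 0), (4, 4, 1), (3, 8, 0), (3, 6, 1)]
    conn.executemany("INSERT INTO mtcars VALUES (?, ?, ?)", rows)
    generated = sorted(conn.execute(GENERATED).fetchall())
    simplified = sorted(conn.execute(SIMPLIFIED).fetchall())
    conn.close()
    return generated, simplified


generated, simplified = run_both()
print(generated == simplified, len(generated))
```

Both forms return one row per (distinct gear/cyl pair, distinct vs) combination, so the nesting adds no work in principle.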
I know it can be simplified, but it comes from a generator, and the generator
counts on the optimizer to do the right thing. EXPLAIN shows the following:
{code}
== Physical Plan ==
Project [gear#21,cyl#22,vs#23]
+- CartesianProduct
   :- ConvertToSafe
   :  +- TungstenAggregate(key=[gear#21,cyl#22], functions=[], output=[gear#21,cyl#22])
   :     +- TungstenExchange hashpartitioning(gear#21,cyl#22,200), None
   :        +- TungstenAggregate(key=[gear#21,cyl#22], functions=[], output=[gear#21,cyl#22])
   :           +- Project [gear#17 AS gear#21,cyl#9 AS cyl#22]
   :              +- Scan CsvRelation(<function0>,Some(/var/folders/_p/1gx4vy311_x4syn2xq6f2xtc0000gr/T//RtmpeDwNvS/file168c154ef10e),true,,,",null,#,FAILFAST,commons,false,false,false,StructType(StructField(mpg,DoubleType,true), StructField(cyl,DoubleType,true), StructField(disp,DoubleType,true), StructField(hp,DoubleType,true), StructField(drat,DoubleType,true), StructField(wt,DoubleType,true), StructField(qsec,DoubleType,true), StructField(vs,DoubleType,true), StructField(am,DoubleType,true), StructField(gear,DoubleType,true), StructField(carb,DoubleType,true)),true,null)[gear#17,cyl#9]
   +- ConvertToSafe
      +- TungstenAggregate(key=[vs#23], functions=[], output=[vs#23])
         +- TungstenExchange hashpartitioning(vs#23,200), None
            +- TungstenAggregate(key=[vs#23], functions=[], output=[vs#23])
               +- Project [vs#15 AS vs#23]
                  +- Scan CsvRelation(<function0>,Some(/var/folders/_p/1gx4vy311_x4syn2xq6f2xtc0000gr/T//RtmpeDwNvS/file168c154ef10e),true,,,",null,#,FAILFAST,commons,false,false,false,StructType(StructField(mpg,DoubleType,true), StructField(cyl,DoubleType,true), StructField(disp,DoubleType,true), StructField(hp,DoubleType,true), StructField(drat,DoubleType,true), StructField(wt,DoubleType,true), StructField(qsec,DoubleType,true), StructField(vs,DoubleType,true), StructField(am,DoubleType,true), StructField(gear,DoubleType,true), StructField(carb,DoubleType,true)),true,null)[vs#15]
{code}
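One possible reading of the plan, offered as a hypothesis rather than a confirmed diagnosis: each side is shuffled into 200 partitions before the CartesianProduct. Assuming the operator behaves like RDD.cartesian, which yields one output partition per pair of input partitions, the join would be scheduled as 200 x 200 tasks no matter how few rows exist. The sketch below only does that arithmetic on an abbreviated copy of the plan; `PLAN_EXCERPT` and the function names are hypothetical.

```python
import re

# Abbreviated copy of the two exchange lines from the plan above.
PLAN_EXCERPT = """
:     +- TungstenExchange hashpartitioning(gear#21,cyl#22,200), None
         +- TungstenExchange hashpartitioning(vs#23,200), None
"""


def shuffle_partition_counts(plan_text):
    """Return the partition count of every hashpartitioning(...) in the plan."""
    return [int(n) for n in re.findall(r"hashpartitioning\(.*?,(\d+)\)", plan_text)]


def cartesian_task_count(plan_text):
    """Multiply the partition counts of the two join inputs.

    Assumption: the cartesian product creates one output partition (one task)
    for each pair of input partitions, as RDD.cartesian does.
    """
    left, right = shuffle_partition_counts(plan_text)
    return left * right


print(shuffle_partition_counts(PLAN_EXCERPT))
print(cartesian_task_count(PLAN_EXCERPT))
```

If that reading holds, nearly all of those tasks would process empty partitions, which would be consistent with every core staying busy on a table this small.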
Thanks
