+1,

We also see performance degradation when comparing Spark SQL with Hive.
We have a table with 260 columns and ran it in both Hive and Spark. In Hive, it
takes 66 seconds for 1 GB of data, whereas in Spark it takes 4 minutes.

On 6/9/2016 3:19 PM, Gavin Yue wrote:
Could you print out the SQL execution plan? My guess is that this is related to the broadcast join.
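
For reference, a minimal sketch of one way to print the plan from PySpark on Spark 1.6. It assumes a HiveContext named `sqlContext` already exists (it is not created here), and `print_plan` is a hypothetical helper name, not something from this thread. `DataFrame.explain(True)` prints the logical and physical plans; if a broadcast join was chosen, it should be visible among the physical operators.

```python
# Query 1 from the original message, with the join keys written plainly.
QUERY1 = """
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null
"""


def print_plan(sql_context, query):
    """Print the extended plan for `query` and return the DataFrame.

    `sql_context` is assumed to be an existing HiveContext or SQLContext.
    It is passed in rather than constructed, so this sketch makes no
    assumption about cluster configuration.
    """
    df = sql_context.sql(query)  # builds the plan; a plain SELECT runs no job yet
    df.explain(True)             # True = extended: logical and physical plans
    return df

# Usage, inside a pyspark shell where `sqlContext` already exists:
#   print_plan(sqlContext, QUERY1)
```

Running the same helper on Query 2 and comparing the two physical plans side by side would show whether the join strategy or the scan of table A is what differs.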



On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

Query 1 is roughly 16x faster in Hive than in Spark (4 minutes versus 1.1 hours).
What is happening here, and is there a way to optimize the queries in Spark
without the obvious hack in Query 2?


-----------------------
ENVIRONMENT:
-----------------------

> Table A has 533 columns x 24 million rows and Table B has 2 columns x 3 million 
> rows. Each table is stored as a single gzipped CSV file.
> Both tables A and B are external tables in AWS S3, created in Hive and accessed 
> through Spark using HiveContext.
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using 
> maximizeResourceAllocation; node types are c3.4xlarge).

--------------
QUERY 1:
--------------
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null;



This query takes 4 minutes in Hive and 1.1 hours in Spark.


--------------
QUERY 2:
--------------

select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4.5 minutes in Spark.
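
As a side check on the two query shapes (not a reproduction of the timings above), here is a small sketch using Python's built-in `sqlite3` with made-up toy tables standing in for the real A and B. It illustrates that Query 1 and Query 2 return the same rows, and that, because the `where B.FK is not null` filter discards every unmatched row, both are equivalent to a plain inner join. The extra columns `WIDE` and `VAL` and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table A (PK integer, WIDE text);  -- WIDE stands in for the other columns
    create table B (FK integer, VAL text);
    insert into A values (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    insert into B values (2, 'x'), (4, 'y'), (4, 'z'), (9, 'w');
""")

QUERY1 = """
    select A.PK, B.FK
    from A
    left outer join B on (A.PK = B.FK)
    where B.FK is not null
"""

QUERY2 = """
    select A.PK, B.FK
    from (select PK from A) A
    left outer join B on (A.PK = B.FK)
    where B.FK is not null
"""

# Same result expressed as an inner join: unmatched rows of A never survive
# the "is not null" filter, so the outer join adds nothing here.
INNER = """
    select A.PK, B.FK
    from A
    join B on (A.PK = B.FK)
"""


def rows(query):
    # Sort so the comparison does not depend on row order.
    return sorted(conn.execute(query).fetchall())


q1, q2, inner = rows(QUERY1), rows(QUERY2), rows(INNER)
print(q1)
print(q1 == q2 == inner)
```

This only confirms that the rewrites are semantically safe. Whether the inner-join form changes the Spark 1.6.1 timing is untested here and would need to be measured on the actual tables.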



Regards,
Gourav Sengupta



