Re: HIVE Query 25x faster than SPARK Query

Mahender Sarangam Wed, 15 Jun 2016 15:36:09 -0700

+1,

We also see performance degradation when comparing Spark SQL with Hive.
We have a table with 260 columns and queried it in both Hive and Spark. In Hive it 
takes 66 sec for 1 GB of data, whereas in Spark it takes 4 mins.


On 6/9/2016 3:19 PM, Gavin Yue wrote:
Could you print out the SQL execution plan? My guess is that it is related to the broadcast join.
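
One way to get that plan is sketched below in PySpark. It assumes an already-created HiveContext is passed in as `sql_context`; the helper name `print_plan` and the constant `QUERY1` are illustrative and not from the thread. `DataFrame.explain(True)` prints the logical plans along with the physical plan.

```python
# Query1 from the thread, kept as a plain string.
QUERY1 = """
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null
"""


def print_plan(sql_context, query):
    """Print the extended execution plan for a query.

    sql_context is assumed to be an existing HiveContext (Spark 1.6.x);
    this helper is a hypothetical convenience wrapper, not a Spark API.
    """
    df = sql_context.sql(query)
    # explain(True) prints the logical plans and the physical plan;
    # explain() with no argument prints only the physical plan.
    df.explain(True)
    return df
```

Calling `print_plan(sqlContext, QUERY1)` in a PySpark shell should show whether the physical plan uses a broadcast join or a sort-merge join. The size limit below which Spark broadcasts a table is governed by `spark.sql.autoBroadcastJoinThreshold`.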



On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

Query1 is almost 25x faster in HIVE than in SPARK. What is happening here, and 
is there a way to optimize the queries in SPARK without the obvious hack in 
Query2?


-----------------------
ENVIRONMENT:
-----------------------

> Table A has 533 columns x 24 million rows and Table B has 2 columns x 3 million 
> rows. Each table is a single gzipped CSV file.
> Both tables A and B are external tables in AWS S3, created in HIVE and accessed 
> through SPARK using HiveContext.
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using 
> allowMaximumResource allocation and node types are c3.4xlarge).

--------------
QUERY1:
--------------
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null;



This query takes 4 mins in HIVE and 1.1 hours in SPARK


--------------
QUERY 2:
--------------

select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4.5 mins in SPARK
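
As a small sanity check that Query 2 is only a rewrite of Query1 and not a different query, the sketch below runs both against toy tables in an in-memory SQLite database. The tables and rows are made-up stand-ins, not the actual S3-backed Hive tables, and this says nothing about performance. It also runs a plain inner join, since the filter `B.FK is not null` discards exactly the rows that the outer join would pad with nulls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-ins: A carries an extra column to mimic a wide table, B has 2 columns.
cur.execute("create table A (PK integer, other text)")
cur.execute("create table B (FK integer, val text)")
cur.executemany("insert into A values (?, ?)",
                [(1, "x"), (2, "y"), (3, "z"), (4, "w")])
cur.executemany("insert into B values (?, ?)",
                [(2, "b2"), (4, "b4"), (5, "b5")])

query1 = """
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null
"""

query2 = """
select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null
"""

# Inner join form: the not-null filter on B.FK keeps only matched rows anyway.
inner_join = "select A.PK, B.FK from A inner join B on (A.PK = B.FK)"

rows1 = sorted(cur.execute(query1).fetchall())
rows2 = sorted(cur.execute(query2).fetchall())
rows_inner = sorted(cur.execute(inner_join).fetchall())

print(rows1 == rows2 == rows_inner)
```

If the three result sets match on real data as well, the difference between Query1 and Query 2 is purely in how much of table A each engine reads and moves around, which is what the execution plan should reveal.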



Regards,
Gourav Sengupta
