Could you print out the sql execution plan? My guess is about broadcast join. 



> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> 
> Hi,
> 
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here and 
> is there a way we can optimize the queries in SPARK without the obvious hack 
> in Query2.
> 
> 
> -----------------------
> ENVIRONMENT:
> -----------------------
> 
> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 million 
> > rows. Both the files are single gzipped csv file.
> > Both table A and B are external tables in AWS S3 and created in HIVE 
> > accessed through SPARK using HiveContext
> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using 
> > allowMaximumResource allocation and node types are c3.4xlarge).
> 
> --------------
> QUERY1: 
> --------------
> select A.PK, B.FK
> from A 
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
> 
> 
> 
> This query takes 4 mins in HIVE and 1.1 hours in SPARK 
> 
> 
> --------------
> QUERY 2:
> --------------
> 
> select A.PK, B.FK
> from (select PK from A) A 
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
> 
> This query takes 4.5 mins in SPARK 
> 
> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> 

Reply via email to