Could you print out the sql execution plan? My guess is about broadcast join.
> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > > Hi, > > Query1 is almost 25x faster in HIVE than in SPARK. What is happening here and > is there a way we can optimize the queries in SPARK without the obvious hack > in Query2. > > > ----------------------- > ENVIRONMENT: > ----------------------- > > > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 million > > rows. Both the files are single gzipped csv file. > > Both table A and B are external tables in AWS S3 and created in HIVE > > accessed through SPARK using HiveContext > > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using > > allowMaximumResource allocation and node types are c3.4xlarge). > > -------------- > QUERY1: > -------------- > select A.PK, B.FK > from A > left outer join B on (A.PK = B.FK) > where B.FK is not null; > > > > This query takes 4 mins in HIVE and 1.1 hours in SPARK > > > -------------- > QUERY 2: > -------------- > > select A.PK, B.FK > from (select PK from A) A > left outer join B on (A.PK = B.FK) > where B.FK is not null; > > This query takes 4.5 mins in SPARK > > > > Regards, > Gourav Sengupta > > >