are you using map-reduce with Hive? Dr Mich Talebzadeh
LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 9 June 2016 at 15:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > Hi, > > Query1 is almost 25x faster in HIVE than in SPARK. What is happening here > and is there a way we can optimize the queries in SPARK without the obvious > hack in Query2. > > > ----------------------- > ENVIRONMENT: > ----------------------- > > > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 > million rows. Both the files are single gzipped csv file. > > Both table A and B are external tables in AWS S3 and created in HIVE > accessed through SPARK using HiveContext > > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using > allowMaximumResource allocation and node types are c3.4xlarge). > > -------------- > QUERY1: > -------------- > select A.PK, B.FK > from A > left outer join B on (A.PK = B.FK) > where B.FK is not null; > > > > This query takes 4 mins in HIVE and 1.1 hours in SPARK > > > -------------- > QUERY 2: > -------------- > > select A.PK, B.FK > from (select PK from A) A > left outer join B on (A.PK = B.FK) > where B.FK is not null; > > This query takes 4.5 mins in SPARK > > > > Regards, > Gourav Sengupta > > > >