Hi Mahender, please ensure that for dimension tables you are enabling the broadcast method. You must be able to see surprising gains @12x.
Overall I think that SPARK cannot figure out whether to scan all the columns in a table or just the ones which are being used causing this issue. When you start using HIVE with ORC and TEZ (*) you will see some amazing results, and leaves SPARK way way behind. So pretty much you need to have your data in memory for matching the performance claims of SPARK and the advantage in that case you are getting is not because of SPARK algorithms but just fast I/O from RAM. The advantage of SPARK is that it makes accessible analytics, querying, and streaming frameworks together. In case you are following the optimisations mentioned in the link you hardly have any reasons for using SPARK SQL: http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And imagine being able to do all of that without having machines which requires huge RAM, or in short you are achieving those performance gains using commodity low cost systems around which HADOOP was designed. I think that Hortonworks is giving a stiff competition here :) Regards, Gourav Sengupta On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam < mahender.bigd...@outlook.com> wrote: > +1, > > Even see performance degradation while comparing SPark SQL with Hive. > We have table of 260 columns. We have executed in hive and SPARK. In Hive, > it is taking 66 sec for 1 gb of data whereas in Spark, it is taking 4 mins > of time. > On 6/9/2016 3:19 PM, Gavin Yue wrote: > > Could you print out the sql execution plan? My guess is about broadcast > join. > > > > On Jun 9, 2016, at 07:14, Gourav Sengupta < <gourav.sengu...@gmail.com> > gourav.sengu...@gmail.com> wrote: > > Hi, > > Query1 is almost 25x faster in HIVE than in SPARK. What is happening here > and is there a way we can optimize the queries in SPARK without the obvious > hack in Query2. > > > ----------------------- > ENVIRONMENT: > ----------------------- > > > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 > million rows. Both the files are single gzipped csv file. > > Both table A and B are external tables in AWS S3 and created in HIVE > accessed through SPARK using HiveContext > > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using > allowMaximumResource allocation and node types are c3.4xlarge). > > -------------- > QUERY1: > -------------- > select A.PK, B.FK > from A > left outer join B on (A.PK = B.FK) > where B.FK is not null; > > > > This query takes 4 mins in HIVE and 1.1 hours in SPARK > > > -------------- > QUERY 2: > -------------- > > select A.PK, B.FK > from (select PK from A) A > left outer join B on (A.PK = B.FK) > where B.FK is not null; > > This query takes 4.5 mins in SPARK > > > > Regards, > Gourav Sengupta > > > > >