Hi,

Your statement
"I have a system with 64 GB RAM and SSD and its performance on local
cluster SPARK is way better"

Is this a host with 64 GB of RAM, with your data stored on local solid
state disks? Can you kindly provide the parameters you pass to
spark-submit:

${SPARK_HOME}/bin/spark-submit \
  --master local[?] \
  --driver-memory ?G \
  --num-executors 1 \
  --executor-memory ?G

Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 16 June 2016 at 11:40, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> We have a dimension table with a few hundred columns, from which we need
> only a few columns to join with the main fact table, which has a few
> million rows. I do not know how one-off this case sounds, but since I
> have been working in data warehousing it seems like a fairly common use
> case.
>
> Spark in local mode will be way faster compared to Spark running on
> Hadoop. I have a system with 64 GB RAM and SSD, and its performance on
> Spark in local mode is way better.
>
> Did your join include the same number of columns and rows for the
> dimension table?
>
> Regards,
> Gourav Sengupta
>
> On Thu, Jun 16, 2016 at 9:35 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Sounds like this is a one-off case.
>>
>> Do you have any other use case where Hive on MR outperforms Spark?
>>
>> I did some tests on a 1 billion row table, getting the selectivity of a
>> column using Hive on MR, Hive on the Spark engine, and Spark running in
>> local mode (to keep it simple).
>>
>> Hive 2, Spark 1.6.1
>>
>> Results:
>>
>> Hive with map-reduce --> 18 minutes
>> Hive on Spark engine --> 6 minutes
>> Spark                --> 2 minutes
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 16 June 2016 at 08:43, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> I agree here.
>>>
>>> However, it always depends on your use case!
>>>
>>> Best regards
>>>
>>> On 16 Jun 2016, at 04:58, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>> Hi Mahender,
>>>
>>> Please ensure that for dimension tables you are enabling the broadcast
>>> method. You should be able to see surprising gains of around 12x.
>>>
>>> Overall, I think that Spark cannot figure out whether to scan all the
>>> columns in a table or just the ones which are being used, and this
>>> causes the issue.
>>>
>>> When you start using Hive with ORC and Tez you will see some amazing
>>> results, which leave Spark way behind. So pretty much you need to have
>>> your data in memory to match the performance claims of Spark, and the
>>> advantage in that case comes not from Spark's algorithms but simply
>>> from fast I/O out of RAM. The advantage of Spark is that it brings
>>> analytics, querying, and streaming frameworks together in an
>>> accessible way.
>>>
>>> If you follow the optimisations mentioned in the link below, you
>>> hardly have any reason for using Spark SQL:
>>> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ .
And
>>> imagine being able to do all of that without machines that require
>>> huge RAM; in short, you achieve those performance gains using the
>>> commodity low-cost systems around which Hadoop was designed.
>>>
>>> I think that Hortonworks is giving some stiff competition here :)
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <mahender.bigd...@outlook.com> wrote:
>>>
>>>> +1.
>>>>
>>>> We even see performance degradation when comparing Spark SQL with
>>>> Hive. We have a table of 260 columns and have executed the same query
>>>> in Hive and Spark: in Hive it takes 66 seconds for 1 GB of data,
>>>> whereas in Spark it takes 4 minutes.
>>>>
>>>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>>>
>>>> Could you print out the SQL execution plan? My guess is that it is
>>>> about the broadcast join.
>>>>
>>>> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Query 1 is almost 25x faster in Hive than in Spark. What is happening
>>>> here, and is there a way we can optimise the queries in Spark without
>>>> the obvious hack in Query 2?
>>>>
>>>> -----------------------
>>>> ENVIRONMENT:
>>>> -----------------------
>>>>
>>>> > Table A has 533 columns x 24 million rows and Table B has 2 columns
>>>> x 3 million rows. Both are single gzipped CSV files.
>>>> > Both tables A and B are external tables in AWS S3, created in Hive
>>>> and accessed through Spark using HiveContext.
>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started with
>>>> maximizeResourceAllocation and c3.4xlarge nodes).
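[Editorial note: as an illustration of the "broadcast method" recommended earlier in this thread, here is a minimal Spark 1.6 Scala sketch. It assumes a HiveContext bound to sqlContext; the table names ("fact", "dim") and column names are hypothetical, not from this thread.]

```scala
// Minimal sketch, assuming a Spark 1.6 HiveContext bound to sqlContext.
// Table names ("fact", "dim") and column names are hypothetical.
import org.apache.spark.sql.functions.broadcast

val fact = sqlContext.table("fact")   // large fact table, millions of rows
val dim  = sqlContext.table("dim")    // wide dimension table, few hundred columns

// Keep only the dimension columns the query needs, then hint Spark to
// broadcast the pruned table so the join is map-side instead of shuffled.
val dimSlim = broadcast(dim.select("dim_key", "dim_name"))
val joined  = fact.join(dimSlim, fact("fk") === dimSlim("dim_key"))
```

Spark will also broadcast automatically when it estimates the table to be smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default), but the explicit broadcast() hint avoids relying on table statistics.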
>>>>
>>>> --------------
>>>> QUERY 1:
>>>> --------------
>>>>
>>>> select A.PK, B.FK
>>>> from A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>> This query takes 4 minutes in Hive and 1.1 hours in Spark.
>>>>
>>>> --------------
>>>> QUERY 2:
>>>> --------------
>>>>
>>>> select A.PK, B.FK
>>>> from (select PK from A) A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>> This query takes 4.5 minutes in Spark.
>>>>
>>>> Regards,
>>>> Gourav Sengupta
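[Editorial note: following up on the suggestion in this thread to print the SQL execution plan, here is a hedged Scala sketch for Query 1. It assumes a Spark 1.6 HiveContext bound to sqlContext, with tables A and B registered as described in the ENVIRONMENT section.]

```scala
// Sketch, assuming a Spark 1.6 HiveContext bound to sqlContext and the
// tables A and B registered as in the ENVIRONMENT section above.
// explain(true) prints the parsed, analysed, optimised and physical
// plans, so you can check whether all 533 columns of A are being
// scanned and whether the join is a BroadcastHashJoin or a
// shuffle-based join.
val q1 = sqlContext.sql("""
  select A.PK, B.FK
  from A
  left outer join B on (A.PK = B.FK)
  where B.FK is not null
""")
q1.explain(true)
```

If the plan shows a full scan of A, manually projecting only the needed columns first (as in Query 2) is the workaround; a broadcast hint on the smaller table B is another thing worth checking in the plan.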