I still fail to see how Hive can be orders of magnitude faster than Spark.
Assuming that Hive is using map-reduce, I cannot see a real case for Hive to be faster than Spark, at least under normal operations.

Don't get me wrong, I am a fan of Hive. The performance of Hive comes from the execution engine (mr, spark, tez) that it deploys to do the work. If we leave that aside for now, the other influencing factor would be the Hive optimizer compared to the Spark optimizer.

If I go back to the thread owner's point and quote:

"Query1 is almost 25x faster in HIVE than in SPARK. What is happening here and is there a way we can optimize the queries in SPARK without the obvious hack in Query2.

> Table A 533 columns x 24 million rows and Table B has 2 columns x 3 million rows. Both the files are single gzipped csv file.
> Both table A and B are external tables in AWS S3 and created in HIVE accessed through SPARK using HiveContext
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using allowMaximumResource allocation and node types are c3.4xlarge)."

To take it further and make some reasonable deductions. With gzipped files:

1. Hive will not be able to split the csv files into chunks/blocks and run multiple maps in parallel.
2. Spark will give you an RDD with only 1 partition (as of 0.9.0). This is because gzipped files are not splittable. If you do not repartition the RDD, any operations on that RDD will be limited to a single core.

So with gzipped files both Hive and Spark have issues.

Neither table has a very large number of rows. With Spark, were temporary tables deployed? IMO that does help performance. It is possible that Spark has been spilling to disk.
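To illustrate the point about gzip above, here is a minimal Python sketch using only the standard library (it is not Spark or Hive code). It compresses some CSV-like rows that are made up for the example, then tries to start decoding from the middle of the compressed bytes, which is what a second mapper or a second partition would have to do. The helper name try_read_from is hypothetical.

```python
import gzip
import zlib

# Hypothetical CSV-like data, made up purely for this illustration.
rows = "".join(f"{i},value_{i}\n" for i in range(10_000)).encode("utf-8")
compressed = gzip.compress(rows)


def try_read_from(data: bytes, offset: int) -> bool:
    """Return True if a gzip stream can be decoded starting at `offset`."""
    try:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        zlib.decompressobj(wbits=31).decompress(data[offset:])
        return True
    except zlib.error:
        # No valid gzip header at this position, so decoding cannot start here.
        return False


from_start = try_read_from(compressed, 0)
from_middle = try_read_from(compressed, len(compressed) // 2)

print("readable from the start: ", from_start)
print("readable from the middle:", from_middle)
```

A reader has to begin at byte 0 and work through the whole stream, so one gzipped file ends up as one map task in Hive or one partition in Spark. That is why repartitioning after the read, as mentioned in point 2, matters for any later stage.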
We really need the output from the GUI (jobs, stages and spillage), like below, to deduce whether there was indeed spillage to disk by Spark; see (TungstenAggregate).

HTH

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 9 June 2016 at 21:40, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Stephen,
>
> How can a single gzipped CSV file be partitioned and who partitions tables
> based on Primary Key in Hive? If you read the environments section you
> will be able to see that all the required details are mentioned.
>
> As far as I understand that Hive does work 25x faster (in these particular
> cases) and around 100x faster (when we are using TEZ) when compared to
> SPARK.
>
> It will be interesting to see if Ted includes these findings while they
> are benchmarking SPARK. This is a very typical and a general used case.
>
> Regards,
> Gourav
>
> On Thu, Jun 9, 2016 at 5:11 PM, Stephen Boesch <java...@gmail.com> wrote:
>
>> ooc are the tables partitioned on a.pk and b.fk? Hive might be using
>> copartitioning in that case: it is one of hive's strengths.
>>
>> 2016-06-09 7:28 GMT-07:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>>
>>> Hi Mich,
>>>
>>> does not Hive use map-reduce? I thought it to be so. And since I am
>>> running the queries in EMR 4.6 therefore HIVE is not using TEZ.
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> are you using map-reduce with Hive?
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 9 June 2016 at 15:14, Gourav Sengupta <gourav.sengu...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>>>> here and is there a way we can optimize the queries in SPARK without the
>>>>> obvious hack in Query2.
>>>>>
>>>>> -----------------------
>>>>> ENVIRONMENT:
>>>>> -----------------------
>>>>>
>>>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>>>> million rows. Both the files are single gzipped csv file.
>>>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>>>> accessed through SPARK using HiveContext
>>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>>>
>>>>> --------------
>>>>> QUERY1:
>>>>> --------------
>>>>> select A.PK, B.FK
>>>>> from A
>>>>> left outer join B on (A.PK = B.FK)
>>>>> where B.FK is not null;
>>>>>
>>>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>>>
>>>>> --------------
>>>>> QUERY 2:
>>>>> --------------
>>>>>
>>>>> select A.PK, B.FK
>>>>> from (select PK from A) A
>>>>> left outer join B on (A.PK = B.FK)
>>>>> where B.FK is not null;
>>>>>
>>>>> This query takes 4.5 mins in SPARK
>>>>>
>>>>> Regards,
>>>>> Gourav Sengupta