I still fail to see how Hive can be orders of magnitude faster than
Spark.

Assuming that Hive is using map-reduce, I cannot see a real case for Hive
to be faster than Spark, at least under normal operations.

Don't get me wrong. I am a fan of Hive. The performance of Hive comes from
the execution engine (mr, spark, tez) that it deploys to do the actual
work.

If we leave that aside for now, the other influencing factor would be the
Hive optimizer compared to the Spark optimizer.

If I go back to the thread owner's point and quote:

"Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
and is there a way we can optimize the queries in SPARK without the obvious
hack in Query2.

Table A 533 columns x 24 million rows and Table B has 2 columns x 3 million
rows. Both the files are single gzipped csv file.
> Both table A and B are external tables in AWS S3 and created in HIVE
accessed through SPARK using HiveContext
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
allowMaximumResource allocation and node types are c3.4xlarge)."

To take it further and make some reasonable deduction:

With gzipped files:

   1. Hive will not be able to split the csv files into chunks/blocks and
   run multiple maps in parallel.
   2. Spark will give you an RDD with only 1 partition (as of 0.9.0). This
   is because gzipped files are not splittable. If you do not repartition
   the RDD, any operations on that RDD will be limited to a single core.
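
To illustrate why a gzipped file cannot be split, here is a minimal sketch
in plain Python (standard library only, no Hive or Spark involved). The CSV
payload and the helper name try_decompress are made up for illustration; the
point is only that a reader handed the second half of a gzip stream has no
header and no preceding decoder state to start from, which is the situation
a second mapper or a second Spark partition would be in.

```python
import gzip
import zlib

# Build a small CSV payload and gzip it, standing in for one of the
# single gzipped csv files described above (the data here is made up).
rows = "".join(f"{i},value_{i}\n" for i in range(100_000)).encode("utf-8")
compressed = gzip.compress(rows)


def try_decompress(chunk: bytes) -> bool:
    """Return True if `chunk` can be decoded as a gzip stream on its own."""
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(chunk)
        return True
    except zlib.error:
        return False


half = len(compressed) // 2

# A reader that starts at byte 0 can decode, even if its chunk is cut short.
starts_ok = try_decompress(compressed[:half])

# A reader that is handed only the second half cannot begin decoding there.
middle_ok = try_decompress(compressed[half:])

print("decode from start of file: ", starts_ok)
print("decode from middle of file:", middle_ok)
```

This is why the whole file ends up in one map task or one partition unless
the data is repartitioned after it has been read, or stored in a splittable
format in the first place.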

So with gzipped files both Hive and Spark have issues. Neither table has a
very large number of rows. With Spark, were temporary tables deployed? That
IMO does help performance. It is possible that Spark has been spilling to
disk.

We really need the output from the Spark GUI (jobs, stages and spillage) to
deduce whether Spark did indeed spill to disk (see TungstenAggregate).

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 9 June 2016 at 21:40, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Stephen,
>
>
> How can a single gzipped CSV file be partitioned and who partitions tables
> based on Primary Key in Hive?  If you read the environments section you
> will be able to see that all the required details are mentioned.
>
> As far as I understand, Hive does work 25x faster (in these particular
> cases) and around 100x faster (when we are using TEZ) when compared to
> SPARK.
>
> It will be interesting to see if Ted includes these findings while they
> are benchmarking SPARK. This is a very typical and general use case.
>
>
> Regards,
> Gourav
>
> On Thu, Jun 9, 2016 at 5:11 PM, Stephen Boesch <java...@gmail.com> wrote:
>
>> ooc are the tables partitioned on a.pk and b.fk?  Hive might be using
>> copartitioning in that case: it is one of hive's strengths.
>>
>> 2016-06-09 7:28 GMT-07:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>>
>>> Hi Mich,
>>>
>>> Doesn't Hive use map-reduce? I thought it did. And since I am
>>> running the queries in EMR 4.6, HIVE is not using TEZ.
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> are you using map-reduce with Hive?
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> On 9 June 2016 at 15:14, Gourav Sengupta <gourav.sengu...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>>>> here and is there a way we can optimize the queries in SPARK without the
>>>>> obvious hack in Query2.
>>>>>
>>>>>
>>>>> -----------------------
>>>>> ENVIRONMENT:
>>>>> -----------------------
>>>>>
>>>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>>>> million rows. Both the files are single gzipped csv file.
>>>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>>>> accessed through SPARK using HiveContext
>>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>>>
>>>>> --------------
>>>>> QUERY1:
>>>>> --------------
>>>>> select A.PK, B.FK
>>>>> from A
>>>>> left outer join B on (A.PK = B.FK)
>>>>> where B.FK is not null;
>>>>>
>>>>>
>>>>>
>>>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>>>
>>>>>
>>>>> --------------
>>>>> QUERY 2:
>>>>> --------------
>>>>>
>>>>> select A.PK, B.FK
>>>>> from (select PK from A) A
>>>>> left outer join B on (A.PK = B.FK)
>>>>> where B.FK is not null;
>>>>>
>>>>> This query takes 4.5 mins in SPARK
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Gourav Sengupta
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
