Sounds like this is a one-off case.

Do you have any other use cases where Hive on MR outperforms Spark?

I did some tests on a 1-billion-row table, getting the selectivity of a
column using Hive on MR, Hive on the Spark engine, and Spark running in
local mode (to keep it simple).
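Roughly speaking, selectivity here is distinct values over total rows, so the test boils down to a query of this shape (a sketch; the table and column names are placeholders, not the actual ones used):

```sql
-- Sketch of a column-selectivity probe: distinct count over row count.
-- Table name (dummy) and column name (amount_sold) are illustrative only.
SELECT COUNT(DISTINCT amount_sold) / COUNT(1) AS selectivity
FROM dummy;
```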


Hive 2, Spark 1.6.1

Results:

Hive with map-reduce --> 18  minutes
Hive on Spark engine -->  6 minutes
Spark                -->  2 minutes


HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 June 2016 at 08:43, Jörn Franke <jornfra...@gmail.com> wrote:

> I agree here.
>
> However it depends always on your use case !
>
> Best regards
>
> On 16 Jun 2016, at 04:58, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi Mahender,
>
> please ensure that you are enabling the broadcast method for dimension
> tables. You should be able to see surprising gains, around 12x.
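> As a sketch (table names here are placeholders), the broadcast can be
> enabled by raising spark.sql.autoBroadcastJoinThreshold above the
> dimension table's size; the default in Spark 1.6 is only 10 MB:
>
> ```sql
> -- Placeholder names: fact is the large table, dim the small dimension table.
> -- Raise the broadcast threshold above the dimension table's on-disk size.
> SET spark.sql.autoBroadcastJoinThreshold=104857600;  -- 100 MB
> SELECT f.pk, d.attr
> FROM fact f
> JOIN dim d ON f.dim_key = d.key;
> ```
>
> (On the DataFrame side the same thing can be requested explicitly with the
> broadcast() function from pyspark.sql.functions.)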
>
> Overall I think that SPARK cannot figure out whether to scan all the
> columns in a table or just the ones which are actually being used, and
> that is causing this issue.
>
> When you start using HIVE with ORC and TEZ you will see some amazing
> results that leave SPARK way, way behind. So pretty much you need to have
> your data in memory to match the performance claims of SPARK, and the
> advantage you are getting in that case comes not from SPARK's algorithms
> but simply from fast I/O out of RAM. The advantage of SPARK is that it
> makes analytics, querying, and streaming frameworks accessible together.
>
>
> If you follow the optimisations mentioned in this link, you hardly have
> any reason for using SPARK SQL:
> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
> imagine being able to do all of that without machines that require huge
> RAM; in short, you achieve those performance gains on the commodity
> low-cost systems around which HADOOP was designed.
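> For reference, the core of those optimisations boils down to a handful of
> settings plus an ORC copy of the table, roughly along these lines (a
> sketch; the table names are placeholders):
>
> ```sql
> -- Run Hive on Tez with vectorization and cost-based optimisation.
> SET hive.execution.engine=tez;
> SET hive.vectorized.execution.enabled=true;
> SET hive.compute.query.using.stats=true;
> SET hive.cbo.enable=true;
>
> -- Store the data as ORC (columnar, compressed) instead of flat text.
> CREATE TABLE a_orc STORED AS ORC
> TBLPROPERTIES ("orc.compress"="SNAPPY")
> AS SELECT * FROM a;
> ```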
>
> I think that Hortonworks is giving a stiff competition here :)
>
> Regards,
> Gourav Sengupta
>
> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <
> mahender.bigd...@outlook.com> wrote:
>
>> +1,
>>
>> We also see performance degradation when comparing Spark SQL with Hive.
>> We have a table of 260 columns and executed the same query in Hive and
>> SPARK. In Hive it takes 66 sec for 1 GB of data, whereas in Spark it
>> takes 4 minutes.
>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>
>> Could you print out the sql execution plan? My guess is about broadcast
>> join.
>>
>>
>>
>> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>> here, and is there a way to optimize the queries in SPARK without the
>> obvious hack in Query2?
>>
>>
>> -----------------------
>> ENVIRONMENT:
>> -----------------------
>>
>> > Table A has 533 columns x 24 million rows and Table B has 2 columns x 3
>> million rows. Both are single gzipped CSV files.
>> > Both Table A and Table B are external tables in AWS S3, created in HIVE
>> and accessed through SPARK using HiveContext.
>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started with
>> maximizeResourceAllocation and node type c3.4xlarge).
>>
>> --------------
>> QUERY1:
>> --------------
>> select A.PK, B.FK
>> from A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>>
>>
>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>
>>
>> --------------
>> QUERY 2:
>> --------------
>>
>> select A.PK, B.FK
>> from (select PK from A) A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>> This query takes 4.5 mins in SPARK
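>> The plans for the two variants can be compared with EXPLAIN to see what
>> each one actually scans, e.g. for the second query:
>>
>> ```sql
>> -- Inspect the plan for the subquery variant (same query as above).
>> EXPLAIN EXTENDED
>> SELECT A.PK, B.FK
>> FROM (SELECT PK FROM A) A
>> LEFT OUTER JOIN B ON (A.PK = B.FK)
>> WHERE B.FK IS NOT NULL;
>> ```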
>>
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>>
>>
>
