Also, with Spark using the Hive metastore you have a real programming environment and can write real functions (Scala/Python), versus being boxed into some version of SQL plus a limited set of UDFs?
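On the UDF point, Hive's SQL is itself extensible with permanent Java UDFs, so the gap may be narrower than "limited udfs" suggests. A rough sketch of both registration styles (the class name, jar path, and function name below are made up for illustration):

```sql
-- Hive: register a permanent UDF backed by a jar on HDFS
-- (com.example.udf.MyUpper and the jar path are hypothetical).
CREATE FUNCTION my_upper AS 'com.example.udf.MyUpper'
USING JAR 'hdfs:///udfs/my-udfs.jar';

-- Spark SQL with Hive support: the same class can be registered
-- as a session-scoped function.
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper';

SELECT my_upper(random_string) FROM dummy LIMIT 1;
```

The extra freedom in Spark proper is that one can also register an inline Scala or Python function via `sqlContext.udf.register`, without packaging a jar at all.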
On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:

> When comparing the performance, you need to do it apple vs apple. In another
> thread, you mentioned that Hive on Spark is much slower than Spark SQL.
> However, you configured Hive such that only two tasks can run in parallel,
> and you didn't provide information on how much Spark SQL is utilizing. Thus,
> it's hard to tell whether it's just a configuration problem in your Hive or
> Spark SQL is indeed faster. You should be able to see the resource usage in
> the YARN resource manager URL.
>
> --Xuefu
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
>> Thanks Jeff.
>>
>> Obviously Hive is much more feature rich compared to Spark. Having said
>> that, in certain areas, for example where the SQL feature is available in
>> Spark, Spark seems to deliver faster. This may be because:
>>
>> 1. Spark does both the optimisation and the execution seamlessly
>> 2. Hive on Spark has to invoke YARN, which adds another layer to the process
>>
>> Now I did some simple tests on a 100-million-row ORC table available
>> through Hive to both.
>> *Spark 1.5.2 with Hive 1.2.1 metastore*
>>
>> spark-sql> select * from dummy where id in (1, 5, 100000);
>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi       1  xxxxxxxxxx
>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA       5  xxxxxxxxxx
>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>
>> spark-sql> select * from dummy where id in (1, 5, 100000);
>> (same three rows returned)
>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>
>> spark-sql> select * from dummy where id in (1, 5, 100000);
>> (same three rows returned)
>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>
>> So three runs, each returning the same three rows in just over 50 seconds.
>>
>> *Hive 1.2.1 with Spark 1.3.1 as execution engine*
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>> INFO  : Query Hive on Spark job[4] stages:
>> INFO  : 4
>> INFO  : Status: Running (Hive on Spark job[4])
>> INFO  : Status: Finished successfully in 82.49 seconds
>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>> 3 rows selected (82.66 seconds)
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>> INFO  : Status: Finished successfully in 76.67 seconds
>> (same three rows returned)
>> 3 rows selected (76.835 seconds)
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>> INFO  : Status: Finished successfully in 80.54 seconds
>> (same three rows returned)
>> 3 rows selected (80.718 seconds)
>>
>> Three runs returning the same rows in around 80 seconds.
>>
>> It is possible that the lag is because the Spark engine I use with Hive is 1.3.1, which is out of date.
>>
>> There are also certain queries that one cannot run with Spark. Besides, it does not recognise CHAR fields, which is a pain.
>>
>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>          > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
>>          > FROM sales s, times t, channels c
>>          > WHERE s.time_id = t.time_id
>>          > AND s.channel_id = c.channel_id
>>          > GROUP BY t.calendar_month_desc, c.channel_desc;
>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7 .
>> You are likely trying to use an unsupported Hive feature.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> Sybase ASE 15 Gold Medal Award 2008
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>> Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
>> Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
>> Publications due shortly:
>> "Complex Event Processing in Heterogeneous Environments", ISBN 978-0-9563693-3-8
>> "Oracle and Sybase, Concepts and Contrasts", ISBN 978-0-9563693-1-4, volume one out shortly
>> http://talebzadehmich.wordpress.com
>>
>> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.
>>
>> From: Xuefu Zhang [mailto:xzh...@cloudera.com]
>> Sent: 02 February 2016 23:12
>> To: user@hive.apache.org
>> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>>
>> I think the difference is not only about which component does the optimization but more about feature parity. Hive on Spark offers all the functional features that Hive offers, and these features play out faster.
>> However, Spark SQL is far from offering this parity, as far as I know.
>>
>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>
>> Hi,
>>
>> My understanding is that with the Hive on Spark engine, one gets the Hive optimizer and the Spark query engine.
>>
>> With Spark using the Hive metastore, Spark does both the optimization and the query execution. The only value-add is that one can access the underlying Hive tables from spark-sql etc.
>>
>> Is this assessment correct?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> http://talebzadehmich.wordpress.com
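Regarding the failed CREATE TEMPORARY TABLE ... AS SELECT in Mich's test: as far as I know, Spark 1.5's parser only accepts CREATE TEMPORARY TABLE with a USING data-source clause, not as CTAS, hence the "Unhandled clauses: TEMPORARY" error. If the goal is a session-scoped materialised result, one workaround (a sketch only, reusing the sales/times/channels names from the thread) is CACHE TABLE ... AS SELECT:

```sql
-- Session-scoped alternative to CREATE TEMPORARY TABLE ... AS SELECT
-- in spark-sql; materialises the aggregate in memory for this session.
CACHE TABLE tmp AS
SELECT t.calendar_month_desc,
       c.channel_desc,
       SUM(s.amount_sold) AS TotalSales
FROM   sales s
JOIN   times t    ON s.time_id    = t.time_id
JOIN   channels c ON s.channel_id = c.channel_id
GROUP  BY t.calendar_month_desc, c.channel_desc;

-- The cached result can then be queried like any other table:
SELECT * FROM tmp ORDER BY TotalSales DESC;
```

The cache lives only as long as the spark-sql session (or until UNCACHE TABLE tmp), so this is not a substitute where a persistent table is needed.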