Re: Hive on Spark Engine versus Spark using Hive metastore

Koert Kuipers Tue, 02 Feb 2016 20:47:30 -0800

Yes. The ability to start with SQL and, when needed, expand into full-blown
programming languages, machine learning, etc. is a huge plus. After all, this
is a cluster, and merely querying or extracting data to move it off the
cluster into some other analytics tool is going to be very inefficient and,
to some extent, defeats the purpose of having a cluster. So you want the
capability to do more than queries and ETL, and Spark is that ticket. Hive
simply is not, at least not for anything somewhat complex.



On Tue, Feb 2, 2016 at 8:06 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Hi,
>
>
>
> Are you referring to spark-shell with Scala, Python and others?
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* 03 February 2016 00:09
>
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
> When comparing performance, you need to compare apples to apples. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive so that only two tasks can run in
> parallel, and you didn't say how many resources Spark SQL is
> using. Thus, it's hard to tell whether this is just a configuration
> problem in your Hive setup or whether Spark SQL is indeed faster. You
> should be able to see the resource usage in the YARN ResourceManager web UI.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature-rich than Spark. Having said
> that, in areas where a SQL feature is available in Spark, Spark seems
> to deliver results faster.
>
>
>
> This may be because:
>
>
>
> 1.    Spark does both the optimisation and the execution seamlessly
>
> 2.    Hive on Spark has to invoke YARN, which adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100-million-row ORC table that is
> available to both through Hive.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 100000);
>
> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx
> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx
> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 100000);
>
> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx
> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx
> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx
>
> Time taken: 50.358 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 100000);
>
> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx
> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx
> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx
>
> Time taken: 50.563 seconds, Fetched 3 row(s)
>
>
>
> So three runs, each returning three rows in just over 50 seconds.
>
>
>
> *Hive 1.2.1 on spark 1.3.1 execution engine*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>
> INFO  :
>
> Query Hive on Spark job[4] stages:
>
> INFO  : 4
>
> INFO  :
>
> Status: Running (Hive on Spark job[4])
>
> INFO  : Status: Finished successfully in 82.49 seconds
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> 3 rows selected (82.66 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>
> INFO  : Status: Finished successfully in 76.67 seconds
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> 3 rows selected (76.835 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>
> INFO  : Status: Finished successfully in 80.54 seconds
>
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>
> 3 rows selected (80.718 seconds)
>
>
>
> Three runs returning the same rows in roughly 77 to 83 seconds.
>
>
>
> It is possible that the Spark engine I use with Hive, version 1.3.1, is
> out of date and that this causes the lag.
>
>
>
> There are certain queries that one cannot run with Spark. Besides, it does
> not recognize CHAR fields, which is a pain.
>
>
>
> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>
>          > SELECT t.calendar_month_desc, c.channel_desc,
> SUM(s.amount_sold) AS TotalSales
>
>          > FROM sales s, times t, channels c
>
>          > WHERE s.time_id = t.time_id
>
>          > AND   s.channel_id = c.channel_id
>
>          > GROUP BY t.calendar_month_desc, c.channel_desc
>
>          > ;
>
> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>
> .
>
> You are likely trying to use an unsupported Hive feature.";
>
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 02 February 2016 23:12
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> I think the difference is not only about which component does the
> optimization but more about feature parity. Hive on Spark offers all the
> functional features that Hive offers, and these features run faster.
> However, Spark SQL is far from offering this parity as far as I know.
>
>
>
> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Hi,
>
>
>
> My understanding is that with the Hive on Spark engine, one gets the Hive
> optimizer and the Spark query engine.
>
>
>
> With Spark using the Hive metastore, Spark provides both the optimizer and
> the query engine. The only value added is that one can access the
> underlying Hive tables from spark-sql etc.
>
>
>
>
>
> Is this assessment correct?
>
>
>
>
>
>
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
>
>
>
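
For reference, the six elapsed times quoted in the thread above can be summarised with a few lines of Python. This is only a rough sketch of the arithmetic: the variable and function names are made up for illustration, and the numbers are simply copied from the spark-sql and beeline output. As noted in the thread, the two setups used different Spark versions (1.5.2 versus 1.3.1) and the resource allocation was not stated, so the ratio is not an apples-to-apples measurement.

```python
from statistics import mean

# Elapsed times in seconds, copied from the runs quoted in the thread.
spark_sql_runs = [50.805, 50.358, 50.563]      # "Time taken" lines from spark-sql
hive_on_spark_runs = [82.66, 76.835, 80.718]   # "rows selected" lines from beeline


def summarise(runs):
    """Return (mean, min, max) for a list of elapsed times."""
    return mean(runs), min(runs), max(runs)


spark_mean, spark_min, spark_max = summarise(spark_sql_runs)
hive_mean, hive_min, hive_max = summarise(hive_on_spark_runs)

# How much longer the Hive on Spark runs took on average in this one test.
ratio = hive_mean / spark_mean

print(f"Spark SQL:     mean {spark_mean:.2f}s (min {spark_min}, max {spark_max})")
print(f"Hive on Spark: mean {hive_mean:.2f}s (min {hive_min}, max {hive_max})")
print(f"Hive on Spark / Spark SQL: {ratio:.2f}x")
```

With only three runs per engine and a single query, this says nothing about the spread one would see under a matched configuration.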
