Also, with Spark using the Hive metastore you have a real programming environment and can write real functions (Scala/Python), versus being boxed into some version of SQL plus a limited set of UDFs?
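On the UDF point, Hive's SQL is itself extensible with permanent Java UDFs, so the gap may be narrower than "limited udfs" suggests. A rough sketch of both registration styles (the class name, jar path, and function name below are made up for illustration):

```sql
-- Hive: register a permanent UDF backed by a jar on HDFS
-- (com.example.udf.MyUpper and the jar path are hypothetical).
CREATE FUNCTION my_upper AS 'com.example.udf.MyUpper'
USING JAR 'hdfs:///udfs/my-udfs.jar';

-- Spark SQL with Hive support: the same class can be registered
-- as a session-scoped function.
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper';

SELECT my_upper(random_string) FROM dummy LIMIT 1;
```

The extra freedom in Spark proper is that one can also register an inline Scala or Python function via `sqlContext.udf.register`, without packaging a jar at all.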
On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:

> When comparing the performance, you need to do it apple vs apple. In another
> thread, you mentioned that Hive on Spark is much slower than Spark SQL.
> However, you configured Hive such that only two tasks can run in parallel,
> and you didn't provide information on how much Spark SQL is utilizing. Thus,
> it's hard to tell whether it's just a configuration problem in your Hive or
> Spark SQL is indeed faster. You should be able to see the resource usage in
> the YARN resource manager URL.
>
> --Xuefu
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
>> Thanks Jeff.
>>
>> Obviously Hive is much more feature rich compared to Spark. Having said
>> that, in certain areas, for example where the SQL feature is available in
>> Spark, Spark seems to deliver faster. This may be because:
>>
>> 1. Spark does both the optimisation and the execution seamlessly
>> 2. Hive on Spark has to invoke YARN, which adds another layer to the process
>>
>> Now I did some simple tests on a 100-million-row ORC table available
>> through Hive to both.
>> *Spark 1.5.2 with Hive 1.2.1 metastore*
>>
>> spark-sql> select * from dummy where id in (1, 5, 100000);
>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi       1  xxxxxxxxxx
>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA       5  xxxxxxxxxx
>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>
>> spark-sql> select * from dummy where id in (1, 5, 100000);
>> (same three rows returned)
>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>
>> spark-sql> select * from dummy where id in (1, 5, 100000);
>> (same three rows returned)
>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>
>> So three runs, each returning the same three rows in just over 50 seconds.
>>
>> *Hive 1.2.1 with Spark 1.3.1 as execution engine*
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>> INFO  : Query Hive on Spark job[4] stages:
>> INFO  : 4
>> INFO  : Status: Running (Hive on Spark job[4])
>> INFO  : Status: Finished successfully in 82.49 seconds
>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>> 3 rows selected (82.66 seconds)
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>> INFO  : Status: Finished successfully in 76.67 seconds
>> (same three rows returned)
>> 3 rows selected (76.835 seconds)
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>> INFO  : Status: Finished successfully in 80.54 seconds
>> (same three rows returned)
>> 3 rows selected (80.718 seconds)
>>
>> Three runs returning the same rows in around 80 seconds.
>>
>> It is possible that the lag is because the Spark engine I use with Hive is 1.3.1, which is out of date.
>>
>> There are also certain queries that one cannot run with Spark. Besides, it does not recognise CHAR fields, which is a pain.
>>
>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>          > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
>>          > FROM sales s, times t, channels c
>>          > WHERE s.time_id = t.time_id
>>          > AND s.channel_id = c.channel_id
>>          > GROUP BY t.calendar_month_desc, c.channel_desc;
>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7 .
>> You are likely trying to use an unsupported Hive feature.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> Sybase ASE 15 Gold Medal Award 2008
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>> Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
>> Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
>> Publications due shortly:
>> "Complex Event Processing in Heterogeneous Environments", ISBN 978-0-9563693-3-8
>> "Oracle and Sybase, Concepts and Contrasts", ISBN 978-0-9563693-1-4, volume one out shortly
>> http://talebzadehmich.wordpress.com
>>
>> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.
>>
>> From: Xuefu Zhang [mailto:xzh...@cloudera.com]
>> Sent: 02 February 2016 23:12
>> To: user@hive.apache.org
>> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>>
>> I think the difference is not only about which component does the optimization but more about feature parity. Hive on Spark offers all the functional features that Hive offers, and these features play out faster.
>> However, Spark SQL is far from offering this parity, as far as I know.
>>
>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>
>> Hi,
>>
>> My understanding is that with the Hive on Spark engine, one gets the Hive optimizer and the Spark query engine.
>>
>> With Spark using the Hive metastore, Spark does both the optimization and the query execution. The only value-add is that one can access the underlying Hive tables from spark-sql etc.
>>
>> Is this assessment correct?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> http://talebzadehmich.wordpress.com
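Regarding the failed CREATE TEMPORARY TABLE ... AS SELECT in Mich's test: as far as I know, Spark 1.5's parser only accepts CREATE TEMPORARY TABLE with a USING data-source clause, not as CTAS, hence the "Unhandled clauses: TEMPORARY" error. If the goal is a session-scoped materialised result, one workaround (a sketch only, reusing the sales/times/channels names from the thread) is CACHE TABLE ... AS SELECT:

```sql
-- Session-scoped alternative to CREATE TEMPORARY TABLE ... AS SELECT
-- in spark-sql; materialises the aggregate in memory for this session.
CACHE TABLE tmp AS
SELECT t.calendar_month_desc,
       c.channel_desc,
       SUM(s.amount_sold) AS TotalSales
FROM   sales s
JOIN   times t    ON s.time_id    = t.time_id
JOIN   channels c ON s.channel_id = c.channel_id
GROUP  BY t.calendar_month_desc, c.channel_desc;

-- The cached result can then be queried like any other table:
SELECT * FROM tmp ORDER BY TotalSales DESC;
```

The cache lives only as long as the spark-sql session (or until UNCACHE TABLE tmp), so this is not a substitute where a persistent table is needed.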