yeah, but have you ever seen someone write a real analytical program in hive? how? where are the basic abstractions to wrap up a large number of operations (joins, groupBys) into a single function call? where are the tools to write nice unit tests for that?
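(The abstraction being asked about here doesn't need Spark to illustrate; below is a minimal pure-Python stand-in, with all table and field names hypothetical, loosely mirroring the sales/times/channels query later in this thread. It wraps two joins and a group-by in one reusable function that ordinary unit-test assertions can exercise.)

```python
from collections import defaultdict

def monthly_sales(sales, times, channels):
    """Join sales to times and channels, then sum amount_sold per
    (calendar_month_desc, channel_desc) -- many operations, one call.
    Tables are plain lists of dicts; the schema is invented."""
    time_by_id = {t["time_id"]: t for t in times}
    channel_by_id = {c["channel_id"]: c for c in channels}
    totals = defaultdict(float)
    for s in sales:
        month = time_by_id[s["time_id"]]["calendar_month_desc"]
        channel = channel_by_id[s["channel_id"]]["channel_desc"]
        totals[(month, channel)] += s["amount_sold"]
    return dict(totals)

# a unit test is just ordinary code against ordinary values
sales = [
    {"time_id": 1, "channel_id": 10, "amount_sold": 5.0},
    {"time_id": 1, "channel_id": 10, "amount_sold": 7.0},
    {"time_id": 2, "channel_id": 20, "amount_sold": 3.0},
]
times = [{"time_id": 1, "calendar_month_desc": "2016-01"},
         {"time_id": 2, "calendar_month_desc": "2016-02"}]
channels = [{"channel_id": 10, "channel_desc": "Direct"},
            {"channel_id": 20, "channel_desc": "Internet"}]

result = monthly_sales(sales, times, channels)
assert result[("2016-01", "Direct")] == 12.0
assert result[("2016-02", "Internet")] == 3.0
```

In Spark the same shape would be a `DataFrame => DataFrame` function; the point is only that the whole pipeline has a name, a signature, and a test.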
for example, in spark i can write a DataFrame => DataFrame function that internally does many joins, groupBys and complex operations, all unit tested and perfectly re-usable. and in hive? copy-pasting SQL queries around? that's just dangerous.

On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Hive has numerous extension points; you are not boxed in by a long shot.
>
> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>
>> uuuhm, with spark using the Hive metastore you actually have a real
>> programming environment and you can write real functions, versus just
>> being boxed into some version of sql and limited udfs?
>>
>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>
>>> When comparing performance, you need to compare apples to apples. In
>>> another thread, you mentioned that Hive on Spark is much slower than
>>> Spark SQL. However, you had configured Hive such that only two tasks
>>> could run in parallel, and you didn't provide information on how much
>>> resource Spark SQL was utilizing. Thus, it's hard to tell whether it's
>>> just a configuration problem in your Hive setup or whether Spark SQL is
>>> indeed faster. You should be able to see the resource usage in the YARN
>>> resource manager URL.
>>>
>>> --Xuefu
>>>
>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>
>>>> Thanks Jeff.
>>>>
>>>> Obviously Hive is much more feature-rich compared to Spark. Having
>>>> said that, in certain areas, for example where the SQL feature is
>>>> available in Spark, Spark seems to deliver faster. This may be because:
>>>>
>>>> 1. Spark does both the optimisation and the execution seamlessly
>>>> 2. Hive on Spark has to invoke YARN, which adds another layer to the
>>>>    process
>>>>
>>>> Now I did some simple tests on a 100-million-row ORC table available
>>>> through Hive to both.
>>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>>
>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
>>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
>>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>
>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
>>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
>>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>>
>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
>>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
>>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>>
>>>> So three runs returning three rows, each in just over 50 seconds.
>>>>
>>>> *Hive 1.2.1 on Spark 1.3.1 execution engine*
>>>>
>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>> INFO  :
>>>> Query Hive on Spark job[4] stages:
>>>> INFO  : 4
>>>> INFO  :
>>>> Status: Running (Hive on Spark job[4])
>>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> 3 rows selected (82.66 seconds)
>>>>
>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> 3 rows selected (76.835 seconds)
>>>>
>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>> 3 rows selected (80.718 seconds)
>>>>
>>>> Three runs returning the same rows in around 80 seconds.
>>>>
>>>> It is possible that my Spark engine under Hive, being 1.3.1, is out of
>>>> date and that this causes the lag.
>>>>
>>>> There are also certain queries that one cannot do with Spark. Besides,
>>>> it does not recognise CHAR fields, which is a pain. For example:
>>>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>>>          > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
>>>>          > FROM sales s, times t, channels c
>>>>          > WHERE s.time_id = t.time_id
>>>>          > AND s.channel_id = c.channel_id
>>>>          > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>          > ;
>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
>>>> You are likely trying to use an unsupported Hive feature.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>> Author of the book *"A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7*.
>>>> Co-author of *"Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4*.
>>>> *Publications due shortly:*
>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN 978-0-9563693-3-8
>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN 978-0-9563693-1-4, volume one out shortly
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> NOTE: The information in this email is proprietary and confidential.
>>>> This message is for the designated recipient only; if you are not the
>>>> intended recipient, you should destroy it immediately. Any information
>>>> in this message shall not be understood as given or endorsed by
>>>> Peridale Technology Ltd, its subsidiaries or their employees, unless
>>>> expressly so stated.
It is the responsibility of the recipient to ensure that this email is
>>>> virus free; therefore, neither Peridale Technology Ltd, its
>>>> subsidiaries nor their employees accept any responsibility.
>>>>
>>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
>>>> *Sent:* 02 February 2016 23:12
>>>> *To:* user@hive.apache.org
>>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>
>>>> I think the difference is not only about which layer does the
>>>> optimization but more about feature parity. Hive on Spark offers all
>>>> the functional features that Hive offers, and makes those features run
>>>> faster. However, Spark SQL is far from offering this parity, as far as
>>>> I know.
>>>>
>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>>
>>>> Hi,
>>>>
>>>> My understanding is that with the Hive on Spark engine, one gets the
>>>> Hive optimizer and the Spark query engine.
>>>>
>>>> With Spark using the Hive metastore, Spark does both the optimization
>>>> and the query execution. The only value-add is that one can access the
>>>> underlying Hive tables from spark-sql etc.
>>>>
>>>> Is this assessment correct?
>>>>
>>>> Thanks
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>
>>
>
> --
> Sorry, this was sent from mobile. Will do less grammar and spell check than usual.
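(A footnote on the `CREATE TEMPORARY TABLE ... AS SELECT` that Spark SQL 1.5.2 rejected above: the statement itself is ordinary SQL, which any engine supporting CTAS into a temporary table will accept. A sqlite3 sketch with invented toy data, sqlite3 chosen only because it ships with Python, not because it is comparable to Spark or Hive:)

```python
import sqlite3

# toy stand-ins for the sales/times/channels tables in the thread
# (schema and data invented for illustration)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sales (time_id INTEGER, channel_id INTEGER, amount_sold REAL);
CREATE TABLE times (time_id INTEGER, calendar_month_desc TEXT);
CREATE TABLE channels (channel_id INTEGER, channel_desc TEXT);
INSERT INTO sales VALUES (1, 10, 5.0), (1, 10, 7.0), (2, 20, 3.0);
INSERT INTO times VALUES (1, '2016-01'), (2, '2016-02');
INSERT INTO channels VALUES (10, 'Direct'), (20, 'Internet');
""")

# the statement Spark SQL 1.5.2 rejected, run here unchanged
cur.execute("""
CREATE TEMPORARY TABLE tmp AS
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
AND s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc
""")

rows = cur.execute(
    "SELECT calendar_month_desc, channel_desc, TotalSales "
    "FROM tmp ORDER BY calendar_month_desc"
).fetchall()
print(rows)  # [('2016-01', 'Direct', 12.0), ('2016-02', 'Internet', 3.0)]
```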