Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Elliot West
>>>>> | 10       | 99              | 999             | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 10             | xx            |
>>>>>
>>>>> +----------+-----------------+-----------------+------------------+----------------------------------------------------+----------------+---------------+
>>>>>
>>>>> 3 rows selected (76.835 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>>>> (1, 5, 10);
>>>>>
>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>>
>>>>>
>>>>> +----------+-----------------+-----------------+------------------+----------------------------------------------------+----------------+---------------+
>>>>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised | dummy.random_string                                | dummy.small_vc | dummy.padding |
>>>>> +----------+-----------------+-----------------+------------------+----------------------------------------------------+----------------+---------------+
>>>>> | 1        | 0               | 0               | 63               | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1              | xx            |
>>>>> | 5        | 0               | 4               | 31               | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5              | xx            |
>>>>> | 10       | 99              | 999             | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 10             | xx            |
>>>>> +----------+-----------------+-----------------+------------------+----------------------------------------------------+----------------+---------------+
>>>>>
>>>>> 3 rows selected (80.718 seconds)
>>>>>
>>>>>
>>>>>
>>>>> Three runs returning the same rows in 80 seconds.
>>>>>
>>>>>
>>>>>
>>>>> It is possible that my Hive on Spark engine is using Spark 1.3.1, which is
>>>>> out of date, and that this causes the lag.
>>>>>
>>>>>
>>>>>
>>>>> There are certain queries that one cannot do with Spark. Besides, it
>>>>> does not recognize CHAR fields, which is a pain.
>>>>>
>>>>>
>>>>>
>>>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>>>
>>>>>  > SELECT t.calendar_month_desc, c.channel_desc,
>>>>> SUM(s.amount_sold) AS TotalSales
>>>>>
>>>>>  > FROM sales s, times t, channels c
>>>>>
>>>>>  > WHERE s.time_id = t.time_id
>>>>>
>>>>>  > AND   s.channel_id = c.channel_id
>>>>>
>>>>>  > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>>
>>>>>  > ;
>>>>>
>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
>>>>>
>>>>> You are likely trying to use an unsupported Hive feature.
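The rejected statement above is a plain CTAS into a temporary table, which Spark SQL at this version did not support. As a point of comparison only, here is a sketch of the same SQL shape run against SQLite (chosen purely because it is trivially runnable, not because it is related to Hive or Spark; the toy sales table here is invented and unrelated to the thread's real data):

```python
import sqlite3

# Toy stand-in data; the real sales table in the thread is much larger.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (time_id INT, channel_id INT, amount_sold REAL);
    INSERT INTO sales VALUES (1, 10, 100.0), (1, 10, 50.0), (2, 20, 75.0);
""")

# The same CTAS-into-temporary-table pattern that Spark SQL 1.x rejects:
conn.execute("""
    CREATE TEMPORARY TABLE tmp AS
    SELECT time_id, channel_id, SUM(amount_sold) AS TotalSales
    FROM sales
    GROUP BY time_id, channel_id
""")

rows = conn.execute("SELECT * FROM tmp ORDER BY time_id").fetchall()
print(rows)  # [(1, 10, 150.0), (2, 20, 75.0)]
```

In engines that do support it (Hive, SQLite, most RDBMSs), the temporary table lives only for the session, which is exactly what the failing query was after.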
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn: 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>>
>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>>
>>>>>
>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>>
>>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>>
>>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>>> 978-0-9759693-0-4*
>>>>>
>>>>> *Publications due shortly:*
>>>>>
>>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>>> 978-0-9563693-3-8
>>>>>
>>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, 
>>>>> volume
>>>>> one out shortly
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> NOTE: The information in this email is proprietary and confidential.
>>>>> This message is for the designated recipient only, if you are not the
>>>>> intended recipient, you should destroy it immediately. Any information in
>>>>> this message shall not be understood as given or endorsed by Peridale
>>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>>>> stated. It is the responsibility of the recipient to ensure that this 
>>>>> email
>>>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>>>> nor their employees accept any responsibility.
>>>>>
>>>>>
>>>>>
>>>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
>>>>> *Sent:* 02 February 2016 23:12
>>>>> *To:* user@hive.apache.org
>>>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>>
>>>>>
>>>>>
>>>>> I think the difference is not only about which layer does the optimization
>>>>> but more about feature parity. Hive on Spark offers all the functional
>>>>> features that Hive offers, and these features run faster. However, Spark
>>>>> SQL is far from offering this parity as far as I know.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh 
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> My understanding is that with Hive on Spark engine, one gets the Hive
>>>>> optimizer and Spark query engine
>>>>>
>>>>>
>>>>>
>>>>> With Spark using the Hive metastore, Spark does both the optimization and
>>>>> the query execution. The only value add is that one can access the 
>>>>> underlying Hive
>>>>> tables from spark-sql etc
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Is this assessment correct?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
>>
>
>


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Mich Talebzadeh
Hi Edward,

 

There is another angle to it as well. Fit for purpose.

 

We are currently migrating from a proprietary DW on SAN to Hive on JBOD. It is 
going smoothly, and it will save us $$ in licensing fees at a time when 
technology and storage dollars are at a premium.

 

Our DBAs, who look after Oracle, SAP ASE and others, are comfortable with Hive. 
They can look after the metastore (on Oracle) and are working with me on HA for 
the metastore and HiveServer2, in line with the standard for other databases.

 

I am sure if we had started with Spark, that would have worked too, but what the 
heck. We have MongoDB as well, independent of HDFS.

 

These arguments about what is better or worse are the ones we have had for years 
about Oracle, Sybase, MSSQL etc. I believe Hive is better for us because I 
think in Hive. If I were more familiar with Spark, I am sure it would have been 
the opposite.

 

We can go in circles. Religious arguments really.

 

 

HTH,

 

 

 

Dr Mich Talebzadeh

 


 

From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: 04 February 2016 17:41
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

Hive is not the correct tool for every problem. Use the tool that makes the 
most sense for your problem and your experience. 

 

Many people like hive because it is generally applicable. In my case study for 
the hive book I highlighted that many smart, capable organizations use hive. 

Your argument is totally valid. You like X better because X works for you. You 
don't need to 'preach' here; we all know hive has its limits. 

 

On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers wrote:

Is the sky the limit? I know udfs can be used inside hive, like lambdas 
basically i assume, and i will assume you have something similar for 
aggregations. But that's just abstractions inside a single map or reduce phase, 
pretty low level stuff. What you really need is abstractions around many map 
and reduce phases, because that is the level an algo is expressed at.

For example when doing logistic regression you want to be able to do something 
like:
read("somefile").train(settings).write("model")
Here train is an externally defined method that is well tested and could do many 
map and reduce steps internally (or even be defined at a higher level and 
compile into those steps). What is the equivalent in hive? Copy pasting crucial 
parts of the algo around while using udfs is just not the same thing in terms 
of reusability and abstraction. It's the opposite of keeping it DRY.

On Feb 3, 2016 1:06 AM, "Ryan Harris" wrote:

https://github.com/myui/hivemall

 

as long as you are comfortable with java UDFs, the sky is really the 
limit...it's not for everyone and spark does have many advantages, but they are 
two tools that can complement each other in numerous ways.

 

I don't know that there is necessarily a universal "better" for how to use 
spark as an execution engine (or if spark is necessarily the *best* execution 
engine for any given hive job).

 

The reality is that once you start factoring in the numerous tuning parameters 
of the systems and jobs there probably isn't a clear answer.  For some queries, 
the Catalyst optimizer may do a better job...is it going to do a better job 
with ORC based data? less likely IMO. 

 

From: Koert Kuipers [mailto:ko...@tresata.com] 
Sent: Tuesday, February 02, 2016 9:50 PM
To: user@hive.apache.org

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Koert Kuipers
fair enough

On Thu, Feb 4, 2016 at 12:41 PM, Edward Capriolo 
wrote:

> Hive is not the correct tool for every problem. Use the tool that makes
> the most sense for your problem and your experience.
>
> Many people like hive because it is generally applicable. In my case study
> for the hive book I highlighted that many smart, capable organizations use hive.
>
> Your argument is totally valid. You like X better because X works for you.
> You don't need to 'preach' here; we all know hive has its limits.
>
> On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers  wrote:
>
>> Is the sky the limit? I know udfs can be used inside hive, like lambdas
>> basically i assume, and i will assume you have something similar for
>> aggregations. But that's just abstractions inside a single map or reduce
>> phase, pretty low level stuff. What you really need is abstractions around
>> many map and reduce phases, because that is the level an algo is expressed
>> at.
>>
>> For example when doing logistic regression you want to be able to do
>> something like:
>> read("somefile").train(settings).write("model")
>> Here train is an externally defined method that is well tested and could
>> do many map and reduce steps internally (or even be defined at a higher
>> level and compile into those steps). What is the equivalent in hive? Copy
>> pasting crucial parts of the algo around while using udfs is just not the
>> same thing in terms of reusability and abstraction. It's the opposite of
>> keeping it DRY.
>> On Feb 3, 2016 1:06 AM, "Ryan Harris" 
>> wrote:
>>
>>> https://github.com/myui/hivemall
>>>
>>>
>>>
>>> as long as you are comfortable with java UDFs, the sky is really the
>>> limit...it's not for everyone and spark does have many advantages, but they
>>> are two tools that can complement each other in numerous ways.
>>>
>>>
>>>
>>> I don't know that there is necessarily a universal "better" for how to
>>> use spark as an execution engine (or if spark is necessarily the **best**
>>> execution engine for any given hive job).
>>>
>>>
>>>
>>> The reality is that once you start factoring in the numerous tuning
>>> parameters of the systems and jobs there probably isn't a clear answer.
>>> For some queries, the Catalyst optimizer may do a better job...is it going
>>> to do a better job with ORC based data? less likely IMO.
>>>
>>>
>>>
>>> *From:* Koert Kuipers [mailto:ko...@tresata.com]
>>> *Sent:* Tuesday, February 02, 2016 9:50 PM
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>
>>>
>>>
>>> yeah but have you ever seen someone write a real analytical program in
>>> hive? how? where are the basic abstractions to wrap up a large amount of
>>> operations (joins, groupby's) into a single function call? where are the
>>> tools to write nice unit test for that?
>>>
>>> for example in spark i can write a DataFrame => DataFrame that
>>> internally does many joins, groupBys and complex operations. all unit
>>> tested and perfectly re-usable. and in hive? copy paste round sql queries?
>>> that's just dangerous.
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
>>> wrote:
>>>
>>> Hive has numerous extension points, you are not boxed in by a long shot.
>>>
>>>
>>>
>>> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>>>
>>> uuuhm with spark using Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just being
>>> boxed into some version of sql and limited udfs?
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>>>
>>> When comparing the performance, you need to do it apple vs apple. In
>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>> SQL. However, you configured Hive such that only two tasks can run in
>>> parallel. However, you didn't provide information on how much Spark SQL is
>>> utilizing. Thus, it's hard to tell whether it's just a configuration
>>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>>> see the resource usage in YARN resource manage URL.
>>>
>>> --Xuefu
>>>
>>>

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Edward Capriolo
Hive is not the correct tool for every problem. Use the tool that makes the
most sense for your problem and your experience.

Many people like hive because it is generally applicable. In my case study
for the hive book I highlighted that many smart, capable organizations use hive.

Your argument is totally valid. You like X better because X works for you.
You don't need to 'preach' here; we all know hive has its limits.

On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers  wrote:

> Is the sky the limit? I know udfs can be used inside hive, like lambdas
> basically i assume, and i will assume you have something similar for
> aggregations. But that's just abstractions inside a single map or reduce
> phase, pretty low level stuff. What you really need is abstractions around
> many map and reduce phases, because that is the level an algo is expressed
> at.
>
> For example when doing logistic regression you want to be able to do
> something like:
> read("somefile").train(settings).write("model")
> Here train is an externally defined method that is well tested and could do
> many map and reduce steps internally (or even be defined at a higher level
> and compile into those steps). What is the equivalent in hive? Copy pasting
> crucial parts of the algo around while using udfs is just not the same
> thing in terms of reusability and abstraction. It's the opposite of keeping
> it DRY.
> On Feb 3, 2016 1:06 AM, "Ryan Harris" 
> wrote:
>
>> https://github.com/myui/hivemall
>>
>>
>>
>> as long as you are comfortable with java UDFs, the sky is really the
>> limit...it's not for everyone and spark does have many advantages, but they
>> are two tools that can complement each other in numerous ways.
>>
>>
>>
>> I don't know that there is necessarily a universal "better" for how to
>> use spark as an execution engine (or if spark is necessarily the **best**
>> execution engine for any given hive job).
>>
>>
>>
>> The reality is that once you start factoring in the numerous tuning
>> parameters of the systems and jobs there probably isn't a clear answer.
>> For some queries, the Catalyst optimizer may do a better job...is it going
>> to do a better job with ORC based data? less likely IMO.
>>
>>
>>
>> *From:* Koert Kuipers [mailto:ko...@tresata.com]
>> *Sent:* Tuesday, February 02, 2016 9:50 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>
>>
>>
>> yeah but have you ever seen someone write a real analytical program in
>> hive? how? where are the basic abstractions to wrap up a large amount of
>> operations (joins, groupby's) into a single function call? where are the
>> tools to write nice unit test for that?
>>
>> for example in spark i can write a DataFrame => DataFrame that internally
>> does many joins, groupBys and complex operations. all unit tested and
>> perfectly re-usable. and in hive? copy paste round sql queries? that's just
>> dangerous.
>>
>>
>>
>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
>> wrote:
>>
>> Hive has numerous extension points, you are not boxed in by a long shot.
>>
>>
>>
>> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>>
>> uuuhm with spark using Hive metastore you actually have a real
>> programming environment and you can write real functions, versus just being
>> boxed into some version of sql and limited udfs?
>>
>>
>>
>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>>
>> When comparing the performance, you need to do it apple vs apple. In
>> another thread, you mentioned that Hive on Spark is much slower than Spark
>> SQL. However, you configured Hive such that only two tasks can run in
>> parallel. However, you didn't provide information on how much Spark SQL is
>> utilizing. Thus, it's hard to tell whether it's just a configuration
>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>> see the resource usage in YARN resource manage URL.
>>
>> --Xuefu
>>
>>
>>
>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
>> wrote:
>>
>> Thanks Jeff.
>>
>>
>>
>> Obviously Hive is much more feature rich compared to Spark. Having said
>> that in certain areas for example where the SQL feature is available in
>> Spark, Spark seems to deliver faster.
>>
>>
>>
>> This may be:
>>
>>
>>
>> 1.Spark does both the optimisation an

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Koert Kuipers
Is the sky the limit? I know udfs can be used inside hive, like lambdas
basically i assume, and i will assume you have something similar for
aggregations. But that's just abstractions inside a single map or reduce
phase, pretty low level stuff. What you really need is abstractions around
many map and reduce phases, because that is the level an algo is expressed
at.

For example when doing logistic regression you want to be able to do
something like:
read("somefile").train(settings).write("model")
Here train is an externally defined method that is well tested and could do
many map and reduce steps internally (or even be defined at a higher level
and compile into those steps). What is the equivalent in hive? Copy pasting
crucial parts of the algo around while using udfs is just not the same
thing in terms of reusability and abstraction. It's the opposite of keeping
it DRY.
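The read/train/write chaining above can be sketched without any engine at all. A minimal Python illustration of the kind of reusable, unit-testable abstraction being described (all names here — Pipeline, read, train, write — are hypothetical, invented for illustration, not any real Spark or Hive API):

```python
# Hypothetical sketch of a composable pipeline, as described above.
class Pipeline:
    def __init__(self, rows):
        self.rows = rows  # the "dataset" flowing through the pipeline

    def train(self, settings):
        # Stands in for a well-tested, externally defined method that may
        # internally run many map/reduce-style steps; here it just computes
        # a per-key mean as a toy "model". `settings` is accepted but unused.
        totals = {}
        for key, value in self.rows:
            acc = totals.setdefault(key, [0.0, 0])
            acc[0] += value
            acc[1] += 1
        model = {k: s / n for k, (s, n) in totals.items()}
        return Pipeline(sorted(model.items()))

    def write(self, store, name):
        store[name] = self.rows  # persist the result under a name
        return store


def read(rows):
    return Pipeline(rows)


store = {}
read([("a", 1.0), ("a", 3.0), ("b", 4.0)]).train(settings={}).write(store, "model")
print(store["model"])  # [('a', 2.0), ('b', 4.0)]
```

The point being argued is that `train` can be unit-tested in isolation and reused across jobs, whereas expressing the same multi-phase logic in plain HiveQL tends to mean copy-pasting queries.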
On Feb 3, 2016 1:06 AM, "Ryan Harris"  wrote:

> https://github.com/myui/hivemall
>
>
>
> as long as you are comfortable with java UDFs, the sky is really the
> limit...it's not for everyone and spark does have many advantages, but they
> are two tools that can complement each other in numerous ways.
>
>
>
> I don't know that there is necessarily a universal "better" for how to use
> spark as an execution engine (or if spark is necessarily the **best**
> execution engine for any given hive job).
>
>
>
> The reality is that once you start factoring in the numerous tuning
> parameters of the systems and jobs there probably isn't a clear answer.
> For some queries, the Catalyst optimizer may do a better job...is it going
> to do a better job with ORC based data? less likely IMO.
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* Tuesday, February 02, 2016 9:50 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> yeah but have you ever seen someone write a real analytical program in
> hive? how? where are the basic abstractions to wrap up a large amount of
> operations (joins, groupby's) into a single function call? where are the
> tools to write nice unit test for that?
>
> for example in spark i can write a DataFrame => DataFrame that internally
> does many joins, groupBys and complex operations. all unit tested and
> perfectly re-usable. and in hive? copy paste round sql queries? that's just
> dangerous.
>
>
>
> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
> wrote:
>
> Hive has numerous extension points, you are not boxed in by a long shot.
>
>
>
> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>
> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Edward Capriolo
 INFO  : 2016-02-03 21:26:08,310 Stage-3 map = 47%,  reduce = 0%,
> Cumulative CPU 38.78 sec
>
> INFO  : 2016-02-03 21:26:11,408 Stage-3 map = 52%,  reduce = 0%,
> Cumulative CPU 40.7 sec
>
> INFO  : 2016-02-03 21:26:14,512 Stage-3 map = 56%,  reduce = 0%,
> Cumulative CPU 42.69 sec
>
> INFO  : 2016-02-03 21:26:17,607 Stage-3 map = 60%,  reduce = 0%,
> Cumulative CPU 44.69 sec
>
> INFO  : 2016-02-03 21:26:20,722 Stage-3 map = 64%,  reduce = 0%,
> Cumulative CPU 46.83 sec
>
> INFO  : 2016-02-03 21:26:22,787 Stage-3 map = 100%,  reduce = 0%,
> Cumulative CPU 48.46 sec
>
> INFO  : 2016-02-03 21:26:29,030 Stage-3 map = 100%,  reduce = 100%,
> Cumulative CPU 50.01 sec
>
> INFO  : MapReduce Total cumulative CPU time: 50 seconds 10 msec
>
> INFO  : Ended Job = job_1454534517374_0002
>
> ++-+-+--+
>
> | t.calendar_month_desc  | c.channel_desc  | totalsales  |
>
> ++-+-+--+
>
> ++-+-+--+
>
> 150 rows selected (85.67 seconds)
>
>
>
> *3)**Spark on Hive engine completes in 267 sec*
>
> spark-sql> SELECT t.calendar_month_desc, c.channel_desc,
> SUM(s.amount_sold) AS TotalSales
>
>  > FROM sales s, times t, channels c
>
>  > WHERE s.time_id = t.time_id
>
>  > AND   s.channel_id = c.channel_id
>
>  > GROUP BY t.calendar_month_desc, c.channel_desc
>
>  > ;
>
> Time taken: 267.138 seconds, Fetched 150 row(s)
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 03 February 2016 16:21
> *To:* user@hive.apache.org
> *Subject:* RE: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> OK thanks. These are my new ENV settings based upon the availability of
> resources
>
>
>
> export SPARK_EXECUTOR_CORES=12 ##, Number of cores for the workers
> (Default: 1).
>
> export SPARK_EXECUTOR_MEMORY=5G ## , Memory per Worker (e.g. 1000M, 2G)
> (Default: 1G)
>
> export SPARK_DRIVER_MEMORY=2G ## , Memory for Master (e.g. 1000M, 2G)
> (Default: 512 Mb)
>
>
>
> These are the new runs after these settings:
>
>
>
> *Spark on Hive (3 consecutive runs)*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 47.987 seconds, Fetched 3 row(s)
>
>
>
> Around 48 seconds
>
>
>
> *Hive on Spark 1.3.1*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default>  select * from dummy where id in
> (1, 5, 10);
>
> INFO  :
>
> Query Hive on Spark job[2] stages:
>
> INFO  : 2
>
> INFO  :
>
> Status: Running (Hive on Spark job[2])
>
> INFO  : Job Progress Format
>
> CurrentTime 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Stephen Sprague
 Cumulative CPU 40.7 sec
>
> INFO  : 2016-02-03 21:26:14,512 Stage-3 map = 56%,  reduce = 0%,
> Cumulative CPU 42.69 sec
>
> INFO  : 2016-02-03 21:26:17,607 Stage-3 map = 60%,  reduce = 0%,
> Cumulative CPU 44.69 sec
>
> INFO  : 2016-02-03 21:26:20,722 Stage-3 map = 64%,  reduce = 0%,
> Cumulative CPU 46.83 sec
>
> INFO  : 2016-02-03 21:26:22,787 Stage-3 map = 100%,  reduce = 0%,
> Cumulative CPU 48.46 sec
>
> INFO  : 2016-02-03 21:26:29,030 Stage-3 map = 100%,  reduce = 100%,
> Cumulative CPU 50.01 sec
>
> INFO  : MapReduce Total cumulative CPU time: 50 seconds 10 msec
>
> INFO  : Ended Job = job_1454534517374_0002
>
> ++-+-+--+
>
> | t.calendar_month_desc  | c.channel_desc  | totalsales  |
>
> ++-+-+--+
>
> ++-+-+--+
>
> 150 rows selected (85.67 seconds)
>
>
>
> *3)**Spark on Hive engine completes in 267 sec*
>
> spark-sql> SELECT t.calendar_month_desc, c.channel_desc,
> SUM(s.amount_sold) AS TotalSales
>
>  > FROM sales s, times t, channels c
>
>  > WHERE s.time_id = t.time_id
>
>  > AND   s.channel_id = c.channel_id
>
>  > GROUP BY t.calendar_month_desc, c.channel_desc
>
>  > ;
>
> Time taken: 267.138 seconds, Fetched 150 row(s)
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 03 February 2016 16:21
> *To:* user@hive.apache.org
> *Subject:* RE: Hive on Spark Engine versus Spark using Hive metastore
>
>
>

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Mich Talebzadeh
ted (85.67 seconds)

 

3)Spark on Hive engine completes in 267 sec

spark-sql> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS 
TotalSales

 > FROM sales s, times t, channels c

 > WHERE s.time_id = t.time_id

 > AND   s.channel_id = c.channel_id

 > GROUP BY t.calendar_month_desc, c.channel_desc

 > ;

Time taken: 267.138 seconds, Fetched 150 row(s)

 


 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: 03 February 2016 16:21
To: user@hive.apache.org
Subject: RE: Hive on Spark Engine versus Spark using Hive metastore

 


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Mich Talebzadeh
OK thanks. These are my new ENV settings based upon the availability of 
resources

 

export SPARK_EXECUTOR_CORES=12 ## Number of cores per executor (Default: 1)

export SPARK_EXECUTOR_MEMORY=5G ## Memory per executor (e.g. 1000M, 2G) (Default: 1G)

export SPARK_DRIVER_MEMORY=2G ## Memory for the driver (e.g. 1000M, 2G) (Default: 512 MB)

 

These are the new runs after these settings:

 

Spark on Hive (3 consecutive runs)

 

 

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 47.987 seconds, Fetched 3 row(s)

 

Around 48 seconds

 

Hive on Spark 1.3.1

 

0: jdbc:hive2://rhes564:10010/default>  select * from dummy where id in (1, 5, 
10);

INFO  :

Query Hive on Spark job[2] stages:

INFO  : 2

INFO  :

Status: Running (Hive on Spark job[2])

INFO  : Job Progress Format

CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]

INFO  : 2016-02-03 16:20:50,315 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:53,369 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:56,478 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:58,530 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:01,570 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:04,680 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:07,767 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:10,877 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:13,941 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:17,019 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:20,090 Stage-2_0: 3(+15)/18

INFO  : 2016-02-03 16:21:21,138 Stage-2_0: 6(+12)/18

INFO  : 2016-02-03 16:21:22,145 Stage-2_0: 10(+8)/18

INFO  : 2016-02-03 16:21:23,150 Stage-2_0: 14(+4)/18

INFO  : 2016-02-03 16:21:24,154 Stage-2_0: 17(+1)/18

INFO  : 2016-02-03 16:21:26,161 Stage-2_0: 18/18 Finished

INFO  : Status: Finished successfully in 36.88 seconds

+---+--+--+---+-+-++--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | 
dummy.random_string | dummy.small_vc  | dummy.padding  |

+---+--+--+---+-+-++--+

| 1 | 0| 0| 63| 
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  | 
xx |

| 5 | 0| 4| 31| 
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  | 
xx |

| 10| 99   | 999  | 188   | 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  | 
xx |

+---+--+--+---+-+-++--+

3 rows selected (37.161 seconds)

 

Around 37 seconds

 

Interesting results

 

 


 

From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
Sent: 03 February 2016 12:47
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark usi

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Koert Kuipers

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Edward Capriolo
: Status: Finished successfully in 76.67 seconds
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  |
>>>>>> dummy.randomised  | dummy.random_string |
>>>>>> dummy.small_vc  | dummy.padding  |
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> | 1 | 0| 0| 63
>>>>>> | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>>>>>> xx |
>>>>>>
>>>>>> | 5 | 0| 4| 31
>>>>>> | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>>>>> xx |
>>>>>>
>>>>>> | 10| 99   | 999  | 188
>>>>>> | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>>>>> xx |
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> 3 rows selected (76.835 seconds)
>>>>>>
>>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>>> in (1, 5, 10);
>>>>>>
>>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  |
>>>>>> dummy.randomised  | dummy.random_string |
>>>>>> dummy.small_vc  | dummy.padding  |
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-+------------+--+
>>>>>>
>>>>>> | 1 | 0| 0| 63
>>>>>> | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>>>>>> xx |
>>>>>>
>>>>>> | 5 | 0| 4| 31
>>>>>> | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>>>>> xx |
>>>>>>
>>>>>> | 10| 99   | 999  | 188
>>>>>> | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>>>>> xx |
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> 3 rows selected (80.718 seconds)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Three runs returning the same rows in 80 seconds.
>>>>>>
>>>>>>
>>>>>>
>>>>>> It is possible that my Spark engine used with Hive is 1.3.1, which is out
>>>>>> of date, and that this causes the lag.
>>>>>>
>>>>>>
>>>>>>
>>>>>> There are certain queries that one cannot run with Spark. Besides, it
>>>>>> does not recognize CHAR fields, which is a pain.
>>>>>>
>>>>>>
>>>>>>
>>>>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>>>>
>>>>>>  > SELECT t.calendar_month_desc, c.channel_desc,
>>>>>> SUM(s.amount_sold) AS TotalSales
>>>>>>
>>>>>>  > FROM sales s, times t, channels c
>>>>>>
>>>>>>  >

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Xuefu Zhang
In YARN or standalone mode, you can set spark.executor.cores to utilize all
cores on the node. You can also set spark.executor.memory to allocate
memory for Spark to use. Once you do this, you may only have two executors
to run your map tasks, but each core in each executor can take up one task,
increasing parallelism. With this, the eventual limit may come down to
the bandwidth of the disks in your cluster.

Having said that, a two-node cluster isn't really big enough for a
performance benchmark. Nevertheless, you still need to configure it properly
to make full use of the cluster.

--Xuefu
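As an illustration of the settings described above, they might be supplied to a Spark job like this (a sketch only; the application jar and the exact values are placeholder assumptions, not taken from this thread):

```shell
# Hypothetical spark-submit invocation: with two worker nodes, asking for
# 12 cores per executor lets up to 24 map tasks run concurrently.
spark-submit \
  --master yarn \
  --conf spark.executor.cores=12 \
  --conf spark.executor.memory=5g \
  --conf spark.driver.memory=2g \
  my_app.jar   # placeholder application
```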

On Wed, Feb 3, 2016 at 1:25 AM, Mich Talebzadeh  wrote:

> Hi Jeff,
>
>
>
> I only have a two node cluster. Is there anyway one can simulate
> additional parallel runs in such an environment thus having more than two
> maps?
>
>
>
> thanks
>
>
>
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 03 February 2016 02:39
>
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> Yes, regardless of which Spark mode you're running in, the Spark AM web UI
> should show you how many tasks are running concurrently. I'm a
> little surprised to see that your Hive configuration only allows 2 map
> tasks to run in parallel. If your cluster has the capacity, you should
> parallelize all the tasks to achieve optimal performance. Since I don't
> know your Spark SQL configuration, I cannot tell how much parallelism you
> have over there. Thus, I'm not sure if your comparison is valid.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh 
> wrote:
>
> Hi Jeff,
>
>
>
> In below
>
>
>
> …. You should be able to see the resource usage in the YARN resource
> manager URL.
>
>
>
> Just to be clear we are talking about Port 8088/cluster?
>
>
>

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Mich Talebzadeh
Hi all,

 

 

Thanks for all the comments on this thread.

 

The question I put was simply to clarify, technically, the two approaches: Spark 
on the Hive metastore versus Hive using the Spark engine.

 

The fact that we have the benefits of both Hive and Spark is tremendous. They 
each offer many opportunities in their own way.

 

Hive is billed as a Data Warehouse (DW) on HDFS. In that respect it does a 
good job. Among other things, it allows developers who are familiar with SQL to be 
productive immediately. This should not be underestimated. You can set up a 
copy of your RDBMS table in Hive in no time and use Sqoop to get the table data 
into the Hive table with practically one command. For many this is the great 
attraction of Hive, which can be summarised as:

 

* Leverage existing SQL skills on Big Data.

* You have a choice of metastore for Hive, including MySQL, Oracle, 
Sybase and others.

* You have a choice of plug-ins for your engine (MR, Spark, Tez).

* Ability to do real-time analytics on Hive by sending real-time 
transactional movements from RDBMS tables to Hive via existing replication 
technologies. This is very useful.

* Use Sqoop to push data back to the DW or an RDBMS table.
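For example, the one-command Sqoop load mentioned above might look like this (a sketch; the JDBC URL, credentials and table names are placeholders, not from this thread):

```shell
# Hypothetical Sqoop import: copy an RDBMS table straight into a Hive table.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table SALES \
  --hive-import \
  --create-hive-table \
  --hive-table default.sales
```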

 

One can argue that in a DW the speed is not necessarily the overriding factor. 
It does not matter much whether a job finishes in two hours or in 2.5 hours. Granted, 
some commercial DW solutions can do the job much faster, but at what cost in 
terms of multiplexing and licensing fees? Hive is an attractive 
proposition here.

 

I can live with most of Hive's shortcomings but would like to see the following:

 

* Hive has the ability to create multiple EXTERNAL index types on 
columns, but they are never actually used. It would be great if they 
could be put to the use they are intended for; that would speed up processing.

* It would be awesome to have some dialect of isql or PL/SQL 
capabilities that allow local variables, conditional statements etc. to 
be used in Hive, much like other DWs, without using shell scripting, Pig and other 
tools.

 

 

Spark is great, especially for those familiar with Scala and other languages 
(an additional skill set) who can leverage the Spark shell. However, again it comes 
at the price of needing ample memory, which is not always available. Point 
queries are great, but if you bring back tons of rows then the performance 
degrades as it has to spill to disk.

 

The Big Data space is getting crowded with a lot of products and auxiliary 
products. I can see the potential of Spark for exploratory work. Having 
said that, in fairness, Hive as a Data Warehouse does what it says on the tin.

 

 

Thanks again

 

 


 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: 03 February 2016 09:25
To: user@hive.apache.org
Subject: RE: Hive on Spark Engine versus Spark using Hive metastore

 

Hi Jeff,

 

I only have a two node cluster. Is there anyway one can simulate additional 
parallel runs in such an environment thus having more than two maps?

 

thanks

 


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Mich Talebzadeh
Hi Jeff,

 

I only have a two node cluster. Is there anyway one can simulate additional 
parallel runs in such an environment thus having more than two maps?

 

thanks

 


 

From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
Sent: 03 February 2016 02:39
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

Yes, regardless of which Spark mode you're running in, the Spark AM web UI 
should show you how many tasks are running concurrently. I'm a little 
surprised to see that your Hive configuration only allows 2 map tasks to run in 
parallel. If your cluster has the capacity, you should parallelize all the 
tasks to achieve optimal performance. Since I don't know your Spark SQL 
configuration, I cannot tell how much parallelism you have over there. Thus, 
I'm not sure if your comparison is valid.

--Xuefu
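To act on this advice, the Hive-on-Spark session could request more parallelism before re-running the test query, for example (the JDBC URL and query are from the thread; the property values are illustrative assumptions, not settings confirmed here):

```shell
# Hypothetical beeline session: raise executor cores/memory so more than
# two map tasks can run at once, then re-run the test query.
beeline -u jdbc:hive2://rhes564:10010/default -e "
set hive.execution.engine=spark;
set spark.executor.cores=12;
set spark.executor.memory=5g;
select * from dummy where id in (1, 5, 10);"
```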

 

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi Jeff,

 

In below

 

…. You should be able to see the resource usage in the YARN resource manager URL.

 

Just to be clear we are talking about Port 8088/cluster?

 


 

From: Koert Kuipers [mailto:ko...@tresata.com] 
Sent: 03 February 2016 00:09

To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

Uhm, with Spark using the Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of SQL and limited UDFs?

 

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:

When comparing performance, you need to compare apples to apples. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel, 
and you didn't provide information on how many resources Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or Spark SQL is indeed faster. You should be able to see the resource usage in 
the YARN resource manager URL.

--Xuefu

 

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@per

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Jörn Franke
>>>>> 3 rows selected (76.835 seconds)
>>>>> 
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in 
>>>>> (1, 5, 10);
>>>>> 
>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>> 
>>>>> +---+--+--+---+-+-++--+
>>>>> 
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |   
>>>>>   dummy.random_string | dummy.small_vc  | 
>>>>> dummy.padding  |
>>>>> 
>>>>> +---+--+--+---+-+-++--+
>>>>> 
>>>>> | 1 | 0| 0| 63| 
>>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  | 
>>>>> xx |
>>>>> 
>>>>> | 5 | 0| 4| 31| 
>>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  | 
>>>>> xx |
>>>>> 
>>>>> | 10| 99   | 999  | 188   | 
>>>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000  | 
>>>>> xxxxxx |
>>>>> 
>>>>> +---+--+--+---+-+-++--+
>>>>> 
>>>>> 3 rows selected (80.718 seconds)
>>>>> 
>>>>>  
>>>>> 
>>>>> Three runs returning the same rows in 80 seconds.
>>>>> 
>>>>>  
>>>>> 
>>>>> It is possible that my Spark engine used with Hive is 1.3.1, which is out of 
>>>>> date, and that this causes the lag.
>>>>> 
>>>>>  
>>>>> 
>>>>> There are certain queries that one cannot run with Spark. Besides, it does 
>>>>> not recognize CHAR fields, which is a pain.
>>>>> 
>>>>>  
>>>>> 
>>>>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>>>> 
>>>>>  > SELECT t.calendar_month_desc, c.channel_desc, 
>>>>> SUM(s.amount_sold) AS TotalSales
>>>>> 
>>>>>  > FROM sales s, times t, channels c
>>>>> 
>>>>>  > WHERE s.time_id = t.time_id
>>>>> 
>>>>>  > AND   s.channel_id = c.channel_id
>>>>> 
>>>>>  > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>> 
>>>>>  > ;
>>>>> 
>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>>> 
>>>>> .
>>>>> 
>>>>> You are likely trying to use an unsupported Hive feature.";
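Since CREATE TEMPORARY TABLE ... AS SELECT is rejected here, one possible workaround (a sketch, not something verified in this thread) is to materialise the result as an ordinary table instead and drop it when done:

```shell
# Hypothetical workaround: use a plain CTAS, which this Spark SQL version
# does accept, in place of the unsupported TEMPORARY variant.
spark-sql -e "
CREATE TABLE tmp AS
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
AND   s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc;"
```

The table would then need an explicit DROP TABLE afterwards, unlike a true temporary table.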
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>>  
>>>>> 
>>>>> From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
>>>>> Sent: 02 February 2016 23:12
>>>>> To: user@hive.apache.org
>>>>> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>> 
>>>>>  
>>>>> 
>>>>> I think the diff is not only about which does optimization but more on 
>>>>> feature parity. Hive on Spark offers all functional features that Hive 
>>>>> offers and these features play out faster. However, Spark SQL is far from 
>>>>> offering this parity as far as I know.
>>>>> 
>>>>>  
>>>>> 
>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh  
>>>>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>>  
>>>>> 
>>>>> My understanding is that with Hive on Spark engine, one gets the Hive 
>>>>> optimizer and Spark query engine
>>>>> 
>>>>>  
>>>>> 
>>>>> With spark using Hive metastore, Spark does both the optimization and 
>>>>> query engine. The only value add is that one can access the underlying 
>>>>> Hive tables from spark-sql etc
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Is this assessment correct?
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Thanks
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>> 
>> 
>> -- 
>> Sorry this was sent from mobile. Will do less grammar and spell check than 
>> usual.
> 


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Ryan Harris
https://github.com/myui/hivemall

As long as you are comfortable with Java UDFs, the sky is really the limit. It's not for 
everyone, and Spark does have many advantages, but they are two tools that can 
complement each other in numerous ways.
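To illustrate the Java-UDF point: any JVM class extending Hive's `UDF` base class can be plugged into a query, and Scala compiles to the same bytecode. A minimal sketch (the class, function, and jar names are illustrative, not from this thread):

```scala
// Minimal Hive UDF sketch in Scala; behaves exactly like a Java UDF.
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

class StripPadding extends UDF {
  // Hive resolves evaluate() by reflection; null-safe by convention
  def evaluate(s: Text): Text =
    if (s == null) null else new Text(s.toString.trim)
}

// After packaging into a jar, register from the Hive CLI or beeline:
//   ADD JAR /path/to/udfs.jar;
//   CREATE TEMPORARY FUNCTION strip_padding AS 'StripPadding';
//   SELECT strip_padding(padding) FROM dummy LIMIT 3;
```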

I don't know that there is necessarily a universal "better" for how to use 
Spark as an execution engine (or whether Spark is necessarily the *best* execution 
engine for any given Hive job).

The reality is that once you start factoring in the numerous tuning parameters 
of the systems and jobs, there probably isn't a clear answer.  For some queries 
the Catalyst optimizer may do a better job; is it going to do a better job 
with ORC-based data? Less likely, IMO.

From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Tuesday, February 02, 2016 9:50 PM
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

yeah, but have you ever seen someone write a real analytical program in hive? 
how? where are the basic abstractions to wrap up a large number of operations 
(joins, group-bys) into a single function call? where are the tools to write 
nice unit tests for that?
for example in spark i can write a DataFrame => DataFrame function that internally does 
many joins, groupBys and complex operations. all unit tested and perfectly 
re-usable. and in hive? copy and paste SQL queries around? that's just dangerous.
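Koert's point can be sketched as a reusable, unit-testable DataFrame transformation. This is a hedged illustration, not code from the thread; the table and column names mirror the sales/times/channels query quoted earlier, and it assumes the Spark 1.x DataFrame API:

```scala
// A DataFrame => DataFrame function bundling several joins and an aggregation,
// re-usable across jobs and testable against small hand-built inputs.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

def monthlySales(sales: DataFrame, times: DataFrame, channels: DataFrame): DataFrame =
  sales
    .join(times, "time_id")           // equi-join on the shared column
    .join(channels, "channel_id")
    .groupBy("calendar_month_desc", "channel_desc")
    .agg(sum("amount_sold").as("TotalSales"))

// In a unit test, feed tiny fixture DataFrames and assert on the result:
//   val result = monthlySales(tinySales, tinyTimes, tinyChannels)
//   assert(result.columns.contains("TotalSales"))
```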

On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
mailto:edlinuxg...@gmail.com>> wrote:
Hive has numerous extension points; you are not boxed in by a long shot.


On Tuesday, February 2, 2016, Koert Kuipers 
mailto:ko...@tresata.com>> wrote:
uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
When comparing the performance, you need to do it apple vs apple. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel, 
and you didn't provide information on how much Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or Spark SQL is indeed faster. You should be able to see the resource usage in 
the YARN resource manager URL.
--Xuefu

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh  wrote:
Thanks Jeff.

Obviously Hive is much more feature-rich compared to Spark. Having said that, in 
certain areas, for example where a SQL feature is available in Spark, Spark 
seems to deliver results faster.

This may be:


1. Spark does both the optimisation and execution seamlessly

2. Hive on Spark has to invoke YARN, which adds another layer to the process

Now I did some simple tests, against both, on a 100-million-row ORC table available 
through Hive.

Spark 1.5.2 on Hive 1.2.1 Metastore


spark-sql> select * from dummy where id in (1, 5, 10);
1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx
5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx
10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx
Time taken: 50.805 seconds, Fetched 3 row(s)
spark-sql> select * from dummy where id in (1, 5, 10);
1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx
5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx
10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx
Time taken: 50.358 seconds, Fetched 3 row(s)
spark-sql> select * from dummy where id in (1, 5, 10);
1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx
5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx
10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx
Time taken: 50.563 seconds, Fetched 3 row(s)

So three runs, each returning three rows in just over 50 seconds

Hive 1.2.1 on Spark 1.3.1 execution engine

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);
INFO  :
Query Hive on Spark job[4] stages:
INFO  : 4
INFO  :
Status: Running (Hive on Spark job[4])
INFO  : Status: Finished successfully in 82.49 seconds
+---+--+--+---+-+-++--+
| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
yes. the ability to start with sql but, when needed, expand into more full-blown
programming languages, machine learning etc. is a huge plus. after
all this is a cluster, and just querying or extracting data to move it off
the cluster into some other analytics tool is going to be very inefficient
and to some extent defeats the purpose of having a cluster. so you want
the capability to do more than queries and etl. and spark is that
ticket. hive is simply not. well, not for anything somewhat complex anyhow.


On Tue, Feb 2, 2016 at 8:06 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> Are you referring to spark-shell with Scala, Python and others?
>
>
>
>
>

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
Yes, regardless of what Spark mode you're running in, from the Spark AM web UI you
should be able to see how many tasks are concurrently running. I'm a little
surprised to see that your Hive configuration only allows 2 map tasks to
run in parallel. If your cluster has the capacity, you should parallelize
all the tasks to achieve optimal performance. Since I don't know your Spark
SQL configuration, I cannot tell how much parallelism you have over there.
Thus, I'm not sure if your comparison is valid.

--Xuefu
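For reference, executor parallelism for Hive on Spark is controlled through Spark properties set from within Hive. A hedged config fragment; the property names are the documented Spark-on-YARN ones, but the values are purely illustrative and depend on cluster capacity:

```sql
-- Illustrative settings, issued from the Hive CLI or beeline.
set hive.execution.engine=spark;
set spark.executor.instances=8;   -- executors requested from YARN
set spark.executor.cores=4;       -- concurrent tasks per executor
set spark.executor.memory=4g;     -- heap per executor
```

With settings like these, the Spark AM web UI shows the resulting task concurrency, which is what makes an apples-to-apples comparison possible.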

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh  wrote:

> Hi Jeff,
>
>
>
> In below
>
>
>
> …. You should be able to see the resource usage in YARN resource manage
> URL.
>
>
>
> Just to be clear we are talking about Port 8088/cluster?
>
>
>

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Edward Capriolo


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi Jeff,

 

In your message below:

 

…. You should be able to see the resource usage in the YARN resource manager URL.

 

Just to be clear, are we talking about port 8088 (/cluster)?

 


From: Koert Kuipers [mailto:ko...@tresata.com] 
Sent: 03 February 2016 00:09
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

Uhm, with Spark using the Hive metastore you actually have a real programming
environment and can write real functions, versus just being boxed into some
version of SQL and limited UDFs?
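
Koert's point, that driving Hive tables from Spark gives you a full programming model rather than SQL plus UDFs, can be sketched as follows. This is a hypothetical Spark 1.x Scala snippet, not code from the thread; the `dummy` table is the one used elsewhere in this discussion, and the column positions and types are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object MetastoreSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("metastore-sketch"))
    // HiveContext reads the same Hive metastore that HiveServer2 uses
    val hc = new HiveContext(sc)

    // Start in SQL against a Hive table...
    val rows = hc.sql("SELECT id, random_string FROM dummy WHERE id <= 10")

    // ...then drop into arbitrary Scala: any function of the row is fair
    // game here, with no UDF registration or SQL dialect limits.
    val lengths = rows.map(r => (r.getInt(0), r.getString(1).length))
    lengths.take(3).foreach(println)
  }
}
```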

 

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:

When comparing the performance, you need to compare apples to apples. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel, and 
you didn't provide information on how much resource Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
setup or whether Spark SQL is indeed faster. You should be able to see the 
resource usage in the YARN resource manager URL.

--Xuefu

 

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Thanks Jeff.

 

Obviously Hive is much more feature-rich than Spark. Having said that, in 
certain areas, for example where a SQL feature is available in Spark, Spark 
seems to deliver it faster.

 

This may be:

 

1. Spark does both the optimisation and the execution seamlessly

2. Hive on Spark has to invoke YARN, which adds another layer to the process

 

Now I ran some simple tests from both against a 100-million-row ORC table 
available through Hive.

 

Spark 1.5.2 on Hive 1.2.1 Metastore

 

 

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.805 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.358 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.563 seconds, Fetched 3 row(s)

 

So: three runs, each returning three rows in just over 50 seconds.

 

Hive 1.2.1 on the Spark 1.3.1 execution engine

 

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);

INFO  :

Query Hive on Spark job[4] stages:

INFO  : 4

INFO  :

Status: Running (Hive on Spark job[4])

INFO  : Status: Finished successfully in 82.49 seconds


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi,

 

Are you referring to spark-shell with Scala, Python and others? 

 

 


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
INFO  : Status: Finished successfully in 80.54 seconds

+---+--+--+---+-+-++--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | 
dummy.random_string | dummy.small_vc  | dummy.padding  |

+---+--+--+---+-+-++--+

| 1 | 0| 0| 63| 
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  | 
xx |

| 5 | 0| 4| 31| 
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  | 
xx |

| 10| 99   | 999  | 188   | 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  | 
xx |

+---+--+--+---+-+-++--+

3 rows selected (80.718 seconds)

 

Three runs, each returning the same rows in about 80 seconds.

 

It is possible that my Spark engine for Hive is version 1.3.1, which is out of 
date, and that this causes the lag.

 

There are certain queries that one cannot run with Spark SQL. Besides, it does 
not recognize CHAR fields, which is a pain.

 

spark-sql> CREATE TEMPORARY TABLE tmp AS

 > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS 
TotalSales

 > FROM sales s, times t, channels c

 > WHERE s.time_id = t.time_id

 > AND   s.channel_id = c.channel_id

 > GROUP BY t.calendar_month_desc, c.channel_desc

 > ;

Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7

.

You are likely trying to use an unsupported Hive feature.";
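
A hedged workaround for the unsupported CTAS-style CREATE TEMPORARY TABLE: build the aggregate as a DataFrame and register it as an in-memory temporary table instead. This is an untested sketch assuming Spark 1.5.x with a HiveContext; the table and column names are taken from the failing query above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TempTableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tmp-table-sketch"))
    val hc = new HiveContext(sc)

    // Same aggregate as the failing CTAS, kept as a DataFrame instead
    val tmp = hc.sql(
      """SELECT t.calendar_month_desc, c.channel_desc,
        |       SUM(s.amount_sold) AS TotalSales
        |FROM sales s, times t, channels c
        |WHERE s.time_id = t.time_id
        |AND   s.channel_id = c.channel_id
        |GROUP BY t.calendar_month_desc, c.channel_desc""".stripMargin)

    // Queryable as "tmp" for the lifetime of this context, no CTAS needed
    tmp.registerTempTable("tmp")
    hc.sql("SELECT COUNT(*) FROM tmp").show()
  }
}
```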

 

 

 

 

 


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Philip Lee
From my experience, Spark SQL has its own optimizer to support Hive queries
and the Hive metastore. Since Spark 1.5.2, its optimizer is named Catalyst.
On 3 Feb 2016 at 12:12 AM, "Xuefu Zhang" wrote:

> I think the diff is not only about which does optimization but more on
> feature parity. Hive on Spark offers all functional features that Hive
> offers and these features play out faster. However, Spark SQL is far from
> offering this parity as far as I know.
>

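Philip's point about Catalyst can be inspected directly: Spark SQL will print its analyzed, optimized, and physical plans for a query. A minimal sketch, assuming a Spark 1.5.x HiveContext and the `dummy` table from earlier in the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ExplainSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explain-sketch"))
    val hc = new HiveContext(sc)

    // explain(true) dumps Catalyst's logical and physical plans to stdout
    hc.sql("SELECT * FROM dummy WHERE id IN (1, 5, 10)").explain(true)
  }
}
```

In the spark-sql shell the same plans should be visible with `EXPLAIN EXTENDED SELECT ...`.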

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
I think the difference is not only about which layer does the optimization, but
more about feature parity. Hive on Spark offers all the functional features
that Hive offers, and those features now run faster. However, as far as I know,
Spark SQL is far from offering that parity.

On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh wrote:

>
>