Re: hive with spark backend engine

2022-06-27 Thread Rob Verkuylen
Hive on Spark has never been a one-to-one replacement for Hive on MR. Hive
on Spark has/had several issues and could not fully cover all use cases.
It has therefore pretty much been deprecated in favour of Hive on Tez,
which offers the scalability advantages of Hive on MR with much of the
performance that Spark could bring, plus much better integration with the
query planner and its optimizations.
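
For completeness, the engine is just a per-session Hive setting, so trying Tez on an
existing workload is cheap. A minimal sketch (the table name below is made up):

    $ hive -e 'set hive.execution.engine=tez; select count(*) from some_db.some_table;'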

On Tue, Jun 21, 2022 at 12:52 PM Yong Walt  wrote:

> we have been running hive3 with tez engine.
>
> On Tue, Jun 21, 2022 at 9:19 AM second_co...@yahoo.com <
> second_co...@yahoo.com> wrote:
>
>> Hello team,
>>
>> The default Hive engine is Hadoop MapReduce. Has anyone successfully
>> swapped the engine by running a Spark operator/cluster? Any guide
>> or example on this?
>>
>> Thank you,
>> Teoh
>>
>>
>>
>>


Re: hive with spark backend engine

2022-06-21 Thread Yong Walt
we have been running hive3 with tez engine.

On Tue, Jun 21, 2022 at 9:19 AM second_co...@yahoo.com <
second_co...@yahoo.com> wrote:

> Hello team,
>
> The default Hive engine is Hadoop MapReduce. Has anyone successfully
> swapped the engine by running a Spark operator/cluster? Any guide
> or example on this?
>
> Thank you,
> Teoh
>
>
>
>


RE: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Manu Jacob
Thank you so much Mich! Although a bit older, this is the most detailed 
comparison I’ve read on the subject. Thanks again.

Regards,
-Manu

From: Mich Talebzadeh 
Sent: Tuesday, October 06, 2020 12:37 PM
To: user 
Subject: Re: Hive using Spark engine vs native spark with hive integration.


Hi Manu,

In the past (July 2016), I gave a presentation, organised by the then Hortonworks, in 
London titled "Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!"

The PDF presentation is here:
https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf
With the caveat that it was more than four years ago!

However, as of today I would recommend writing the code in Spark with Scala and 
running against Spark. You can try it using spark-shell to start with.

If you are reading from a Hive table or any other source such as CSV, there are 
plenty of examples on the Spark website: https://spark.apache.org/examples.html

I also suggest that you use Scala, as Spark itself is written in Scala (though 
Python is more popular with the data science crowd).

HTH




LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Tue, 6 Oct 2020 at 16:47, Manu Jacob 
mailto:manu.ja...@sas.com>> wrote:
Hi All,

Not sure if I need to ask this question on the Hive community or the Spark community.

We have a set of Hive scripts that run on EMR (Tez engine). We would like to 
experiment by moving some of them onto Spark. We are planning to experiment with 
two options.

  1.  Use the current code based on HQL, with the engine set to Spark.
  2.  Write pure Spark code in Scala/Python using Spark SQL and Hive integration.

The first approach helps us transition to Spark quickly, but we are not sure if it 
is the best approach in terms of performance.  We could not find any reasonable 
comparisons of these two approaches.  It looks like writing pure Spark code 
gives us more control to add logic and also to control some of the performance 
features, for example caching/eviction.
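
For illustration, option 1 is essentially a configuration change rather than a code
change; a minimal sketch, with a made-up script name:

    $ hive --hiveconf hive.execution.engine=spark -f existing_etl_script.hql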


Any advice on this is much appreciated.


Thanks,
-Manu


Re: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Mich Talebzadeh
Hi Manu,

In the past (July 2016), I gave a presentation, organised by the then
Hortonworks, in London titled "Query Engines for Hive: MR, Spark, Tez with
LLAP – Considerations!"

The PDF presentation is here:
https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf
With the caveat that it was more than four years ago!

However, as of today I would recommend writing the code in Spark with Scala
and running against Spark. You can try it using spark-shell to start with.

If you are reading from a Hive table or any other source such as CSV, there
are plenty of examples on the Spark website: https://spark.apache.org/examples.html

I also suggest that you use Scala, as Spark itself is written in Scala
(though Python is more popular with the data science crowd).
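
As a minimal sketch of the spark-shell route - assuming your Spark build has Hive
support and can see your hive-site.xml, and with made-up database/table names:

    $ spark-shell --master yarn <<'EOF'
    val df = spark.sql("SELECT date_key, count(*) AS cnt FROM some_db.some_table GROUP BY date_key")
    df.show(20)
    EOF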

HTH



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 6 Oct 2020 at 16:47, Manu Jacob  wrote:

> Hi All,
>
>
>
> Not sure if I need to ask this question on hive community or spark
> community.
>
>
>
> We have a set of hive scripts that runs on EMR (Tez engine). We would like
> to experiment by moving some of it onto Spark. We are planning to
> experiment with two options.
>
>
>1. Use the current code based on HQL, with engine set as spark.
>2. Write pure spark code in scala/python using SparkQL and hive
>integration.
>
>
>
> The first approach helps us to transition to Spark quickly but not sure if
> this is the best approach in terms of performance.  Could not find any
> reasonable comparisons of this two approaches.  It looks like writing pure
> Spark code, gives us more control to add logic and also control some of the
> performance features, for example things like caching/evicting etc.
>
>
>
>
>
> Any advise on this is much appreciated.
>
>
>
>
>
> Thanks,
>
> -Manu
>


Re: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread 刘虓
hi,
if you are already running Hive with Tez, the performance gain from switching
to Spark won't be obvious.
I'd recommend experimenting with Spark on something new until a better
understanding is formed.

Manu Jacob 于2020年10月6日 周二23:47写道:

> Hi All,
>
>
>
> Not sure if I need to ask this question on hive community or spark
> community.
>
>
>
> We have a set of hive scripts that runs on EMR (Tez engine). We would like
> to experiment by moving some of it onto Spark. We are planning to
> experiment with two options.
>
>
>1. Use the current code based on HQL, with engine set as spark.
>2. Write pure spark code in scala/python using SparkQL and hive
>integration.
>
>
>
> The first approach helps us to transition to Spark quickly but not sure if
> this is the best approach in terms of performance.  Could not find any
> reasonable comparisons of this two approaches.  It looks like writing pure
> Spark code, gives us more control to add logic and also control some of the
> performance features, for example things like caching/evicting etc.
>
>
>
>
>
> Any advise on this is much appreciated.
>
>
>
>
>
> Thanks,
>
> -Manu
>


Re: hive on spark - why is it so hard?

2017-10-02 Thread Jörn Franke
You should try with TEZ+LLAP.

Additionally you will need to compare different configurations.

Finally, any generic comparison is meaningless:
you should use the queries, data and file formats that your users will actually be using later.
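
One way to make such a comparison concrete is simply to time the same representative
workload under each engine; a rough sketch, where user_query.hql stands in for one of
your users' real queries:

    $ time hive --hiveconf hive.execution.engine=tez   -f user_query.hql
    $ time hive --hiveconf hive.execution.engine=spark -f user_query.hql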

> On 2. Oct 2017, at 03:06, Stephen Sprague  wrote:
> 
> so...  i made some progress after much copying of jar files around (as 
> alluded to by Gopal previously on this thread).
> 
> 
> following the instructions here: 
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> 
> and doing this as instructed will leave off about a dozen or so jar files 
> that spark'll need:
>   ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz 
> "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
> 
> i ended copying the missing jars to $SPARK_HOME/jars but i would have 
> preferred to just add a path(s) to the spark class path but i did not find 
> any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH but 
> i don't see the analagous var in spark - i don't think it inherits the hive 
> classpath.
> 
> anyway a simple query is now working under Hive On Spark so i think i might 
> be over the hump.  Now its a matter of comparing the performance with Tez.
> 
> Cheers,
> Stephen.
> 
> 
>> On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague  wrote:
>> ok.. getting further.  seems now i have to deploy hive to all nodes in the 
>> cluster - don't think i had to do that before but not a big deal to do it 
>> now.
>> 
>> for me:
>> HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
>> SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>> 
>> on all three nodes now.
>> 
>> i started spark master on the namenode and i started spark slaves (2) on two 
>> datanodes of the cluster. 
>> 
>> so far so good.
>> 
>> now i run my usual test command.
>> 
>> $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set 
>> hive.execution.engine=spark; select date_key, count(*) from 
>> fe_inventory.merged_properties_hist group by 1 order by 1;'
>> 
>> i get a little further now and find the stderr from the Spark Web UI 
>> interface (nice) and it reports this:
>> 
>> 17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to 
>> spark://Worker@172.19.79.127:40145
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:483)
>>  at 
>> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>>  at 
>> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>> Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>>  at 
>> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:47)
>>  at 
>> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:134)
>>  at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>>  ... 6 more
>> 
>> 
>> searching around the internet i find this is probably a compatibility issue.
>> 
>> i know. i know. no surprise here.  
>> 
>> so i guess i just got to the point where everybody else is... build spark 
>> w/o hive. 
>> 
>> lemme see what happens next.
>> 
>> 
>> 
>> 
>> 
>>> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague  wrote:
>>> thanks.  I haven't had a chance to dig into this again today but i do 
>>> appreciate the pointer.  I'll keep you posted.
>>> 
 On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar  
 wrote:
 You can try increasing the value of hive.spark.client.connect.timeout. 
 Would also suggest taking a look at the HoS Remote Driver logs. The driver 
 gets launched in a YARN container (assuming you are running Spark in 
 yarn-client mode), so you just have to find the logs for that container.
 
 --Sahil
 
> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague  
> wrote:
> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's 
> where i'm at now.
> 
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: 
> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:   
> "action" : "CreateSubmissionResponse",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:   
> "message" : "Driver successfully submitted as driver-20170926211038-0003",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:   
> "serverSparkVersion" : "2.2.0",
> 

Re: hive on spark - why is it so hard?

2017-10-01 Thread Stephen Sprague
so...  i made some progress after much copying of jar files around (as
alluded to by Gopal previously on this thread).


following the instructions here:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

and doing this as instructed will leave off about a dozen or so jar files
that spark'll need:
  ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

i ended up copying the missing jars to $SPARK_HOME/jars but i would have
preferred to just add a path(s) to the spark class path - i did not find
any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH
but i don't see the analogous var in spark - i don't think it inherits the
hive classpath.
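
(untested thought on the classpath point: spark does have extraClassPath properties
that might serve a purpose similar to HIVE_AUX_JARS_PATH - the jar directory below is
made up:

    $ cat >> $SPARK_HOME/conf/spark-defaults.conf <<'EOF'
    spark.driver.extraClassPath=/opt/hive-aux-jars/*
    spark.executor.extraClassPath=/opt/hive-aux-jars/*
    EOF
)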

anyway a simple query is now working under Hive on Spark so i think i might
be over the hump.  Now it's a matter of comparing the performance with Tez.

Cheers,
Stephen.


On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague  wrote:

> ok.. getting further.  seems now i have to deploy hive to all nodes in the
> cluster - don't think i had to do that before but not a big deal to do it
> now.
>
> for me:
> HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
> SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>
> on all three nodes now.
>
> i started spark master on the namenode and i started spark slaves (2) on
> two datanodes of the cluster.
>
> so far so good.
>
> now i run my usual test command.
>
> $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
> hive.execution.engine=spark; select date_key, count(*) from
> fe_inventory.merged_properties_hist group by 1 order by 1;'
>
> i get a little further now and find the stderr from the Spark Web UI
> interface (nice) and it reports this:
>
> 17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to 
> spark://Worker@172.19.79.127:40145
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>   at 
> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:47)
>   at 
> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:134)
>   at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>   ... 6 more
>
>
>
> searching around the internet i find this is probably a compatibility
> issue.
>
> i know. i know. no surprise here.
>
> so i guess i just got to the point where everybody else is... build spark
> w/o hive.
>
> lemme see what happens next.
>
>
>
>
>
> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague 
> wrote:
>
>> thanks.  I haven't had a chance to dig into this again today but i do
>> appreciate the pointer.  I'll keep you posted.
>>
>> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar 
>> wrote:
>>
>>> You can try increasing the value of hive.spark.client.connect.timeout.
>>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>>> gets launched in a YARN container (assuming you are running Spark in
>>> yarn-client mode), so you just have to find the logs for that container.
>>>
>>> --Sahil
>>>
>>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague 
>>> wrote:
>>>
 i _seem_ to be getting closer.  Maybe its just wishful thinking.
 Here's where i'm at now.

 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
 CreateSubmissionResponse:
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "action" : "CreateSubmissionResponse",
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "message" : "Driver successfully submitted as 
 driver-20170926211038-0003",
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "serverSparkVersion" : "2.2.0",
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "submissionId" : "driver-20170926211038-0003",
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "success" : true
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
 dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
 Client 

Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
ok.. getting further.  seems now i have to deploy hive to all nodes in the
cluster - don't think i had to do that before but not a big deal to do it
now.

for me:
HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6

on all three nodes now.

i started spark master on the namenode and i started spark slaves (2) on
two datanodes of the cluster.

so far so good.

now i run my usual test command.

$ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
hive.execution.engine=spark; select date_key, count(*) from
fe_inventory.merged_properties_hist group by 1 order by 1;'

i get a little further now and find the stderr from the Spark Web UI
interface (nice) and it reports this:

17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to
spark://Worker@172.19.79.127:40145
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at 
org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:47)
at 
org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
... 6 more



searching around the internet i find this is probably a compatibility issue.

i know. i know. no surprise here.

so i guess i just got to the point where everybody else is... build spark
w/o hive.

lemme see what happens next.





On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague  wrote:

> thanks.  I haven't had a chance to dig into this again today but i do
> appreciate the pointer.  I'll keep you posted.
>
> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar 
> wrote:
>
>> You can try increasing the value of hive.spark.client.connect.timeout.
>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>> gets launched in a YARN container (assuming you are running Spark in
>> yarn-client mode), so you just have to find the logs for that container.
>>
>> --Sahil
>>
>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague 
>> wrote:
>>
>>> i _seem_ to be getting closer.  Maybe its just wishful thinking.
>>> Here's where i'm at now.
>>>
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>>> CreateSubmissionResponse:
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "action" : "CreateSubmissionResponse",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "message" : "Driver successfully submitted as driver-20170926211038-0003",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "serverSparkVersion" : "2.2.0",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "submissionId" : "driver-20170926211038-0003",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "success" : true
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
>>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
>>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>>> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.1
>>> 9.73.136:8020 from dwr: closed
>>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
>>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>>> Clien
>>> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
>>> from dwr: stopped, remaining connections 0
>>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e
>>> main] client.SparkClientImpl: Timed out waiting for client to connect.
>>> *Possible reasons include network issues, errors in remote driver or the
>>> cluster has no available resources, etc.*
>>> *Please check YARN or Spark driver's logs for further information.*
>>> java.util.concurrent.ExecutionException: 
>>> java.util.concurrent.TimeoutException:
>>> Timed out waiting for client connection.
>>> at 
>>> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>>> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>>> at 
>>> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
>>> [hive-exec-2.3.0.jar:2.3.0]
>>> at 
>>> 

Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
thanks.  I haven't had a chance to dig into this again today but i do
appreciate the pointer.  I'll keep you posted.

On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar 
wrote:

> You can try increasing the value of hive.spark.client.connect.timeout.
> Would also suggest taking a look at the HoS Remote Driver logs. The driver
> gets launched in a YARN container (assuming you are running Spark in
> yarn-client mode), so you just have to find the logs for that container.
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague 
> wrote:
>
>> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
>> where i'm at now.
>>
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>> CreateSubmissionResponse:
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "action" : "CreateSubmissionResponse",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "message" : "Driver successfully submitted as driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "serverSparkVersion" : "2.2.0",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "submissionId" : "driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "success" : true
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.1
>> 9.73.136:8020 from dwr: closed
>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Clien
>> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
>> from dwr: stopped, remaining connections 0
>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e
>> main] client.SparkClientImpl: Timed out waiting for client to connect.
>> *Possible reasons include network issues, errors in remote driver or the
>> cluster has no available resources, etc.*
>> *Please check YARN or Spark driver's logs for further information.*
>> java.util.concurrent.ExecutionException: 
>> java.util.concurrent.TimeoutException:
>> Timed out waiting for client connection.
>> at 
>> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>> at 
>> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at 
>> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.c
>> reateRemoteClient(RemoteHiveSparkClient.java:101)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<
>> init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
>> createHiveSparkClient(HiveSparkClientFactory.java:73)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImp
>> l.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionMan
>> agerImpl.getSession(SparkSessionManagerImpl.java:115)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSpark
>> Session(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
>> [hive-exec-2.3.0.jar:2.3.0]
>>
>>
>> i'll dig some more tomorrow.
>>
>> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague 
>> wrote:
>>
>>> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep
>>> you posted on my progress.
>>>
>>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan >> > wrote:
>>>
 Hi,

 > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
 spark session: org.apache.hadoop.hive.ql.metadata.HiveException:
 Failed to create spark client.

 I get inexplicable errors with Hive-on-Spark unless I do a three step
 build.

 Build Hive first, use that version to build Spark, use that Spark
 version to rebuild Hive.

 I have to do this to make it work because Spark contains Hive jars and
 Hive contains Spark jars in the class-path.

 And specifically I have to edit the pom.xml files, 

Re: hive on spark - why is it so hard?

2017-09-27 Thread Sahil Takiar
You can try increasing the value of hive.spark.client.connect.timeout.
Would also suggest taking a look at the HoS Remote Driver logs. The driver
gets launched in a YARN container (assuming you are running Spark in
yarn-client mode), so you just have to find the logs for that container.
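
A rough sketch of both suggestions - the timeout value and query are purely illustrative:

    $ hive --hiveconf hive.execution.engine=spark \
           --hiveconf hive.spark.client.connect.timeout=30000 \
           -e 'select 1;'
    # once the Remote Driver's YARN application id is known, pull its container logs
    $ yarn logs -applicationId <application_id>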

--Sahil

On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague  wrote:

> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
> where i'm at now.
>
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
> CreateSubmissionResponse:
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "action" : "CreateSubmissionResponse",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "message" : "Driver successfully submitted as driver-20170926211038-0003",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "serverSparkVersion" : "2.2.0",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "submissionId" : "driver-20170926211038-0003",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "success" : true
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.
> 19.73.136:8020 from dwr: closed
> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
> Clien
> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
> from dwr: stopped, remaining connections 0
> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main]
> client.SparkClientImpl: Timed out waiting for client to connect.
> *Possible reasons include network issues, errors in remote driver or the
> cluster has no available resources, etc.*
> *Please check YARN or Spark driver's logs for further information.*
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.TimeoutException:
> Timed out waiting for client connection.
> at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
> at 
> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
> [hive-exec-2.3.0.jar:2.3.0]
> at 
> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.
> createRemoteClient(RemoteHiveSparkClient.java:101)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.
> RemoteHiveSparkClient.(RemoteHiveSparkClient.java:97)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
> createHiveSparkClient(HiveSparkClientFactory.java:73)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.session.
> SparkSessionImpl.open(SparkSessionImpl.java:62)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.session.
> SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.
> getSparkSession(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.optimizer.spark.
> SetSparkReducerParallelism.getSparkMemoryAndCores(
> SetSparkReducerParallelism.java:236) [hive-exec-2.3.0.jar:2.3.0]
>
>
> i'll dig some more tomorrow.
>
> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague 
> wrote:
>
>> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
>> posted on my progress.
>>
>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
>> wrote:
>>
>>> Hi,
>>>
>>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
>>> to create spark client.
>>>
>>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>>> build.
>>>
>>> Build Hive first, use that version to build Spark, use that Spark
>>> version to rebuild Hive.
>>>
>>> I have to do this to make it work because Spark contains Hive jars and
>>> Hive contains Spark jars in the class-path.
>>>
>>> And specifically I have to edit the pom.xml files, instead of passing in
>>> params with -Dspark.version, because the installed pom files don't get
>>> replacements from the build args.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>


-- 
Sahil Takiar
Software Engineer at Cloudera
takiar.sa...@gmail.com | (510) 673-0309


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
i _seem_ to be getting closer.  Maybe it's just wishful thinking.   Here's
where i'm at now.

2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
CreateSubmissionResponse:
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"action" : "CreateSubmissionResponse",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"message" : "Driver successfully submitted as driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"serverSparkVersion" : "2.2.0",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"submissionId" : "driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"success" : true
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: closed
2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Clien
t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: stopped, remaining connections 0
2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main]
client.SparkClientImpl: Timed out waiting for client to connect.
Possible reasons include network issues, errors in remote driver or the
cluster has no available resources, etc.
Please check YARN or Spark driver's logs for further information.
java.util.concurrent.ExecutionException:
java.util.concurrent.TimeoutException: Timed out waiting for client
connection.
at
io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
~[netty-all-4.0.29.Final.jar:4.0.29.Final]
at
org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.(RemoteHiveSparkClient.java:97)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
[hive-exec-2.3.0.jar:2.3.0]


i'll dig some more tomorrow.

On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague  wrote:

> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
> posted on my progress.
>
> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
> wrote:
>
>> Hi,
>>
>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
>> to create spark client.
>>
>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>> build.
>>
>> Build Hive first, use that version to build Spark, use that Spark version
>> to rebuild Hive.
>>
>> I have to do this to make it work because Spark contains Hive jars and
>> Hive contains Spark jars in the class-path.
>>
>> And specifically I have to edit the pom.xml files, instead of passing in
>> params with -Dspark.version, because the installed pom files don't get
>> replacements from the build args.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
posted on my progress.

On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
wrote:

> Hi,
>
> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
> to create spark client.
>
> I get inexplicable errors with Hive-on-Spark unless I do a three step
> build.
>
> Build Hive first, use that version to build Spark, use that Spark version
> to rebuild Hive.
>
> I have to do this to make it work because Spark contains Hive jars and
> Hive contains Spark jars in the class-path.
>
> And specifically I have to edit the pom.xml files, instead of passing in
> params with -Dspark.version, because the installed pom files don't get
> replacements from the build args.
>
> Cheers,
> Gopal
>
>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
well this is the spark-submit line from above:

   2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
main] client.SparkClientImpl: Running client driver with argv:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit

and that's pretty clearly v2.2


I do have other versions of spark on the namenode so lemme remove those and
see what happens


A-HA! dang it!

$ echo $SPARK_HOME
/usr/local/spark

well that clearly needs to be: /usr/lib/spark-2.2.0-bin-hadoop2.6

how did i miss that? unbelievable.
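
for the record, a sketch of the fix using the paths from this thread (the sanity
query is just illustrative):

    $ export SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
    $ hive --hiveconf hive.execution.engine=spark -e 'select 1;'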


Thank you Sahil!   Let's see what happens next!

Cheers,
Stephen


On Tue, Sep 26, 2017 at 4:12 PM, Sahil Takiar 
wrote:

> Are you sure you are using Spark 2.2.0? Based on the stack-trace it looks
> like your call to spark-submit it using an older version of Spark (looks
> like some early 1.x version). Do you have SPARK_HOME set locally? Do you
> have older versions of Spark installed locally?
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 3:33 PM, Stephen Sprague 
> wrote:
>
>> thanks Sahil.  here it is.
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/scheduler/SparkListenerInterface
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:344)
>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.
>> scala:318)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:
>> 75)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.scheduler.SparkListenerInterface
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 5 more
>>
>> at 
>> org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212)
>> ~[hive-exec-2.3.0.jar:2.3.0]
>> at 
>> org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500)
>> ~[hive-exec-2.3.0.jar:2.3.0]
>> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
>> FAILED: SemanticException Failed to get a spark session:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
>> client.
>> 2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
>> main] ql.Driver: FAILED: SemanticException Failed to get a spark session:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
>> client.
>> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
>> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to
>> create spark client.
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.process(SetSparkReducerParallelism.java:173)
>> at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch
>> (DefaultRuleDispatcher.java:90)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAnd
>> Return(DefaultGraphWalker.java:105)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(De
>> faultGraphWalker.java:89)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:56)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalkin
>> g(DefaultGraphWalker.java:120)
>> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetRe
>> ducerParallelism(SparkCompiler.java:288)
>> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimize
>> OperatorPlan(SparkCompiler.java:122)
>> at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCom
>> piler.java:140)
>> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInte
>> rnal(SemanticAnalyzer.java:11253)
>> at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeIntern
>> al(CalcitePlanner.java:286)
>> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze
>> (BaseSemanticAnalyzer.java:258)
>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
>> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java
>> :1316)
>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
>> at 

Re: hive on spark - why is it so hard?

2017-09-26 Thread Gopal Vijayaraghavan
Hi,

> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark 
> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create 
> spark client.
 
I get inexplicable errors with Hive-on-Spark unless I do a three step build.

Build Hive first, use that version to build Spark, use that Spark version to 
rebuild Hive.

I have to do this to make it work because Spark contains Hive jars and Hive 
contains Spark jars in the class-path.

And specifically I have to edit the pom.xml files, instead of passing in params 
with -Dspark.version, because the installed pom files don't get replacements 
from the build args.
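
Very roughly - and assuming the pom.xml edits above are done by hand first - the three
step build looks something like this (directory names are illustrative):

    $ (cd hive  && mvn clean install -DskipTests)      # 1. build Hive
    $ (cd spark && ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz \
          "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided")   # 2. build Spark against it
    $ (cd hive  && mvn clean install -DskipTests)      # 3. rebuild Hive against that Spark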

Cheers,
Gopal




Re: hive on spark - why is it so hard?

2017-09-26 Thread Sahil Takiar
Are you sure you are using Spark 2.2.0? Based on the stack trace it looks
like your call to spark-submit is using an older version of Spark (looks
like some early 1.x version). Do you have SPARK_HOME set locally? Do you
have older versions of Spark installed locally?
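
A few quick, generic checks for a stray install being picked up:

    $ echo $SPARK_HOME
    $ which spark-submit
    $ spark-submit --version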

--Sahil

On Tue, Sep 26, 2017 at 3:33 PM, Stephen Sprague  wrote:

> thanks Sahil.  here it is.
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/scheduler/SparkListenerInterface
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:344)
> at org.apache.spark.deploy.SparkSubmit$.launch(
> SparkSubmit.scala:318)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.
> SparkListenerInterface
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 5 more
>
> at 
> org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212)
> ~[hive-exec-2.3.0.jar:2.3.0]
> at 
> org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500)
> ~[hive-exec-2.3.0.jar:2.3.0]
> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
> FAILED: SemanticException Failed to get a spark session:
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
> 2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
> ql.Driver: FAILED: SemanticException Failed to get a spark session:
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to
> create spark client.
> at org.apache.hadoop.hive.ql.optimizer.spark.
> SetSparkReducerParallelism.getSparkMemoryAndCores(
> SetSparkReducerParallelism.java:240)
> at org.apache.hadoop.hive.ql.optimizer.spark.
> SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
> at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(
> DefaultRuleDispatcher.java:90)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.
> dispatchAndReturn(DefaultGraphWalker.java:105)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(
> DefaultGraphWalker.java:89)
> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(
> PreOrderWalker.java:56)
> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(
> PreOrderWalker.java:61)
> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(
> PreOrderWalker.java:61)
> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(
> PreOrderWalker.java:61)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(
> DefaultGraphWalker.java:120)
> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.
> runSetReducerParallelism(SparkCompiler.java:288)
> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.
> optimizeOperatorPlan(SparkCompiler.java:122)
> at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(
> TaskCompiler.java:140)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.
> analyzeInternal(SemanticAnalyzer.java:11253)
> at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(
> CalcitePlanner.java:286)
> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.
> analyze(BaseSemanticAnalyzer.java:258)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.
> java:1316)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1236)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1226)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
> CliDriver.java:233)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(
> CliDriver.java:184)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(
> CliDriver.java:403)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(
> CliDriver.java:336)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(
> CliDriver.java:787)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 

Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
thanks Sahil.  here it is.

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/scheduler/SparkListenerInterface
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.scheduler.SparkListenerInterface
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more

at
org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212)
~[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500)
~[hive-exec-2.3.0.jar:2.3.0]
at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
FAILED: SemanticException Failed to get a spark session:
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
client.
2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
ql.Driver: FAILED: SemanticException Failed to get a spark session:
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
client.
org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create
spark client.
at
org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
at
org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
at
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
at
org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:56)
at
org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
at
org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
at
org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
at
org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetReducerParallelism(SparkCompiler.java:288)
at
org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:122)
at
org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:140)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11253)
at
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
at
org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1316)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1236)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1226)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


It bugs me that that class is in spark-core_2.11-2.2.0.jar yet is so seemingly
out of reach. :(



On Tue, Sep 26, 2017 at 2:44 PM, Sahil Takiar 
wrote:

> Hey Stephen,
>
> Can you send the full stack 

Re: hive on spark - why is it so hard?

2017-09-26 Thread Sahil Takiar
Hey Stephen,

Can you send the full stack trace for the NoClassDefFoundError? For Hive
2.3.0, we only support Spark 2.0.0. Hive may work with more recent versions
of Spark, but we only test with Spark 2.0.0.
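
As a quick (hypothetical) way to confirm which Spark version a given Hive release
was built against, you can check the spark.version property in the Hive source pom -
the source directory name below is an assumption:

    $ grep -m1 '<spark.version>' apache-hive-2.3.0-src/pom.xml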

--Sahil

On Tue, Sep 26, 2017 at 2:35 PM, Stephen Sprague  wrote:

> * i've installed hive 2.3 and spark 2.2
>
> * i've read this doc plenty of times -> https://cwiki.apache.org/
> confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>
> * i run this query:
>
>hive --hiveconf hive.root.logger=DEBUG,console -e 'set
> hive.execution.engine=spark; select date_key, count(*) from
> fe_inventory.merged_properties_hist group by 1 order by 1;'
>
>
> * i get this error:
>
> *   Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/scheduler/SparkListenerInterface*
>
>
> * this class in:
>   /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>
> * i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars
>
> * i have updated hive-site.xml to set spark.yarn.jars to it.
>
> * i see this is the console:
>
> 2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3 main]
> spark.HiveSparkClientFactory: load spark property from hive configuration
> (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*
> ).
>
> * i see this on the console
>
> 2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
> client.SparkClientImpl: Running client driver with argv:
> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
> /tmp/spark-submit.6105784757200912217.properties --class
> org.apache.hive.spark.client.RemoteDriver 
> /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar
> --remote-host dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
> hive.spark.client.connect.timeout=1000 --conf 
> hive.spark.client.server.connect.timeout=9
> --conf hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
> --conf hive.spark.client.rpc.server.address=null
>
> * i even print out CLASSPATH in this script: /usr/lib/spark-2.2.0-bin-
> hadoop2.6/bin/spark-submit
>
> and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is
> in it.
>
> ​so i ask... what am i missing?
>
> thanks,
> Stephen​
>
>
>
>
>
>


-- 
Sahil Takiar
Software Engineer at Cloudera
takiar.sa...@gmail.com | (510) 673-0309


Re: Hive on Spark

2017-08-22 Thread Vihang Karajgaonkar
Xuefu is planning to give a talk on Hive-on-Spark @ Uber at the user meetup
this week. We can check if we can share the presentation on this list for
folks who can't attend the meetup.

https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/


On Mon, Aug 21, 2017 at 11:44 PM, peter zhang 
wrote:

> Hi All,
> Has anybody used hive on spark in your production environment? How
> is its stability and performance compared with spark sql?
> Hope anybody can share your experience.
>
> Thanks in advance!
>


Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
yeah but... is the glass half-full or half-empty?  sure this might suck but
keep your head high, bro! Lots of it (hive) does work. :)


On Fri, Mar 17, 2017 at 2:25 PM, hernan saab 
wrote:

> Stephan,
>
> Thanks for the response.
>
> The one thing that I don't appreciate from those who promote and DOCUMENT
> spark on hive is that, seemingly, there is absolutely no evidence seen that
> says that hive on spark WORKS.
> As a matter of fact, after a lot of pain, I noticed it is not supported by
> just about anybody.
>
> If someone dares to document Hive on Spark (see link
> https://cwiki.apache.org/confluence/display/Hive/Hive+
> on+Spark%3A+Getting+Started)  why can't they have the decency to mention
> what specific combo of Hadoop/Spark/Hive versions used that works? Have a
> git repo included in a doc with all the right versions and libraries. Why
> not? We can start from there and progressively use newer libraries in case
> the doc becomes stale. I am not really asking much, I just want to know
> what the documenter used to claim that Hive on Spark works, that's it.
>
> Clearly, for most cases, this setup is broken and it misleads people to
> waste time on a broken setup.
>
> I love this tech. But I do notice that there are some mean-spirited or very
> negligent actions made by the apache development community. Documenting
> hive on spark while knowing it won't work for most cases means apache
> developers don't give a crap about the time wasted by people like us.
>
>
>
>
> On Friday, March 17, 2017 1:14 PM, Edward Capriolo 
> wrote:
>
>
>
>
> On Fri, Mar 17, 2017 at 2:56 PM, hernan saab  > wrote:
>
> I have been in a similar world of pain. Basically, I tried to use an
> external Hive to have user access controls with a spark engine.
> At the end, I realized that it was a better idea to use apache tez instead
> of a spark engine for my particular case.
>
> But the journey is what I want to share with you.
> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop ,
> Parquet etc etc are not interchangeable as we would like to think. There
> are very limited combinations for very specific versions. This is why tools
> like Ambari can be useful. Ambari sets a path of combos of versions known
> to work and the dirty work is done under the UI.
>
> More often than not, when you try a version that few people tried, you
> will get error messages that will derailed you and cause you to waste a lot
> of time.
>
> In addition, this group, as well as many other apache big data user
> groups,  provides extremely poor support for users. The answers you usually
> get are not even hints to a solution. Their answers usually translate to
> "there is nothing I am willing to do about your problem. If I did, I should
> get paid" in many cryptic ways.
>
> If you ask your question to the Spark group they will take you to the Hive
> group and viceversa (I can almost guarantee it based on previous
> experiences)
>
> But in hindsight, people who work on this kinds of things typically make
> more money that the average developers. If you make more $$s it makes sense
> learning this stuff is supposed to be harder.
>
> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
> you are querying large files.
>
>
>
> On Friday, March 17, 2017 11:33 AM, Stephen Sprague 
> wrote:
>
>
> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
> with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
> to me, alas.
>
> thanks in advance.
>
> Cheers,
> Stephen.
>
> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague 
> wrote:
>
> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/ questions/41953688/setting-
> spark-as-default-execution- engine-for-hive
> 
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/ jira/browse/SPARK-17563
> 
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/ jira/browse/HIVE-14029
> 
>
> which looks to be fixed in hive 2.2 - which is not released yet.
>
>
> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
> 2.2?
>
> thanks,
> Stephen.
>
>
>
>
>
> Stephan,
>
> I understand some of your frustration.  Remember that many in open source
> are volunteering their time. This is why if you pay a vendor for support of

Re: hive on spark - version question

2017-03-17 Thread hernan saab
Stephan,
Thanks for the response.
The one thing that I don't appreciate from those who promote and DOCUMENT spark 
on hive is that, seemingly, there is absolutely no evidence seen that says that 
hive on spark WORKS. As a matter of fact, after a lot of pain, I noticed it is 
not supported by just about anybody.
If someone dares to document Hive on Spark (see link 
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
  why can't they have the decency to mention what specific combo of 
Hadoop/Spark/Hive versions was used that works? Have a git repo included in a doc 
with all the right versions and libraries. Why not? We can start from there and 
progressively use newer libraries in case the doc becomes stale. I am not 
really asking much, I just want to know what the documenter used to claim that 
Hive on Spark works, that's it.
Clearly, for most cases, this setup is broken and it misleads people to waste 
time on a broken setup.
I love this tech. But I do notice that there is some mean spirited or very 
negligent actions made by the apache development community. Documenting hive on 
spark while knowing it won't work for most cases means apache developers don't 
give a crap about the time wasted by people like us.

 

On Friday, March 17, 2017 1:14 PM, Edward Capriolo  
wrote:
 

 

On Fri, Mar 17, 2017 at 2:56 PM, hernan saab  
wrote:

I have been in a similar world of pain. Basically, I tried to use an external 
Hive to have user access controls with a spark engine. At the end, I realized 
that it was a better idea to use apache tez instead of a spark engine for my 
particular case.
But the journey is what I want to share with you. The big data apache tools and 
libraries such as Hive, Tez, Spark, Hadoop, Parquet etc. are not 
interchangeable as we would like to think. There are very limited combinations 
for very specific versions. This is why tools like Ambari can be useful. Ambari 
sets a path of combos of versions known to work and the dirty work is done 
under the UI. 
More often than not, when you try a version that few people have tried, you will get 
error messages that will derail you and cause you to waste a lot of time.
In addition, this group, as well as many other apache big data user groups,  
provides extremely poor support for users. The answers you usually get are not 
even hints to a solution. Their answers usually translate to "there is nothing 
I am willing to do about your problem. If I did, I should get paid" in many 
cryptic ways.
If you ask your question to the Spark group they will take you to the Hive 
group and viceversa (I can almost guarantee it based on previous experiences)
But in hindsight, people who work on these kinds of things typically make more 
money than the average developer. If you make more $$s, it makes sense that learning 
this stuff is supposed to be harder.
Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if you 
are querying large files.
 

On Friday, March 17, 2017 11:33 AM, Stephen Sprague  
wrote:
 

 :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work 
with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker to 
me, alas.

thanks in advance.

Cheers,
Stephen.

On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague  wrote:

hi guys,
wondering where we stand with Hive On Spark these days?

i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental versions) 
and running up against this class not found:

java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener


searching the Cyber i find this:
    1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive

    which pretty much describes my situation too and it references this:


    2. https://issues.apache.org/jira/browse/SPARK-17563

    which indicates a "won't fix" - but does reference this:


    3. https://issues.apache.org/jira/browse/HIVE-14029

    which looks to be fixed in hive 2.2 - which is not released yet.


so if i want to use spark 2.1.0 with hive am i out of luck - until hive 2.2?

thanks,
Stephen.





   

Stephan,  
I understand some of your frustration.  Remember that many in open source are 
volunteering their time. This is why if you pay a vendor for support of some 
software you might pay 50K a year or $200.00 an hour. If I was your 
vendor/consultant I would have started the clock 10 minutes ago just to answer 
this email :). The only "pay" I ever got from Hive is that I can use it as a 
resume bullet point, and I wrote a book which pays me royalties.
As it relates specifically to your problem, when you see the trends you are 
seeing it probably means you are in a minority of the user base. Either you're 
doing something no one else is doing, you are too cutting edge, or no one has 
an easy solution. Hive is making the move from the classic 

Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
thanks for the comments and for sure all relevant. And yeah I feel the pain
just like the next guy but that's the part of the opensource "life style"
you subscribe to when using it.

The upside payoff has gotta be worth the downside risk - or else forget
about it right? Here in the Hive world, in my experience anyway, it's been
great.  Gotta roll with it, be courteous, be persistent and sometimes
things just work out.

Getting back to Spark and Tez, yes by all means i'm a big Tez user already so
i was hoping to see what Spark brought to the table and i didn't want to diddle
around with Spark < 2.0.   That's cool. I can live with that not being
nailed down yet. I'll just wait for hive 2.2 and rattle the cage again! ha!


All good!

Cheers,
Stephen.

On Fri, Mar 17, 2017 at 1:14 PM, Edward Capriolo 
wrote:

>
>
> On Fri, Mar 17, 2017 at 2:56 PM, hernan saab  > wrote:
>
>> I have been in a similar world of pain. Basically, I tried to use an
>> external Hive to have user access controls with a spark engine.
>> At the end, I realized that it was a better idea to use apache tez
>> instead of a spark engine for my particular case.
>>
>> But the journey is what I want to share with you.
>> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop
>> , Parquet etc etc are not interchangeable as we would like to think. There
>> are very limited combinations for very specific versions. This is why tools
>> like Ambari can be useful. Ambari sets a path of combos of versions known
>> to work and the dirty work is done under the UI.
>>
>> More often than not, when you try a version that few people tried, you
>> will get error messages that will derailed you and cause you to waste a lot
>> of time.
>>
>> In addition, this group, as well as many other apache big data user
>> groups,  provides extremely poor support for users. The answers you usually
>> get are not even hints to a solution. Their answers usually translate to
>> "there is nothing I am willing to do about your problem. If I did, I should
>> get paid" in many cryptic ways.
>>
>> If you ask your question to the Spark group they will take you to the
>> Hive group and viceversa (I can almost guarantee it based on previous
>> experiences)
>>
>> But in hindsight, people who work on this kinds of things typically make
>> more money that the average developers. If you make more $$s it makes sense
>> learning this stuff is supposed to be harder.
>>
>> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
>> you are querying large files.
>>
>>
>>
>> On Friday, March 17, 2017 11:33 AM, Stephen Sprague 
>> wrote:
>>
>>
>> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will
>> work with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal
>> breaker to me, alas.
>>
>> thanks in advance.
>>
>> Cheers,
>> Stephen.
>>
>> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague 
>> wrote:
>>
>> hi guys,
>> wondering where we stand with Hive On Spark these days?
>>
>> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
>> versions) and running up against this class not found:
>>
>> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>>
>>
>> searching the Cyber i find this:
>> 1. http://stackoverflow.com/ questions/41953688/setting-
>> spark-as-default-execution- engine-for-hive
>> 
>>
>> which pretty much describes my situation too and it references this:
>>
>>
>> 2. https://issues.apache.org/ jira/browse/SPARK-17563
>> 
>>
>> which indicates a "won't fix" - but does reference this:
>>
>>
>> 3. https://issues.apache.org/ jira/browse/HIVE-14029
>> 
>>
>> which looks to be fixed in hive 2.2 - which is not released yet.
>>
>>
>> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
>> 2.2?
>>
>> thanks,
>> Stephen.
>>
>>
>>
>>
>>
> Stephan,
>
> I understand some of your frustration.  Remember that many in open source
> are volunteering their time. This is why if you pay a vendor for support of
> some software you might pay 50K a year or $200.00 an hour. If I was your
> vendor/consultant I would have started the clock 10 minutes ago just to
> answer this email :). The only "pay" I ever got from Hive is that I can use
> it as a resume bullet point, and I wrote a book which pays me royalties.
>
> As it relates specifically to your problem, when you see the trends you
> are seeing it probably means you are in a minority of the user base. Either
> your doing something no one else is doing, you are too cutting edge, or no
> one has an easy solution. Hive is making the move from the classic
> MapReduce, two other execution engines have been made Tez and HiveOnSpark.
> Because we 

Re: hive on spark - version question

2017-03-17 Thread Edward Capriolo
On Fri, Mar 17, 2017 at 2:56 PM, hernan saab 
wrote:

> I have been in a similar world of pain. Basically, I tried to use an
> external Hive to have user access controls with a spark engine.
> At the end, I realized that it was a better idea to use apache tez instead
> of a spark engine for my particular case.
>
> But the journey is what I want to share with you.
> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop ,
> Parquet etc etc are not interchangeable as we would like to think. There
> are very limited combinations for very specific versions. This is why tools
> like Ambari can be useful. Ambari sets a path of combos of versions known
> to work and the dirty work is done under the UI.
>
> More often than not, when you try a version that few people tried, you
> will get error messages that will derailed you and cause you to waste a lot
> of time.
>
> In addition, this group, as well as many other apache big data user
> groups,  provides extremely poor support for users. The answers you usually
> get are not even hints to a solution. Their answers usually translate to
> "there is nothing I am willing to do about your problem. If I did, I should
> get paid" in many cryptic ways.
>
> If you ask your question to the Spark group they will take you to the Hive
> group and viceversa (I can almost guarantee it based on previous
> experiences)
>
> But in hindsight, people who work on this kinds of things typically make
> more money that the average developers. If you make more $$s it makes sense
> learning this stuff is supposed to be harder.
>
> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
> you are querying large files.
>
>
>
> On Friday, March 17, 2017 11:33 AM, Stephen Sprague 
> wrote:
>
>
> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
> with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
> to me, alas.
>
> thanks in advance.
>
> Cheers,
> Stephen.
>
> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague 
> wrote:
>
> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/ questions/41953688/setting-
> spark-as-default-execution- engine-for-hive
> 
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/ jira/browse/SPARK-17563
> 
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/ jira/browse/HIVE-14029
> 
>
> which looks to be fixed in hive 2.2 - which is not released yet.
>
>
> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
> 2.2?
>
> thanks,
> Stephen.
>
>
>
>
>
Stephan,

I understand some of your frustration.  Remember that many in open source
are volunteering their time. This is why if you pay a vendor for support of
some software you might pay 50K a year or $200.00 an hour. If I was your
vendor/consultant I would have started the clock 10 minutes ago just to
answer this email :). The only "pay" I ever got from Hive is that I can use
it as a resume bullet point, and I wrote a book which pays me royalties.

As it relates specifically to your problem, when you see the trends you are
seeing it probably means you are in a minority of the user base. Either
you're doing something no one else is doing, you are too cutting edge, or no
one has an easy solution. Hive is making the move from the classic
MapReduce, two other execution engines have been made Tez and HiveOnSpark.
Because we are open source we allow people to "scratch an itch" that is the
Apache way. From time to time it means something that was added stops being
viable because of lack of support.

I agree with your final assessment which is Tez is the most viable engine
for Hive. This is by no means a put down of the HiveOnSpark work and it
does not mean it will never be the most viable. By the same token, if the
versions fall out of sync and all that exists is complaints, the viability
speaks for itself.

Remember that keeping two fast moving things together is no easy chore. I
used to run Hive + cassandra. Seems easy, crap two versions of common CLI,
shade one version everything works, crap new hive release has different
versions of thrift, shade + patch, crap now one of the other dependencies
is incompatible fork + shade + patch. At some point you have to say to
yourself if I can not make critical mass of this solution such that I am
the only one doing/patching it then 

Re: hive on spark - version question

2017-03-17 Thread hernan saab
I have been in a similar world of pain. Basically, I tried to use an external 
Hive to have user access controls with a spark engine. At the end, I realized 
that it was a better idea to use apache tez instead of a spark engine for my 
particular case.
But the journey is what I want to share with you. The big data apache tools and 
libraries such as Hive, Tez, Spark, Hadoop, Parquet etc. are not 
interchangeable as we would like to think. There are very limited combinations 
for very specific versions. This is why tools like Ambari can be useful. Ambari 
sets a path of combos of versions known to work and the dirty work is done 
under the UI. 
More often than not, when you try a version that few people have tried, you will get 
error messages that will derail you and cause you to waste a lot of time.
In addition, this group, as well as many other apache big data user groups,  
provides extremely poor support for users. The answers you usually get are not 
even hints to a solution. Their answers usually translate to "there is nothing 
I am willing to do about your problem. If I did, I should get paid" in many 
cryptic ways.
If you ask your question to the Spark group they will take you to the Hive 
group and viceversa (I can almost guarantee it based on previous experiences)
But in hindsight, people who work on these kinds of things typically make more 
money than the average developer. If you make more $$s, it makes sense that learning 
this stuff is supposed to be harder.
Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if you 
are querying large files.
 

On Friday, March 17, 2017 11:33 AM, Stephen Sprague  
wrote:
 

 :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work 
with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker to 
me, alas.

thanks in advance.

Cheers,
Stephen.

On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague  wrote:

hi guys,
wondering where we stand with Hive On Spark these days?

i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental versions) 
and running up against this class not found:

java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener


searching the Cyber i find this:
    1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive

    which pretty much describes my situation too and it references this:


    2. https://issues.apache.org/jira/browse/SPARK-17563

    which indicates a "won't fix" - but does reference this:


    3. https://issues.apache.org/jira/browse/HIVE-14029

    which looks to be fixed in hive 2.2 - which is not released yet.


so if i want to use spark 2.1.0 with hive am i out of luck - until hive 2.2?

thanks,
Stephen.





   

Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
:(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
to me, alas.

thanks in advance.

Cheers,
Stephen.

On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague 
wrote:

> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/questions/41953688/setting-
> spark-as-default-execution-engine-for-hive
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/jira/browse/SPARK-17563
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/jira/browse/HIVE-14029
>
> which looks to be fixed in hive 2.2 - which is not released yet.
>
>
> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
> 2.2?
>
> thanks,
> Stephen.
>
>


RE: Hive on Spark not working

2016-11-29 Thread Joaquin Alzola
Being unable to integrate Hive with Spark separately, I just started the Thrift 
server directly on Spark.
Now it is working as expected.
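(For anyone trying the same workaround, a minimal sketch — the master URL is the one from the configuration quoted below; the beeline host and port are just the Thrift server defaults:)

  $SPARK_HOME/sbin/start-thriftserver.sh --master spark://172.16.173.31:7077
  beeline -u jdbc:hive2://localhost:10000 -n hive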

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: 29 November 2016 11:12
To: user <user@hive.apache.org>
Subject: Re: Hive on Spark not working

Hive on Spark engine only works with Spark 1.3.1.


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 29 November 2016 at 07:56, Furcy Pin 
<furcy@flaminem.com<mailto:furcy@flaminem.com>> wrote:
ClassNotFoundException generally means that jars are missing from your class 
path.

You probably need to link the spark jar to $HIVE_HOME/lib
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive

On Tue, Nov 29, 2016 at 2:03 AM, Joaquin Alzola 
<joaquin.alz...@lebara.com<mailto:joaquin.alz...@lebara.com>> wrote:
Hi Guys

No matter what I do, when I execute “select count(*) from employee” I get 
the following output in the logs:
It is quite funny because if I put hive.execution.engine=mr the output is 
correct. If I put hive.execution.engine=spark then I get the errors below.
If I do the search directly through spark-shell it works great.
+---+
|_c0|
+---+
|1005635|
+---+
So there has to be a problem from hive to spark.

It seems the RPC(??) connection is not set up…. Can somebody guide me on what 
to look for.
spark.master=spark://172.16.173.31:7077
hive.execution.engine=spark
spark.executor.extraClassPath
/mnt/spark/lib/spark-1.6.2-yarn-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar

Hive 2.0.1 --> Spark 1.6.2 --> Hadoop 2.6.5 --> Scala 2.10

2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher 
(RpcDispatcher.java:handleError(142)) - Received error 
message:io.netty.handler.codec.DecoderException: 
java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:358)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:230)
at 
io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Nati

RE: Hive on Spark not working

2016-11-29 Thread Joaquin Alzola
Hi Mich

I read in an older post that you made it work as well with the configuration 
I have:
Hive 2.0.1 --> Spark 1.6.2 --> Hadoop 2.6.5 --> Scala 2.10
Did you only make it work with Hive 1.2.1 --> Spark 1.3.1 --> etc.?

BR

Joaquin

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: 29 November 2016 11:12
To: user <user@hive.apache.org>
Subject: Re: Hive on Spark not working

Hive on Spark engine only works with Spark 1.3.1.


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 29 November 2016 at 07:56, Furcy Pin 
<furcy@flaminem.com<mailto:furcy@flaminem.com>> wrote:
ClassNotFoundException generally means that jars are missing from your class 
path.

You probably need to link the spark jar to $HIVE_HOME/lib
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive

On Tue, Nov 29, 2016 at 2:03 AM, Joaquin Alzola 
<joaquin.alz...@lebara.com<mailto:joaquin.alz...@lebara.com>> wrote:
Hi Guys

No matter what I do, when I execute “select count(*) from employee” I get 
the following output in the logs:
It is quite funny because if I put hive.execution.engine=mr the output is 
correct. If I put hive.execution.engine=spark then I get the errors below.
If I do the search directly through spark-shell it works great.
+---+
|_c0|
+---+
|1005635|
+---+
So there has to be a problem from hive to spark.

It seems the RPC(??) connection is not set up…. Can somebody guide me on what 
to look for.
spark.master=spark://172.16.173.31:7077
hive.execution.engine=spark
spark.executor.extraClassPath
/mnt/spark/lib/spark-1.6.2-yarn-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar

Hive 2.0.1 --> Spark 1.6.2 --> Hadoop 2.6.5 --> Scala 2.10

2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher 
(RpcDispatcher.java:handleError(142)) - Received error 
message:io.netty.handler.codec.DecoderException: 
java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:358)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:230)
at 
io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)

Re: Hive on Spark not working

2016-11-29 Thread Mich Talebzadeh
Hive on Spark engine only works with Spark 1.3.1.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 November 2016 at 07:56, Furcy Pin  wrote:

> ClassNotFoundException generally means that jars are missing from your
> class path.
>
> You probably need to link the spark jar to $HIVE_HOME/lib
> https://cwiki.apache.org/confluence/display/Hive/Hive+
> on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive
>
> On Tue, Nov 29, 2016 at 2:03 AM, Joaquin Alzola  > wrote:
>
>> Hi Guys
>>
>>
>>
>> No matter what I do that when I execute “select count(*) from employee” I
>> get the following output on the logs:
>>
>> It is quiet funny because if I put hive.execution.engine=mr the output is
>> correct. If I put hive.execution.engine=spark then I get the bellow errors.
>>
>> If I do the search directly through spark-shell it work great.
>>
>> +---+
>>
>> |_c0|
>>
>> +---+
>>
>> |1005635|
>>
>> +---+
>>
>> So there has to be a problem from hive to spark.
>>
>>
>>
>> Seems as the RPC(??) connection is not setup …. Can somebody guide me on
>> what to look for.
>>
>> spark.master=spark://172.16.173.31:7077
>>
>> hive.execution.engine=spark
>>
>> spark.executor.extraClassPath/mnt/spark/lib/spark-1.6.2-yar
>> n-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar
>>
>>
>>
>> Hive 2.0.1 --> Spark 1.6.2 --> Hadoop 2.6.5 --> Scala 2.10
>>
>>
>>
>> 2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher
>> (RpcDispatcher.java:handleError(142)) - Received error
>> message:io.netty.handler.codec.DecoderException:
>> java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
>>
>> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>> oMessageDecoder.java:358)
>>
>> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>> ToMessageDecoder.java:230)
>>
>> at io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteTo
>> MessageCodec.java:103)
>>
>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:308)
>>
>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:294)
>>
>> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>> annelInboundHandlerAdapter.java:86)
>>
>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:308)
>>
>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:294)
>>
>> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>> ultChannelPipeline.java:846)
>>
>> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.
>> read(AbstractNioByteChannel.java:131)
>>
>> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEven
>> tLoop.java:511)
>>
>> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimiz
>> ed(NioEventLoop.java:468)
>>
>> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEve
>> ntLoop.java:382)
>>
>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>
>> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(
>> SingleThreadEventExecutor.java:111)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/J
>> ob
>>
>> at java.lang.ClassLoader.defineClass1(Native Method)
>>
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>>
>> at java.security.SecureClassLoader.defineClass(SecureClassLoade
>> r.java:142)
>>
>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>
>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> at 

Re: Hive on Spark not working

2016-11-28 Thread Furcy Pin
ClassNotFoundException generally means that jars are missing from your
class path.

You probably need to link the spark jar to $HIVE_HOME/lib
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive
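(A minimal sketch of that linking step, assuming a Spark 1.6.x distribution built without Hive and unpacked under /mnt/spark as in this thread — the exact assembly jar name depends on the build:)

  ln -s /mnt/spark/lib/spark-assembly-*.jar $HIVE_HOME/lib/

Alternatively, point Hive at the Spark installation instead, e.g. by exporting SPARK_HOME (or setting the spark.home property) before starting Hive.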

On Tue, Nov 29, 2016 at 2:03 AM, Joaquin Alzola 
wrote:

> Hi Guys
>
>
>
> No matter what I do that when I execute “select count(*) from employee” I
> get the following output on the logs:
>
> It is quiet funny because if I put hive.execution.engine=mr the output is
> correct. If I put hive.execution.engine=spark then I get the bellow errors.
>
> If I do the search directly through spark-shell it work great.
>
> +---+
>
> |_c0|
>
> +---+
>
> |1005635|
>
> +---+
>
> So there has to be a problem from hive to spark.
>
>
>
> Seems as the RPC(??) connection is not setup …. Can somebody guide me on
> what to look for.
>
> spark.master=spark://172.16.173.31:7077
>
> hive.execution.engine=spark
>
> spark.executor.extraClassPath/mnt/spark/lib/spark-1.6.2-
> yarn-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar
>
>
>
> Hive 2.0.1 --> Spark 1.6.2 --> Hadoop 2.6.5 --> Scala 2.10
>
>
>
> 2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher
> (RpcDispatcher.java:handleError(142)) - Received error
> message:io.netty.handler.codec.DecoderException: 
> java.lang.NoClassDefFoundError:
> org/apache/hive/spark/client/Job
>
> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(
> ByteToMessageDecoder.java:358)
>
> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(
> ByteToMessageDecoder.java:230)
>
> at io.netty.handler.codec.ByteToMessageCodec.channelRead(
> ByteToMessageCodec.java:103)
>
> at io.netty.channel.AbstractChannelHandlerContext.
> invokeChannelRead(AbstractChannelHandlerContext.java:308)
>
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:294)
>
> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(
> ChannelInboundHandlerAdapter.java:86)
>
> at io.netty.channel.AbstractChannelHandlerContext.
> invokeChannelRead(AbstractChannelHandlerContext.java:308)
>
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:294)
>
> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(
> DefaultChannelPipeline.java:846)
>
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(
> AbstractNioByteChannel.java:131)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(
> NioEventLoop.java:511)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(
> NioEventLoop.java:468)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(
> NioEventLoop.java:382)
>
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.
> run(SingleThreadEventExecutor.java:111)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/
> Job
>
> at java.lang.ClassLoader.defineClass1(Native Method)
>
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>
> at java.security.SecureClassLoader.defineClass(
> SecureClassLoader.java:142)
>
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at org.apache.hive.com.esotericsoftware.kryo.util.
> DefaultClassResolver.readName(DefaultClassResolver.java:154)
>
> at org.apache.hive.com.esotericsoftware.kryo.util.
> DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>
> at org.apache.hive.com.esotericsoftware.kryo.Kryo.
> readClass(Kryo.java:670)
>
> at org.apache.hive.com.esotericsoftware.kryo.
> serializers.ObjectField.read(ObjectField.java:118)
>
> at org.apache.hive.com.esotericsoftware.kryo.
> serializers.FieldSerializer.read(FieldSerializer.java:551)
>
> at org.apache.hive.com.esotericsoftware.kryo.Kryo.
> readClassAndObject(Kryo.java:790)
>
> at org.apache.hive.spark.client.rpc.KryoMessageCodec.decode(
> 

Re: Hive on Spark - Mesos

2016-09-15 Thread Mich Talebzadeh
Sorry, on YARN only, but I gather it should work with Mesos. I don't think
that comes into it.

The issue is the compatibility of the Spark assembly library with Hive.
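(For reference, the usual way to get such a build is to create a Spark distribution without the Hive jars — a rough sketch only; the exact Maven profiles depend on the Spark and Hadoop versions, see the Hive on Spark Getting Started page:)

  # from the Spark 1.6.x source tree
  ./make-distribution.sh --name hadoop2-without-hive --tgz \
      -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided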

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 September 2016 at 22:41, John Omernik  wrote:

> Did you run it on Mesos? Your presentation doesn't mention Mesos at all...
>
> John
>
>
> On Thu, Sep 15, 2016 at 4:20 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Yes you can. Hive on Spark meaning Hive using Spark as its execution
>> engine works fine. The version that I managed to make work is any Hive
>> version > 1.2 with Spark 1.3.1.
>>
>> You  need to build Spark from the source code excluding Hive libraries.
>>
>> Check my attached presentation.
>>
>>  HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 15 September 2016 at 22:10, John Omernik  wrote:
>>
>>> Hey all, I was experimenting with some bleeding edge Hive.  (2.1) and
>>> trying to get it to run on bleeding edge Spark (2.0).
>>>
>>> Spark is working fine, I can query the data all is setup, however, I
>>> can't get Hive on Spark to work. I understand it's not really a thing (Hive
>>> on Spark on Mesos) but I am thinking... why not?  Thus I am posting here.
>>> (I.e. is there some reason why this shouldn't work other than it just
>>> hasn't been attempted?)
>>>
>>> The error I am getting is odd.. (see below) not sure why that would pop
>>> up, everything seems right other wise... any help would be appreciated.
>>>
>>> John
>>>
>>>
>>>
>>>
>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>>
>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>>>
>>> at java.security.SecureClassLoader.defineClass(SecureClassLoade
>>> r.java:142)
>>>
>>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>>
>>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>>
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>>
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>>
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>> at java.lang.Class.forName0(Native Method)
>>>
>>> at java.lang.Class.forName(Class.java:348)
>>>
>>> at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>>> $SparkSubmit$$runMain(SparkSubmit.scala:686)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>>> .scala:185)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>>>
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.JavaSparkListener
>>>
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>> ... 20 more
>>>
>>>
>>> at org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcS
>>> erver.java:179)
>>>
>>> at org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClie
>>> ntImpl.java:465)
>>>
>>
>>
>


Re: Hive on Spark - Mesos

2016-09-15 Thread John Omernik
Did you run it on Mesos? Your presentation doesn't mention Mesos at all...

John


On Thu, Sep 15, 2016 at 4:20 PM, Mich Talebzadeh 
wrote:

> Yes you can. Hive on Spark meaning Hive using Spark as its execution
> engine works fine. The version that I managed to make work is any Hive
> version > 1.2 with Spark 1.3.1.
>
> You  need to build Spark from the source code excluding Hive libraries.
>
> Check my attached presentation.
>
>  HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 15 September 2016 at 22:10, John Omernik  wrote:
>
>> Hey all, I was experimenting with some bleeding edge Hive.  (2.1) and
>> trying to get it to run on bleeding edge Spark (2.0).
>>
>> Spark is working fine, I can query the data all is setup, however, I
>> can't get Hive on Spark to work. I understand it's not really a thing (Hive
>> on Spark on Mesos) but I am thinking... why not?  Thus I am posting here.
>> (I.e. is there some reason why this shouldn't work other than it just
>> hasn't been attempted?)
>>
>> The error I am getting is odd.. (see below) not sure why that would pop
>> up, everything seems right other wise... any help would be appreciated.
>>
>> John
>>
>>
>>
>>
>> at java.lang.ClassLoader.defineClass1(Native Method)
>>
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>>
>> at java.security.SecureClassLoader.defineClass(SecureClassLoade
>> r.java:142)
>>
>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>
>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> at java.lang.Class.forName0(Native Method)
>>
>> at java.lang.Class.forName(Class.java:348)
>>
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
>>
>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>> $SparkSubmit$$runMain(SparkSubmit.scala:686)
>>
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>> .scala:185)
>>
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>>
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>>
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.JavaSparkListener
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> ... 20 more
>>
>>
>> at org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcS
>> erver.java:179)
>>
>> at org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClie
>> ntImpl.java:465)
>>
>
>


Re: Hive On Spark - ORC Table - Hive Streaming Mutation API

2016-09-14 Thread Benjamin Schaff
Hi,

Thanks for the answer.

I am running a custom build of Spark 1.6.2, i.e. the one described in the
Hive documentation, built without Hive jars.
I set it up in hive-env.sh.

I created the istari table as in the documentation, ran an INSERT on it,
then a GROUP BY.
Everything ran correctly on the Spark standalone cluster, with no exceptions anywhere.
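(Roughly, that smoke test looked like this — a sketch only; the actual istari schema and data are the ones from the documentation example, so the columns and rows below are just placeholders:)

  set hive.execution.engine=spark;
  create table istari (name string, colour string);
  insert into table istari values ('a', 'x'), ('b', 'y');
  select colour, count(*) from istari group by colour;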

Do you have any other suggestions?

Thanks.

On Wed, 14 Sep 2016 at 13:55, Mich Talebzadeh 
wrote:

> Hi,
>
> You are using Hive 2. What is the Spark version that runs as Hive
> execution engine?
>
> I cannot see spark.home in your hive-site.xml so I cannot figure it out.
>
> BTW you are using Spark standalone as the mode. I tend to use yarn-client.
>
> Now back to the above issue. Do other queries work OK with Hive on Spark?
>
> Some of those perf parameters can be set up in Hive session itself or
> through init file
>
>  set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
> set spark.master=yarn;
> set spark.deploy.mode=client;
> set spark.executor.memory=8g;
> set spark.driver.memory=8g;
> set spark.executor.instances=6;
> set spark.ui.port=;
>
>
> HTH
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 September 2016 at 18:28, Benjamin Schaff 
> wrote:
>
>> Hi,
>>
>> After several days trying to figure out the problem I'm stuck with a
>> class cast exception when running a query with hive on spark on orc tables
>> that I updated with the streaming mutation api of hive 2.0.
>>
>> The context is the following:
>>
>> For hive:
>>
>> The version is the latest available from the website 2.1
>> I created some scala code to insert data into an orc table with the
>> streaming mutation api followed the example provided somewhere in the hive
>> repository.
>>
>> The table looks like that:
>>
>> ++--+
>> |   createtab_stmt   |
>> ++--+
>> | CREATE TABLE `hc__member`( |
>> |   `rdv_core__key` bigint,  |
>> |   `rdv_core__domainkey` string,|
>> |   `rdftypes` array,|
>> |   `rdv_org__firstname` string, |
>> |   `rdv_org__middlename` string,|
>> |   `rdv_org__lastname` string,  |
>> |   `rdv_org__gender` string,|
>> |   `rdv_org__city` string,  |
>> |   `rdv_org__state` string, |
>> |   `rdv_org__countrycode` string,   |
>> |   `rdv_org__addresslabel` string,  |
>> |   `rdv_org__zip` string)   |
>> | CLUSTERED BY ( |
>> |   rdv_core__key)   |
>> | INTO 24 BUCKETS|
>> | ROW FORMAT SERDE   |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'  |
>> | STORED AS INPUTFORMAT  |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'|
>> | OUTPUTFORMAT   |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'   |
>> | LOCATION   |
>> |   'hdfs://hmaster:8020/user/hive/warehouse/hc__member' |
>> | TBLPROPERTIES (|
>> |   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',|
>> |   'compactor.mapreduce.map.memory.mb'='2048',  |
>> |   'compactorthreshold.hive.compactor.delta.num.threshold'='4', |
>> |   'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',   |
>> |   'numFiles'='0',  |
>> |   'numRows'='0',   |
>> |   'rawDataSize'='0',   |
>> |   

Re: Hive On Spark - ORC Table - Hive Streaming Mutation API

2016-09-14 Thread Mich Talebzadeh
Hi,

You are using Hive 2. What is the Spark version that runs as Hive execution
engine?

I cannot see spark.home in your hive-site.xml so I cannot figure it out.

BTW you are using Spark standalone as the mode. I tend to use yarn-client.

Now back to the above issue. Do other queries work OK with Hive on Spark?

Some of those perf parameters can be set up in Hive session itself or
through init file

 set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
set spark.master=yarn;
set spark.deploy.mode=client;
set spark.executor.memory=8g;
set spark.driver.memory=8g;
set spark.executor.instances=6;
set spark.ui.port=;
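(e.g. saved to an init file and picked up when the session starts — the file name below is just an example:)

hive -i /home/hduser/hive_on_spark_init.hql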


HTH








Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 18:28, Benjamin Schaff 
wrote:

> Hi,
>
> After several days trying to figure out the problem I'm stuck with a class
> cast exception when running a query with hive on spark on orc tables that I
> updated with the streaming mutation api of hive 2.0.
>
> The context is the following:
>
> For hive:
>
> The version is the latest available from the website 2.1
> I created some scala code to insert data into an orc table with the
> streaming mutation api followed the example provided somewhere in the hive
> repository.
>
> The table looks like that:
>
> ++--+
> |   createtab_stmt   |
> ++--+
> | CREATE TABLE `hc__member`( |
> |   `rdv_core__key` bigint,  |
> |   `rdv_core__domainkey` string,|
> |   `rdftypes` array,|
> |   `rdv_org__firstname` string, |
> |   `rdv_org__middlename` string,|
> |   `rdv_org__lastname` string,  |
> |   `rdv_org__gender` string,|
> |   `rdv_org__city` string,  |
> |   `rdv_org__state` string, |
> |   `rdv_org__countrycode` string,   |
> |   `rdv_org__addresslabel` string,  |
> |   `rdv_org__zip` string)   |
> | CLUSTERED BY ( |
> |   rdv_core__key)   |
> | INTO 24 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'|
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'   |
> | LOCATION   |
> |   'hdfs://hmaster:8020/user/hive/warehouse/hc__member' |
> | TBLPROPERTIES (|
> |   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',|
> |   'compactor.mapreduce.map.memory.mb'='2048',  |
> |   'compactorthreshold.hive.compactor.delta.num.threshold'='4', |
> |   'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',   |
> |   'numFiles'='0',  |
> |   'numRows'='0',   |
> |   'rawDataSize'='0',   |
> |   'totalSize'='0', |
> |   'transactional'='true',  |
> |   'transient_lastDdlTime'='1473792939')|
> ++--+
>
> The hive site looks like that:
>
> 
>  
> hive.execution.engine
> spark
>   
>   
> spark.master
> spark://hmaster:7077
>   
>   
> spark.eventLog.enabled
> false
>   
>   
> spark.executor.memory
> 12g
>   
>   
> spark.serializer
> org.apache.spark.serializer.KryoSerializer
>   
>   
> 

Re: hive on spark job not start enough executors

2016-09-09 Thread 明浩 冯
All the parameters except spark.executor.instances are specified in 
spark-default.conf located in hive's conf folder.  So I think it's a yes.

I also checked on spark's web page when a hive on spark job is running, the 
parameters shown on the web page are exactly what I specified in the config 
file including spark.shuffle.service.enabled and 
spark.dynamicAllocation.enabled.


Should I specify a fixed executor.instances in the file? But it's not good for 
me.


By the way, the data source of my query is Parquet files. On the Hive side I just 
created an external table over the Parquet files.
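(For reference, a minimal spark-default.conf sketch for this kind of setup — the two "enabled" properties are the ones mentioned above; the executor bounds are illustrative assumptions, not values from this thread:)

spark.shuffle.service.enabled              true
spark.dynamicAllocation.enabled            true
spark.dynamicAllocation.initialExecutors   4
spark.dynamicAllocation.minExecutors       4
spark.dynamicAllocation.maxExecutors       40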



Thanks,

Minghao Feng


From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: Friday, September 9, 2016 4:49:55 PM
To: user
Subject: Re: hive on spark job not start enough executors

when you start hive on spark do you set any parameters for the submitted job 
(or read them from init file)?

set spark.master=yarn;
set spark.deploy.mode=client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.instances=2;
set spark.ui.port=;


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 9 September 2016 at 09:30, 明浩 冯 
<qiuff...@hotmail.com<mailto:qiuff...@hotmail.com>> wrote:

Hi there,


I encountered a problem that makes hive on spark run with very low performance.

I'm using spark 1.6.2 and hive 2.1.0, I specified


spark.shuffle.service.enabledtrue
spark.dynamicAllocation.enabled  true

in my spark-default.conf file (the file is in both spark and hive conf folder) 
to make spark job to get executors dynamically.
The configuration works correctly when I run spark jobs, but when I use hive on 
spark, it only starts a few executors although there are more than enough cores 
and memory to start more executors.
For example, for the same SQL query, if I run on sparkSQL, it can start more 
than 20 executors, but with hive on spark, only 3.

How can I improve the performance on hive on spark? Any suggestions please.

Thanks,
Minghao Feng
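
One avenue worth trying (a sketch only, not a verified fix for this cluster):
keep dynamic allocation on but raise its floor and starting point for the Hive
session, so the job does not have to ramp up from a handful of executors. The
property names are standard Spark ones; the numbers are placeholders to adjust
to the cluster.

set hive.execution.engine=spark;
set spark.dynamicAllocation.enabled=true;
set spark.shuffle.service.enabled=true;
-- placeholders: floor, starting point and ceiling for dynamic allocation
set spark.dynamicAllocation.minExecutors=10;
set spark.dynamicAllocation.initialExecutors=20;
set spark.dynamicAllocation.maxExecutors=40;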




Re: hive on spark job not start enough executors

2016-09-09 Thread Mich Talebzadeh
when you start hive on spark do you set any parameters for the submitted
job (or read them from init file)?

set spark.master=yarn;
set spark.deploy.mode=client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.instances=2;
set spark.ui.port=;

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 September 2016 at 09:30, 明浩 冯  wrote:

> Hi there,
>
>
> I encountered a problem that makes hive on spark run with very low
> performance.
>
> I'm using spark 1.6.2 and hive 2.1.0, I specified
>
>
> spark.shuffle.service.enabledtrue
> spark.dynamicAllocation.enabled  true
>
> in my spark-default.conf file (the file is in both spark and hive conf
> folder) to make spark job to get executors dynamically.
> The configuration works correctly when I run spark jobs, but when I use
> hive on spark, it only starts a few executors although there are more
> than enough cores and memory to start more executors.
> For example, for the same SQL query, if I run on sparkSQL, it can start
> more than 20 executors, but with hive on spark, only 3.
>
> How can I improve the performance on hive on spark? Any suggestions please.
>
> Thanks,
> Minghao Feng
>
>
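
On the "init file" point: the usual way to have such settings applied to every
session is to put them in a .hiverc picked up at startup (e.g. ~/.hiverc for
the Hive CLI). A sketch, echoing the illustrative values shown above:

set hive.execution.engine=spark;
set spark.master=yarn;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.instances=2;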


Re: Hive on spark

2016-08-01 Thread Mich Talebzadeh
Hi,

You can download the pdf from here
<https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf>

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 03:05, Chandrakanth Akkinepalli <
chandrakanth.akkinepa...@gmail.com> wrote:

> Hi Dr.Mich,
> Can you please share your London meetup presentation. Curious to see the
> comparison according to you of various query engines.
>
> Thanks,
> Chandra
>
> On Jul 28, 2016, at 12:13 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi,
>
> I made a presentation in London on 20th July on this subject:. In that I
> explained how to make Spark work as an execution engine for Hive.
>
> Query Engines for Hive, MR, Spark, Tez and LLAP – Considerations
> <http://www.meetup.com/futureofdata-london/events/232423292/>!
>
> See if I can send the presentation
>
> Cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 04:24, Mudit Kumar <mudit.ku...@askme.in> wrote:
>
>> Yes Mich,exactly.
>>
>> Thanks,
>> Mudit
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Reply-To: <user@hive.apache.org>
>> Date: Thursday, July 28, 2016 at 1:08 AM
>> To: user <user@hive.apache.org>
>> Subject: Re: Hive on spark
>>
>> You mean you want to run Hive using Spark as the execution engine which
>> uses Yarn by default?
>>
>>
>> Something like below
>>
>> hive> select max(id) from oraclehadoop.dummy_parquet;
>> Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
>> Query Hive on Spark job[1] stages:
>> 2
>> 3
>> Status: Running (Hive on Spark job[1])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId:
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>> [StageCost]
>> 2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
>> 2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
>> 2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
>> Finished
>> Status: Finished successfully in 13.14 seconds
>> OK
>> 1
>> Time taken: 13.426 seconds, Fetched: 1 row(s)
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 July 2016 at 20:31, Mudit Kumar <mudit.ku...@askme.in> wrote:
>>
>>> Hi All,
>>>
>>> I need to configure hive cluster based on spark engine (yarn).
>>> I already have a running hadoop cluster.
>>>
>>> Can someone point me to relevant documentation?
>>>
>>> TIA.
>>>
>>> Thanks,
>>> Mudit
>>>
>>
>>
>


Re: Hive on spark

2016-07-31 Thread Chandrakanth Akkinepalli
Hi Dr.Mich,
Can you please share your London meetup presentation. Curious to see the 
comparison according to you of various query engines.

Thanks,
Chandra

> On Jul 28, 2016, at 12:13 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Hi,
> 
> I made a presentation in London on 20th July on this subject:. In that I 
> explained how to make Spark work as an execution engine for Hive.
> 
> Query Engines for Hive, MR, Spark, Tez and LLAP – Considerations!
> 
> See if I can send the presentation
> 
> Cheers
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 28 July 2016 at 04:24, Mudit Kumar <mudit.ku...@askme.in> wrote:
>> Yes Mich,exactly.
>> 
>> Thanks,
>> Mudit
>> 
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Reply-To: <user@hive.apache.org>
>> Date: Thursday, July 28, 2016 at 1:08 AM
>> To: user <user@hive.apache.org>
>> Subject: Re: Hive on spark
>> 
>> You mean you want to run Hive using Spark as the execution engine which uses 
>> Yarn by default?
>> 
>> 
>> Something like below
>> 
>> hive> select max(id) from oraclehadoop.dummy_parquet;
>> Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
>> Query Hive on Spark job[1] stages:
>> 2
>> 3
>> Status: Running (Hive on Spark job[1])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId: 
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>> [StageCost]
>> 2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
>> 2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
>> 2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1 
>> Finished
>> Status: Finished successfully in 13.14 seconds
>> OK
>> 1
>> Time taken: 13.426 seconds, Fetched: 1 row(s)
>> 
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 27 July 2016 at 20:31, Mudit Kumar <mudit.ku...@askme.in> wrote:
>>> Hi All,
>>> 
>>> I need to configure hive cluster based on spark engine (yarn).
>>> I already have a running hadoop cluster.
>>> 
>>> Can someone point me to relevant documentation?
>>> 
>>> TIA.
>>> 
>>> Thanks,
>>> Mudit
> 


Re: Hive on spark

2016-07-28 Thread Mudit Kumar
Thanks Guys for the help!

Thanks,
Mudit

From:  Mich Talebzadeh <mich.talebza...@gmail.com>
Reply-To:  <user@hive.apache.org>
Date:  Thursday, July 28, 2016 at 9:43 AM
To:  user <user@hive.apache.org>
Subject:  Re: Hive on spark

Hi,

I made a presentation in London on 20th July on this subject:. In that I 
explained how to make Spark work as an execution engine for Hive.

Query Engines for Hive, MR, Spark, Tez and LLAP – Considerations! 

See if I can send the presentation 

Cheers


Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 

 

On 28 July 2016 at 04:24, Mudit Kumar <mudit.ku...@askme.in> wrote:
Yes Mich,exactly.

Thanks,
Mudit

From:  Mich Talebzadeh <mich.talebza...@gmail.com>
Reply-To:  <user@hive.apache.org>
Date:  Thursday, July 28, 2016 at 1:08 AM
To:  user <user@hive.apache.org>
Subject:  Re: Hive on spark

You mean you want to run Hive using Spark as the execution engine which uses 
Yarn by default?


Something like below

hive> select max(id) from oraclehadoop.dummy_parquet;
Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]
2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1 Finished
Status: Finished successfully in 13.14 seconds
OK
1
Time taken: 13.426 seconds, Fetched: 1 row(s)


HTH

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 

 

On 27 July 2016 at 20:31, Mudit Kumar <mudit.ku...@askme.in> wrote:
Hi All,

I need to configure hive cluster based on spark engine (yarn).
I already have a running hadoop cluster.

Can someone point me to relevant documentation?

TIA.

Thanks,
Mudit





Re: Hive on spark

2016-07-27 Thread Mich Talebzadeh
Hi,

I made a presentation in London on 20th July on this subject. In that I
explained how to make Spark work as an execution engine for Hive.

Query Engines for Hive, MR, Spark, Tez and LLAP – Considerations
<http://www.meetup.com/futureofdata-london/events/232423292/>!

See if I can send the presentation

Cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 04:24, Mudit Kumar <mudit.ku...@askme.in> wrote:

> Yes Mich,exactly.
>
> Thanks,
> Mudit
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Reply-To: <user@hive.apache.org>
> Date: Thursday, July 28, 2016 at 1:08 AM
> To: user <user@hive.apache.org>
> Subject: Re: Hive on spark
>
> You mean you want to run Hive using Spark as the execution engine which
> uses Yarn by default?
>
>
> Something like below
>
> hive> select max(id) from oraclehadoop.dummy_parquet;
> Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
> 2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
> 2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
> 2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
> 2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
> 2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
> 2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
> 2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 13.14 seconds
> OK
> 1
> Time taken: 13.426 seconds, Fetched: 1 row(s)
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 July 2016 at 20:31, Mudit Kumar <mudit.ku...@askme.in> wrote:
>
>> Hi All,
>>
>> I need to configure hive cluster based on spark engine (yarn).
>> I already have a running hadoop cluster.
>>
>> Can someone point me to relevant documentation?
>>
>> TIA.
>>
>> Thanks,
>> Mudit
>>
>
>


Re: Hive on spark

2016-07-27 Thread karthi keyan
mudit,

this link can guide you -
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

Thanks,
Karthik

On Thu, Jul 28, 2016 at 8:54 AM, Mudit Kumar <mudit.ku...@askme.in> wrote:

> Yes Mich,exactly.
>
> Thanks,
> Mudit
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Reply-To: <user@hive.apache.org>
> Date: Thursday, July 28, 2016 at 1:08 AM
> To: user <user@hive.apache.org>
> Subject: Re: Hive on spark
>
> You mean you want to run Hive using Spark as the execution engine which
> uses Yarn by default?
>
>
> Something like below
>
> hive> select max(id) from oraclehadoop.dummy_parquet;
> Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
> 2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
> 2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
> 2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
> 2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
> 2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
> 2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
> 2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 13.14 seconds
> OK
> 1
> Time taken: 13.426 seconds, Fetched: 1 row(s)
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 July 2016 at 20:31, Mudit Kumar <mudit.ku...@askme.in> wrote:
>
>> Hi All,
>>
>> I need to configure hive cluster based on spark engine (yarn).
>> I already have a running hadoop cluster.
>>
>> Can someone point me to relevant documentation?
>>
>> TIA.
>>
>> Thanks,
>> Mudit
>>
>
>
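
As a rough sketch of what the getting-started page above boils down to at the
session level (assuming the Spark jars are already visible to Hive, which the
wiki covers), the switch itself is just configuration; the query is only an
illustrative check that the plan and the job output now reference Spark:

set hive.execution.engine=spark;
set spark.master=yarn;
-- some_table is an illustrative name; the stage plan and the job progress
-- output should now mention Spark rather than MapReduce
explain select count(*) from some_table;
select count(*) from some_table;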


Re: Hive on spark

2016-07-27 Thread Mudit Kumar
Yes Mich,exactly.

Thanks,
Mudit

From:  Mich Talebzadeh <mich.talebza...@gmail.com>
Reply-To:  <user@hive.apache.org>
Date:  Thursday, July 28, 2016 at 1:08 AM
To:  user <user@hive.apache.org>
Subject:  Re: Hive on spark

You mean you want to run Hive using Spark as the execution engine which uses 
Yarn by default?


Something like below

hive> select max(id) from oraclehadoop.dummy_parquet;
Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]
2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1 Finished
Status: Finished successfully in 13.14 seconds
OK
1
Time taken: 13.426 seconds, Fetched: 1 row(s)


HTH

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 

 

On 27 July 2016 at 20:31, Mudit Kumar <mudit.ku...@askme.in> wrote:
Hi All,

I need to configure hive cluster based on spark engine (yarn).
I already have a running hadoop cluster.

Can someone point me to relevant documentation?

TIA.

Thanks,
Mudit




Re: Hive on spark

2016-07-27 Thread Mich Talebzadeh
You mean you want to run Hive using Spark as the execution engine which
uses Yarn by default?


Something like below

hive> select max(id) from oraclehadoop.dummy_parquet;
Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
Finished
Status: Finished successfully in 13.14 seconds
OK
1
Time taken: 13.426 seconds, Fetched: 1 row(s)


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 July 2016 at 20:31, Mudit Kumar  wrote:

> Hi All,
>
> I need to configure hive cluster based on spark engine (yarn).
> I already have a running hadoop cluster.
>
> Can someone point me to relevant documentation?
>
> TIA.
>
> Thanks,
> Mudit
>


Re: Hive on Spark engine

2016-03-26 Thread Mich Talebzadeh
Thanks Ted,

More interested in general availability of Hive 2 on Spark 1.6 engine as
opposed to vendor-specific custom builds.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 26 March 2016 at 23:55, Ted Yu  wrote:

> According to:
>
> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_HDP_RelNotes/bk_HDP_RelNotes-20151221.pdf
>
> Spark 1.5.2 comes out of box.
>
> Suggest moving questions on HDP to Hortonworks forum.
>
> Cheers
>
> On Sat, Mar 26, 2016 at 3:32 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Jorn.
>>
>> Just to be clear they get Hive working with Spark 1.6 out of the box
>> (binary download)? The usual work-around is to build your own package and
>> get the Hadoop-assembly jar file copied over to $HIVE_HOME/lib.
>>
>>
>> Cheers
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 26 March 2016 at 22:08, Jörn Franke  wrote:
>>
>>> If you check the newest Hortonworks distribution then you see that it
>>> generally works. Maybe you can borrow some of their packages. Alternatively
>>> it should be also available in other distributions.
>>>
>>> On 26 Mar 2016, at 22:47, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> I am running Hive 2 and now Spark 1.6.1 but I still do not see any sign
>>> that Hive can utilise a Spark engine higher than 1.3.1
>>>
>>> My understanding was that there was a mismatch on Hadoop assembly Jar
>>> files that prevented Hive from running on Spark using the binary
>>> downloads. I just tried Hive 2 on Spark 1.6 as the execution engine and it
>>> crashed.
>>>
>>> I do not know the development state of this cross-breed but will be very
>>> desirable if we could manage to sort out
>>> this spark-assembly-1.x.1-hadoop2.4.0.jar for once.
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
>>
>


Re: Hive on Spark engine

2016-03-26 Thread Mich Talebzadeh
Thanks Jorn.

Just to be clear they get Hive working with Spark 1.6 out of the box
(binary download)? The usual work-around is to build your own package and
get the Hadoop-assembly jar file copied over to $HIVE_HOME/lib.


Cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 26 March 2016 at 22:08, Jörn Franke  wrote:

> If you check the newest Hortonworks distribution then you see that it
> generally works. Maybe you can borrow some of their packages. Alternatively
> it should be also available in other distributions.
>
> On 26 Mar 2016, at 22:47, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I am running Hive 2 and now Spark 1.6.1 but I still do not see any sign
> that Hive can utilise a Spark engine higher than 1.3.1
>
> My understanding was that there was a mismatch on Hadoop assembly Jar
> files that prevented Hive from running on Spark using the binary
> downloads. I just tried Hive 2 on Spark 1.6 as the execution engine and it
> crashed.
>
> I do not know the development state of this cross-breed but will be very
> desirable if we could manage to sort out
> this spark-assembly-1.x.1-hadoop2.4.0.jar for once.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: Hive on Spark engine

2016-03-26 Thread Jörn Franke
If you check the newest Hortonworks distribution then you see that it generally 
works. Maybe you can borrow some of their packages. Alternatively it should be 
also available in other distributions.

> On 26 Mar 2016, at 22:47, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I am running Hive 2 and now Spark 1.6.1 but I still do not see any sign that 
> Hive can utilise a Spark engine higher than 1.3.1
> 
> My understanding was that there was a mismatch on Hadoop assembly Jar files 
> that prevented Hive from running on Spark using the binary downloads. I 
> just tried Hive 2 on Spark 1.6 as the execution engine and it crashed.
> 
> I do not know the development state of this cross-breed but will be very 
> desirable if we could manage to sort out this 
> spark-assembly-1.x.1-hadoop2.4.0.jar for once.
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  


Re: Hive on Spark performance

2016-03-14 Thread sjayatheertha
Thanks for your response. We were evaluating Spark and were curious to know how 
it is used today and the lowest latency it can provide. 

> On Mar 14, 2016, at 8:37 AM, Mich Talebzadeh  
> wrote:
> 
> Hi Wlodeck,
> 
> Let us look at this.
> 
> In Oracle I have two tables channels and sales. This code works in Oracle
> 
>   1  select c.channel_id, sum(c.channel_id * (select count(1) from sales s 
> WHERE c.channel_id = s.channel_id)) As R
>   2  from channels c
>   3* group by c.channel_id
> s...@mydb.mich.LOCAL> /
> CHANNEL_ID  R
> -- --
>  2 516050
>  31620984
>  4 473664
>  5  0
>  9  18666
> 
> I have the same tables In Hive but the same query crashes!
> 
> hive> select c.channel_id, sum(c.channel_id * (select count(1) from sales s 
> WHERE c.channel_id = s.channel_id)) As R
> > from channels c
> > group by c.channel_id
> > ;
> NoViableAltException(232@[435:1: precedenceEqualExpression : ( ( LPAREN 
> precedenceBitwiseOrExpression COMMA )=> precedenceEqualExpressionMutiple | 
> precedenceEqualExpressionSingle );])
> 
> The solution is to use a temporary table to keep the sum/group by from sales 
> table as an intermediate stage  (temporary tables are session specific and 
> they are created and dropped after you finish the session)
> 
> hive> create temporary table tmp as select channel_id, count(channel_id) as 
> total from sales group by channel_id;
> 
> 
> Ok the rest is pretty easy
> 
> hive> select c.channel_id, c.channel_id * t.total as results
> > from channels c, tmp t
> > where c.channel_id = t.channel_id;
> 
> 2.0 2800432.0
> 3.0 8802300.0
> 4.0 2583552.0
> 9.0 104013.0
> 
> HTH
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 14 March 2016 at 14:22, ws  wrote:
>> Hive 1.2.1.2.3.4.0-3485
>> Spark 1.5.2
>> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
>> 
>> ### 
>> SELECT 
>>  f.description,
>>  f.item_number,
>>  sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = 
>> r.h_id)) as df_a
>> FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
>> where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND 
>> f.item_number NOT LIKE 'DEV%'
>> group by 
>>  f.description,
>>  f.item_number
>> ###
>> 
>> This query works fine in oracle but not Hive or Spark.
>> So the problem is: "sum(f.df_a * (select count(1) from e.mv_A_h_a where 
>> hb_h_name = r.h_id)) as df_a" field.
>> 
>> 
>> Thanks,
>> Wlodek
>> --
>> 
>> 
>> On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh 
>>  wrote:
>> 
>> 
>> Depending on the version of Hive on Spark engine.
>> 
>> As far as I am aware the latest version of Hive that I am using (Hive 2) has 
>> improvements compared to the previous versions of Hive (0.14,1.2.1) on Spark 
>> engine.
>> 
>> As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it is 
>> not the latest Spark but it is pretty good.
>> 
>> What specific concerns do you have in mind?
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>> On 13 March 2016 at 23:27, sjayatheertha  wrote:
>> Just curious if you could share your experience on the performance of spark 
>> in your company? How much data do you process? And what's the latency you 
>> are getting with spark engine?
>> 
>> Vidya
> 


Re: Hive on Spark performance

2016-03-14 Thread Mich Talebzadeh
Hi Wlodeck,

Let us look at this.

In Oracle I have two tables channels and sales. This code works in Oracle

  1  select c.channel_id, sum(c.channel_id * (select count(1) from sales s
WHERE c.channel_id = s.channel_id)) As R
  2  from channels c
  3* group by c.channel_id
s...@mydb.mich.LOCAL> /
CHANNEL_ID  R
-- --
 2 516050
 31620984
 4 473664
 5  0
 9  18666

I have the same tables In Hive but the same query crashes!

hive> select c.channel_id, sum(c.channel_id * (select count(1) from sales s
WHERE c.channel_id = s.channel_id)) As R
> from channels c
> group by c.channel_id
> ;
NoViableAltException(232@[435:1: precedenceEqualExpression : ( ( LPAREN
precedenceBitwiseOrExpression COMMA )=> precedenceEqualExpressionMutiple |
precedenceEqualExpressionSingle );])

The solution is to use a temporary table to keep the sum/group by from
sales table as an intermediate stage  (temporary tables are session
specific and they are created and dropped after you finish the session)

hive> create temporary table tmp as select channel_id, count(channel_id) as
total from sales group by channel_id;


Ok the rest is pretty easy

hive> select c.channel_id, c.channel_id * t.total as results
> from channels c, tmp t
> where c.channel_id = t.channel_id;

2.0 2800432.0
3.0 8802300.0
4.0 2583552.0
9.0 104013.0

HTH







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 14 March 2016 at 14:22, ws  wrote:

> Hive 1.2.1.2.3.4.0-3485
> Spark 1.5.2
> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit
> Production
>
> ###
> SELECT
> f.description,
> f.item_number,
> sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id))
> as df_a
> FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
> where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND
> f.item_number NOT LIKE 'DEV%'
> group by
> f.description,
> f.item_number
> ###
>
> This query works fine in oracle but not Hive or Spark.
> So the problem is: "sum(f.df_a * (select count(1) from e.mv_A_h_a where
> hb_h_name = r.h_id)) as df_a" field.
>
>
> Thanks,
> Wlodek
> --
>
>
> On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Depending on the version of Hive on Spark engine.
>
> As far as I am aware the latest version of Hive that I am using (Hive 2)
> has improvements compared to the previous versions of Hive (0.14,1.2.1) on
> Spark engine.
>
> As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it
> is not the latest Spark but it is pretty good.
>
> What specific concerns do you have in mind?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 13 March 2016 at 23:27, sjayatheertha  wrote:
>
> Just curious if you could share your experience on the performance of
> spark in your company? How much data do you process? And what's the latency
> you are getting with spark engine?
>
> Vidya
>
>
>
>
>
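
For what it is worth, the same workaround can be collapsed into a single
statement by pre-aggregating sales in the FROM clause and joining to it. A
sketch only, untested here, using nothing beyond plain joins and group by; the
LEFT JOIN plus coalesce should also keep channels with no sales (channel 5 in
the Oracle output above), which the inner-join version drops:

select c.channel_id,
       c.channel_id * coalesce(t.total, 0) as results
from channels c
left join (select channel_id, count(1) as total
           from sales
           group by channel_id) t
on c.channel_id = t.channel_id;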


Re: Hive on Spark performance

2016-03-14 Thread ws
Hive 1.2.1.2.3.4.0-3485
Spark 1.5.2
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

###
SELECT
 f.description,
 f.item_number,
 sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id)) as df_a
FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND 
f.item_number NOT LIKE 'DEV%'
group by
 f.description,
 f.item_number
###

This query works fine in oracle but not Hive or Spark.
So the problem is: "sum(f.df_a * (select count(1) from e.mv_A_h_a where 
hb_h_name = r.h_id)) as df_a" field.

Thanks,
Wlodek
--

On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh 
 wrote:


Depending on the version of Hive on Spark engine.
As far as I am aware the latest version of Hive that I am using (Hive 2) has 
improvements compared to the previous versions of Hive (0.14,1.2.1) on Spark 
engine.
As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it is not 
the latest Spark but it is pretty good.
What specific concerns do you have in mind?
HTH

Dr Mich Talebzadeh

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com
On 13 March 2016 at 23:27, sjayatheertha  wrote:

Just curious if you could share your experience on the performance of spark in 
your company? How much data do you process? And what's the latency you are 
getting with spark engine?

Vidya



  

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Edward Capriolo
educe = 0%,
> Cumulative CPU 36.67 sec
>
> INFO  : 2016-02-03 21:26:08,310 Stage-3 map = 47%,  reduce = 0%,
> Cumulative CPU 38.78 sec
>
> INFO  : 2016-02-03 21:26:11,408 Stage-3 map = 52%,  reduce = 0%,
> Cumulative CPU 40.7 sec
>
> INFO  : 2016-02-03 21:26:14,512 Stage-3 map = 56%,  reduce = 0%,
> Cumulative CPU 42.69 sec
>
> INFO  : 2016-02-03 21:26:17,607 Stage-3 map = 60%,  reduce = 0%,
> Cumulative CPU 44.69 sec
>
> INFO  : 2016-02-03 21:26:20,722 Stage-3 map = 64%,  reduce = 0%,
> Cumulative CPU 46.83 sec
>
> INFO  : 2016-02-03 21:26:22,787 Stage-3 map = 100%,  reduce = 0%,
> Cumulative CPU 48.46 sec
>
> INFO  : 2016-02-03 21:26:29,030 Stage-3 map = 100%,  reduce = 100%,
> Cumulative CPU 50.01 sec
>
> INFO  : MapReduce Total cumulative CPU time: 50 seconds 10 msec
>
> INFO  : Ended Job = job_1454534517374_0002
>
> ++-+-+--+
>
> | t.calendar_month_desc  | c.channel_desc  | totalsales  |
>
> ++-+-+--+
>
> ++-+-+--+
>
> 150 rows selected (85.67 seconds)
>
>
>
> *3)**Spark on Hive engine completes in 267 sec*
>
> spark-sql> SELECT t.calendar_month_desc, c.channel_desc,
> SUM(s.amount_sold) AS TotalSales
>
>  > FROM sales s, times t, channels c
>
>  > WHERE s.time_id = t.time_id
>
>  > AND   s.channel_id = c.channel_id
>
>  > GROUP BY t.calendar_month_desc, c.channel_desc
>
>  > ;
>
> Time taken: 267.138 seconds, Fetched 150 row(s)
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 03 February 2016 16:21
> *To:* user@hive.apache.org
> *Subject:* RE: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> OK thanks. These are my new ENV settings based upon the availability of
> resources
>
>
>
> export SPARK_EXECUTOR_CORES=12 ##, Number of cores for the workers
> (Default: 1).
>
> export SPARK_EXECUTOR_MEMORY=5G ## , Memory per Worker (e.g. 1000M, 2G)
> (Default: 1G)
>
> export SPARK_DRIVER_MEMORY=2G ## , Memory for Master (e.g. 1000M, 2G)
> (Default: 512 Mb)
>
>
>
> These are the new runs after these settings:
>
>
>
> *Spark on Hive (3 consecutive runs)*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 47.987 seconds, Fetched 3 row(s)
>
>
>
> Around 48 seconds
>
>
>
> *Hive on Spark 1.3.1*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default>  select * from dummy where id in
> (1, 5, 10);
>
> INFO  :
>
> Query Hive on Spark job[2] stages:
>
> INFO  : 2
>
> INFO  :
>
> Status: Running (Hive on Spark job[2])
>
>

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Koert Kuipers
fair enough

On Thu, Feb 4, 2016 at 12:41 PM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

> Hive is not the correct tool for every problem. Use the tool that makes
> the most sense for your problem and your experience.
>
> Many people like hive because it is generally applicable. In my case study
> for the hive book I highlighted many smart, capable organizations that use hive.
>
> Your argument is totally valid. You like X better because X works for you.
> You don't need to 'preach' here; we all know hive has its limits.
>
> On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Is the sky the limit? I know udfs can be used inside hive, like lambdas
>> basically i assume, and i will assume you have something similar for
>> aggregations. But that's just abstractions inside a single map or reduce
>> phase, pretty low level stuff. What you really need is abstractions around
>> many map and reduce phases, because that is the level an algo is expressed
>> at.
>>
>> For example when doing logistic regression you want to be able to do
>> something like:
>> read("somefile").train(settings).write("model")
>> Here train is an externally defined method that is well tested and could
>> do many map and reduce steps internally (or even be defined at a higher
>> level and compile into those steps). What is the equivalent in hive? Copy
>> pasting crucial parts of the algo around while using udfs is just not the
>> same thing in terms of reusability and abstraction. It's the opposite of
>> keeping it DRY.
>> On Feb 3, 2016 1:06 AM, "Ryan Harris" <ryan.har...@zionsbancorp.com>
>> wrote:
>>
>>> https://github.com/myui/hivemall
>>>
>>>
>>>
>>> as long as you are comfortable with java UDFs, the sky is really the
>>> limit...it's not for everyone and spark does have many advantages, but they
>>> are two tools that can complement each other in numerous ways.
>>>
>>>
>>>
>>> I don't know that there is necessarily a universal "better" for how to
>>> use spark as an execution engine (or if spark is necessarily the **best**
>>> execution engine for any given hive job).
>>>
>>>
>>>
>>> The reality is that once you start factoring in the numerous tuning
>>> parameters of the systems and jobs there probably isn't a clear answer.
>>> For some queries, the Catalyst optimizer may do a better job...is it going
>>> to do a better job with ORC based data? less likely IMO.
>>>
>>>
>>>
>>> *From:* Koert Kuipers [mailto:ko...@tresata.com]
>>> *Sent:* Tuesday, February 02, 2016 9:50 PM
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>
>>>
>>>
>>> yeah but have you ever seen someone write a real analytical program in
>>> hive? how? where are the basic abstractions to wrap up a large amount of
>>> operations (joins, groupby's) into a single function call? where are the
>>> tools to write nice unit test for that?
>>>
>>> for example in spark i can write a DataFrame => DataFrame that
>>> internally does many joins, groupBys and complex operations. all unit
>>> tested and perfectly re-usable. and in hive? copy paste round sql queries?
>>> thats just dangerous.
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>
>>> Hive has numerous extension points, you are not boxed in by a long shot.
>>>
>>>
>>>
>>> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> uuuhm with spark using Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just being
>>> boxed into some version of sql and limited udfs?
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>>
>>> When comparing the performance, you need to do it apple vs apple. In
>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>> SQL. However, you configured Hive such that only two tasks can run in
>>> parallel. However, you didn't provide information on how much Spark SQL is
>>> utilizing. Thus, it's hard to tell whether it's just a configuration
>>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>>>
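
A sketch of the kind of like-for-like setup Xuefu describes above (all numbers
are placeholders): give the Hive on Spark session roughly the same resources
the spark-sql session gets before timing the two, otherwise the comparison
mostly measures configuration rather than the engines.

set hive.execution.engine=spark;
set spark.master=yarn;
set spark.executor.instances=20;
set spark.executor.cores=4;
set spark.executor.memory=5g;
set spark.driver.memory=2g;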

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Edward Capriolo
Hive is not the correct tool for every problem. Use the tool that makes the
most sense for your problem and your experience.

Many people like hive because it is generally applicable. In my case study
for the hive book I highlighted many smart, capable organizations that use hive.

Your argument is totally valid. You like X better because X works for you.
You don't need to 'preach' here; we all know hive has its limits.

On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers <ko...@tresata.com> wrote:

> Is the sky the limit? I know udfs can be used inside hive, like lambdas
> basically i assume, and i will assume you have something similar for
> aggregations. But that's just abstractions inside a single map or reduce
> phase, pretty low level stuff. What you really need is abstractions around
> many map and reduce phases, because that is the level an algo is expressed
> at.
>
> For example when doing logistic regression you want to be able to do
> something like:
> read("somefile").train(settings).write("model")
> Here train is an externally defined method that is well tested and could do
> many map and reduce steps internally (or even be defined at a higher level
> and compile into those steps). What is the equivalent in hive? Copy pasting
> crucial parts of the algo around while using udfs is just not the same
> thing in terms of reusability and abstraction. It's the opposite of keeping
> it DRY.
> On Feb 3, 2016 1:06 AM, "Ryan Harris" <ryan.har...@zionsbancorp.com>
> wrote:
>
>> https://github.com/myui/hivemall
>>
>>
>>
>> as long as you are comfortable with java UDFs, the sky is really the
>> limit...it's not for everyone and spark does have many advantages, but they
>> are two tools that can complement each other in numerous ways.
>>
>>
>>
>> I don't know that there is necessarily a universal "better" for how to
>> use spark as an execution engine (or if spark is necessarily the **best**
>> execution engine for any given hive job).
>>
>>
>>
>> The reality is that once you start factoring in the numerous tuning
>> parameters of the systems and jobs there probably isn't a clear answer.
>> For some queries, the Catalyst optimizer may do a better job...is it going
>> to do a better job with ORC based data? less likely IMO.
>>
>>
>>
>> *From:* Koert Kuipers [mailto:ko...@tresata.com]
>> *Sent:* Tuesday, February 02, 2016 9:50 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>
>>
>>
>> yeah but have you ever seen someone write a real analytical program in
>> hive? how? where are the basic abstractions to wrap up a large amount of
>> operations (joins, groupby's) into a single function call? where are the
>> tools to write nice unit test for that?
>>
>> for example in spark i can write a DataFrame => DataFrame that internally
>> does many joins, groupBys and complex operations. all unit tested and
>> perfectly re-usable. and in hive? copy paste round sql queries? thats just
>> dangerous.
>>
>>
>>
>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>> Hive has numerous extension points, you are not boxed in by a long shot.
>>
>>
>>
>> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> uuuhm with spark using Hive metastore you actually have a real
>> programming environment and you can write real functions, versus just being
>> boxed into some version of sql and limited udfs?
>>
>>
>>
>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>
>> When comparing the performance, you need to do it apple vs apple. In
>> another thread, you mentioned that Hive on Spark is much slower than Spark
>> SQL. However, you configured Hive such that only two tasks can run in
>> parallel. However, you didn't provide information on how much Spark SQL is
>> utilizing. Thus, it's hard to tell whether it's just a configuration
>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>> see the resource usage in YARN resource manage URL.
>>
>> --Xuefu
>>
>>
>>
>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk>
>> wrote:
>>
>> Thanks Jeff.
>>
>>
>>
>> Obviously Hive is much more feature rich compared to Spark. Having said
>> that in certain areas for example where the SQL feature is available in
>> Spark, Spark seems to deliver faster.
>>
>>
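
To make the "extension points" side of this exchange concrete, the mechanics
Edward and the hivemall link refer to are the standard UDF hooks; a minimal
sketch, where the jar path, class name, function name and table are all
placeholders rather than anything from hivemall itself:

ADD JAR hdfs:///tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_score AS 'com.example.hive.MyScoreUDF';
SELECT my_score(features) FROM training_data;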

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Mich Talebzadeh
Hi Edward,

 

There is another angle to it as well. Fit for purpose.

 

We are currently migrating from a proprietary DW on SAN to Hive on JBOD. It is 
going smoothly. It will save us $$ in licensing fees in times where technology 
and storage dollars are at a premium.

 

Our DBAs who look after Oracle, SAP ASEs and others are comfortable with Hive. 
They can look after the metastore (on Oracle) and are working with me on HA for 
the metastore and HiveServer2, in line with the standard for other databases.

 

I am sure if we had started with Spark, that would have worked too, but what the 
heck. We have MongoDB as well, independent of HDFS.

 

These arguments about what is better or worse are the ones we have had for years 
about Oracle, Sybase, MSSQL etc. I believe Hive is better for us because I 
think in Hive. If I were more familiar with Spark, I am sure it would have been 
the opposite.

 

We can go in circles. Religious arguments really.

 

 

HTH,

 

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: 04 February 2016 17:41
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

Hive is not the correct tool for every problem. Use the tool that makes the 
most sense for your problem and your experience. 

 

Many people like hive because it is generally applicable. In my case study for 
the hive book I highlighted many smart, capable organizations that use hive. 

Your argument is totally valid. You like X better because X works for you. You 
don't need to 'preach' here; we all know hive has its limits. 

 

On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers <ko...@tresata.com 
<mailto:ko...@tresata.com> > wrote:

Is the sky the limit? I know udfs can be used inside hive, like lambdas 
basically i assume, and i will assume you have something similar for 
aggregations. But that's just abstractions inside a single map or reduce phase, 
pretty low level stuff. What you really need is abstractions around many map 
and reduce phases, because that is the level an algo is expressed at.

For example when doing logistic regression you want to be able to do something 
like:
read("somefile").train(settings).write("model")
Here train is an externally defined method that is well tested and could do many 
map and reduce steps internally (or even be defined at a higher level and 
compile into those steps). What is the equivalent in hive? Copy pasting crucial 
parts of the algo around while using udfs is just not the same thing in terms 
of reusability and abstraction. It's the opposite of keeping it DRY.

On Feb 3, 2016 1:06 AM, "Ryan Harris" <ryan.har...@zionsbancorp.com 
<mailto:ryan.har...@zionsbancorp.com> > wrote:

https://github.com/myui/hivemall

 

as long as you are comfortable with java UDFs, the sky is really the 
limit...it's not for everyone and spark does have many advantages, but they are 
two tools that can complement each other in numerous ways.

 

I don't know that there is necessarily a universal "better" for how to use 
spark as an execution engine (or if spark is necessarily the *best* execution 
engine for any given hive job).

 

The reality is that once you start factoring in the numerous tuning parameters 
of the systems and jobs there probably isn't a clear answer.  For some queries, 
the Catalyst optimizer may do a better job...is it going to do a better job 
with ORC based data? less likely IMO. 

 

From: Koert Kuipers [mailto:ko...@tresata.com <mailto:ko...@tresata.com> ] 
Sent: Tuesday, February 02, 2016 9:50 PM
To: user@hive.apa

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Koert Kuipers
Is the sky the limit? I know udfs can be used inside hive, like lambdas
basically i assume, and i will assume you have something similar for
aggregations. But that's just abstractions inside a single map or reduce
phase, pretty low level stuff. What you really need is abstractions around
many map and reduce phases, because that is the level an algo is expressed
at.

For example when doing logistic regression you want to be able to do
something like:
read("somefile").train(settings).write("model")
Here train is an externally defined method that is well tested and could do
many map and reduce steps internally (or even be defined at a higher level
and compile into those steps). What is the equivalent in hive? Copy pasting
crucial parts of the algo around while using udfs is just not the same
thing in terms of reusability and abstraction. It's the opposite of keeping
it DRY.
On Feb 3, 2016 1:06 AM, "Ryan Harris" <ryan.har...@zionsbancorp.com> wrote:

> https://github.com/myui/hivemall
>
>
>
> as long as you are comfortable with java UDFs, the sky is really the
> limit...it's not for everyone and spark does have many advantages, but they
> are two tools that can complement each other in numerous ways.
>
>
>
> I don't know that there is necessarily a universal "better" for how to use
> spark as an execution engine (or if spark is necessarily the **best**
> execution engine for any given hive job).
>
>
>
> The reality is that once you start factoring in the numerous tuning
> parameters of the systems and jobs there probably isn't a clear answer.
> For some queries, the Catalyst optimizer may do a better job...is it going
> to do a better job with ORC based data? less likely IMO.
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* Tuesday, February 02, 2016 9:50 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> yeah but have you ever seen someone write a real analytical program in
> hive? how? where are the basic abstractions to wrap up a large amount of
> operations (joins, groupby's) into a single function call? where are the
> tools to write nice unit test for that?
>
> for example in spark i can write a DataFrame => DataFrame that internally
> does many joins, groupBys and complex operations. all unit tested and
> perfectly re-usable. and in hive? copy paste round sql queries? thats just
> dangerous.
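
A minimal sketch of what that unit testing can look like (ScalaTest is assumed to be on the classpath; the function, class names and data are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.scalatest.FunSuite

    // Hypothetical transform under test: one function wrapping a groupBy.
    object Transforms {
      def latestVersionPerId(df: DataFrame): DataFrame =
        df.groupBy("id").max("version")
    }

    class TransformsSpec extends FunSuite {
      test("keeps the highest version per id") {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._
        val in  = Seq((1, 1), (1, 3), (2, 2)).toDF("id", "version")
        val out = Transforms.latestVersionPerId(in)
          .collect().map(r => r.getInt(0) -> r.getInt(1)).toMap
        assert(out === Map(1 -> 3, 2 -> 2))
        sc.stop()
      }
    }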
>
>
>
> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
> Hive has numerous extension points, you are not boxed in by a long shot.
>
>
>
> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> se

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Elliot West
YAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>>>> xx |
>>>>>
>>>>> | 10| 99   | 999  | 188
>>>>> | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>>>> xx |
>>>>>
>>>>>
>>>>> +---+--+--+---+-+-++--+
>>>>>
>>>>> 3 rows selected (76.835 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>>>> (1, 5, 10);
>>>>>
>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>>
>>>>>
>>>>> +---+--+--+---+-+-++--+
>>>>>
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>>>>> | dummy.random_string | dummy.small_vc  |
>>>>> dummy.padding  |
>>>>>
>>>>>
>>>>> +---+--+--+---+-+-++--+
>>>>>
>>>>> | 1 | 0| 0| 63
>>>>> | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>>>>> xx |
>>>>>
>>>>> | 5 | 0| 4| 31
>>>>> | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>>>> xx |
>>>>>
>>>>> | 10| 99   | 999  | 188
>>>>> | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>>>> xx |
>>>>>
>>>>>
>>>>> +---+--+--+---+-+-++--+
>>>>>
>>>>> 3 rows selected (80.718 seconds)
>>>>>
>>>>>
>>>>>
>>>>> Three runs returning the same rows in 80 seconds.
>>>>>
>>>>>
>>>>>
>>>>> It is possible that My Spark engine with Hive is 1.3.1 which is out of
>>>>> date and that causes this lag.
>>>>>
>>>>>
>>>>>
>>>>> There are certain queries that one cannot do with Spark. Besides it
>>>>> does not recognize CHAR fields which is a pain.
>>>>>
>>>>>
>>>>>
>>>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>>>
>>>>>  > SELECT t.calendar_month_desc, c.channel_desc,
>>>>> SUM(s.amount_sold) AS TotalSales
>>>>>
>>>>>  > FROM sales s, times t, channels c
>>>>>
>>>>>  > WHERE s.time_id = t.time_id
>>>>>
>>>>>  > AND   s.channel_id = c.channel_id
>>>>>
>>>>>  > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>>
>>>>>  > ;
>>>>>
>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>>>
>>>>> .
>>>>>
>>>>> You are likely trying to use an unsupported Hive feature.";
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>>
>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>>
>>>>>
>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>>
>>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>>
>>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>>> 978-0-9759693-0-4*
>>>>>
>>>>> *Publications due shortly:*
>>>>>
>>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>>> 978-0-9563693-3-8
>>>>>
>>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, 
>>>>> volume
>>>>> one out shortly
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
>>>>> *Sent:* 02 February 2016 23:12
>>>>> *To:* user@hive.apache.org
>>>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>>
>>>>>
>>>>>
>>>>> I think the diff is not only about which does optimization but more on
>>>>> feature parity. Hive on Spark offers all functional features that Hive
>>>>> offers and these features play out faster. However, Spark SQL is far from
>>>>> offering this parity as far as I know.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> My understanding is that with Hive on Spark engine, one gets the Hive
>>>>> optimizer and Spark query engine
>>>>>
>>>>>
>>>>>
>>>>> With spark using Hive metastore, Spark does both the optimization and
>>>>> query engine. The only value add is that one can access the underlying 
>>>>> Hive
>>>>> tables from spark-sql etc
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Is this assessment correct?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>>
>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>>
>>>>>
>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>>
>>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>>
>>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>>> 978-0-9759693-0-4*
>>>>>
>>>>> *Publications due shortly:*
>>>>>
>>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>>> 978-0-9563693-3-8
>>>>>
>>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, 
>>>>> volume
>>>>> one out shortly
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
>>
>
>


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Mich Talebzadeh
Hi Jeff,

 

I only have a two-node cluster. Is there any way one can simulate additional 
parallel runs in such an environment, thus having more than two maps?

 

thanks

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
Sent: 03 February 2016 02:39
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

Yes, regardless what spark mode you're running in, from Spark AM webui, you 
should be able to see how many task are concurrently running. I'm a little 
surprised to see that your Hive configuration only allows 2 map tasks to run in 
parallel. If your cluster has the capacity, you should parallelize all the 
tasks to achieve optimal performance. Since I don't know your Spark SQL 
configuration, I cannot tell how much parallelism you have over there. Thus, 
I'm not sure if your comparison is valid.

--Xuefu
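
For reference, the knobs that control that parallelism can be set when the Spark context is created; a minimal Scala sketch follows (the numbers are only illustrative for a small two-node cluster, Spark on YARN with Hive support is assumed, and for Hive on Spark the same spark.* properties can instead be set in the Hive session or hive-site.xml):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Several smaller executors instead of one big one per node gives more
    // concurrent tasks even on two machines.
    val conf = new SparkConf()
      .setAppName("parallelism-check")
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "2")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val hc = new HiveContext(sc)

    // Reduce-side parallelism for joins/aggregations (Spark SQL default is 200).
    hc.setConf("spark.sql.shuffle.partitions", "16")
    hc.sql("SELECT count(*) FROM dummy").show()

The number of map-side tasks is still driven by the input splits of the table being read, so a very small table will not fan out further no matter what is set here.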

 

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Hi Jeff,

 

In below

 

…. You should be able to see the resource usage in YARN resource manage URL.

 

Just to be clear we are talking about Port 8088/cluster?

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Koert Kuipers [mailto:ko...@tresata.com <mailto:ko...@tresata.com> ] 
Sent: 03 February 2016 00:09

To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

 

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com 
<mailto:xzh...@cloudera.com> > wrote:

When comparing the performance, you need to do it apple vs apple. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel. 
However, you didn't provide information on how much Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or Spark SQL is indeed faster. You should be able to see the resource usage in 
YARN resource manage URL.

--Xuefu

 

On Tue, Feb 2, 2016 at 3:31 PM, 

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Mich Talebzadeh
Hi all,

 

 

Thanks for all the comments on this thread.

 

The question I put was simply to clarify, technically, the two approaches: Spark 
using the Hive metastore versus Hive using the Spark engine.

 

The fact that we have both the benefits of Hive and Spark is tremendous. They 
both offer in their own way many opportunities.

 

Hive is billed as a Data Warehouse (DW) on HDFS. In that respect it does a 
good job. Among other things, it allows the many developers who are familiar with 
SQL to be productive immediately. This should not be underestimated. You can set up 
a copy of an RDBMS table in Hive in no time and use Sqoop to get the table data 
into the Hive table with practically one command. For many this is the great 
attraction of Hive, which can be summarised as: 

 

* Leverage existing SQL skills on Big Data. 

* You have a choice of metastore for Hive including MySql, Oracle, 
Sybase and others. 

* You have a choice of plug ins for your engine (MR, Spark, Tez)

* Ability to do real time analytics on Hive by sending real time 
transactional movements from RDBMS tables to Hive via the existing replication 
technologies. This is very useful

* Use Sqoop to push data back to DW or RDBMS table

 

One can argue that in a DW the speed is not necessarily the overriding factor. 
It does not matter whether a job finishes in two hours or 2.5 hours. Granted 
some commercial DW solutions can do the job much faster but at what cost in 
terms of multiplexing and paying the licensing fees. Hive is an attractive 
proposition here. 

 

I can live with most of Hive shortcomings but would like to see the following:

 

* Hive has the ability to create multiple EXTERNAL index types on 
columns, but they are never used. It would be great if the optimizer could 
actually be made to use them; that would speed up processing

* It would be awesome to have some dialect of isql/PL-SQL capabilities that 
allows local variables, conditional statements etc. to be used in Hive, much like 
other DWs, without resorting to shell scripting, Pig and other tools 

 

 

Spark is great, especially for those familiar with Scala and other languages (an 
additional skill set) who can leverage the Spark shell. However, it again comes at 
the price of needing enough available memory, which is not always the case. Point 
queries are great; however, if you bring back tons of rows then performance 
degrades as it has to spill to disk. 

 

The Big Data space is getting crowded with a lot of products and auxiliary 
products. I can see the potential of Spark for more exploratory work. Having 
said that, in fairness, Hive as a Data Warehouse does what it says on the tin.

 

 

Thanks again

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: 03 February 2016 09:25
To: user@hive.apache.org
Subject: RE: Hive on Spark Engine versus Spark using Hive metastore

 

Hi Jeff,

 

I only have a two-node cluster. Is there any way one can simulate additional 
parallel runs in such an environment, thus having more than two maps?

 

thanks

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publi

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Mich Talebzadeh
ted (85.67 seconds)

 

3)Spark on Hive engine completes in 267 sec

spark-sql> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS 
TotalSales

 > FROM sales s, times t, channels c

 > WHERE s.time_id = t.time_id

 > AND   s.channel_id = c.channel_id

 > GROUP BY t.calendar_month_desc, c.channel_desc

 > ;

Time taken: 267.138 seconds, Fetched 150 row(s)
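
For comparison, the same aggregation expressed through the DataFrame API rather than SQL; a sketch only, assuming a HiveContext named sqlContext (as in spark-shell) that can see the sales, times and channels tables:

    import org.apache.spark.sql.functions.sum

    val sales    = sqlContext.table("sales")
    val times    = sqlContext.table("times")
    val channels = sqlContext.table("channels")

    val totalSales = sales
      .join(times, sales("time_id") === times("time_id"))
      .join(channels, sales("channel_id") === channels("channel_id"))
      .groupBy(times("calendar_month_desc"), channels("channel_desc"))
      .agg(sum(sales("amount_sold")).alias("TotalSales"))

    totalSales.show(150)

Either form goes through the same Catalyst planner in Spark 1.5.x; only the front end differs.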

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: 03 February 2016 16:21
To: user@hive.apache.org
Subject: RE: Hive on Spark Engine versus Spark using Hive metastore

 

OK thanks. These are my new ENV settings based upon the availability of 
resources

 

export SPARK_EXECUTOR_CORES=12 ##, Number of cores for the workers (Default: 1).

export SPARK_EXECUTOR_MEMORY=5G ## , Memory per Worker (e.g. 1000M, 2G) 
(Default: 1G)

export SPARK_DRIVER_MEMORY=2G ## , Memory for Master (e.g. 1000M, 2G) (Default: 
512 Mb)

 

These are the new runs after these settings:

 

Spark on Hive (3 consecutive runs)

 

 

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 47.987 seconds, Fetched 3 row(s)

 

Around 48 seconds

 

Hive on Spark 1.3.1

 

0: jdbc:hive2://rhes564:10010/default>  select * from dummy where id in (1, 5, 
10);

INFO  :

Query Hive on Spark job[2] stages:

INFO  : 2

INFO  :

Status: Running (Hive on Spark job[2])

INFO  : Job Progress Format

CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]

INFO  : 2016-02-03 16:20:50,315 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:53,369 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:56,478 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:58,530 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:01,570 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:04,680 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:07,767 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:10,877 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:13,941 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:17,019 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:20,090 Stage-2_0: 3(+15)/18

INFO  : 2016-02-03 16:21:21,138 Stage-2_0: 6(+12)/18

INFO  : 2016-02-03 16:21:22,145 Stage-2_0: 10(+8)/18

INFO  : 2016-02-03 16:21:23,150 Stage-2_0: 14(+4)/18

INFO  : 2016-02-03 16:21:24,154 Stage-2_0: 17(+1)/18

INFO  : 2016-02-03 16:21:26,161 Stage-2_0: 18/18 Finished

INFO  : Status: Finished successfully in 36.88 seconds

+---+--+--+---+-+-++--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | 
dummy.random_string | dummy.small_vc  | dummy.padding  |

+---+--+--+---+-+-++--+

| 1 | 0| 0| 63| 
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  | 
xx |

| 5 | 0| 4| 31| 
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUD

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Edward Capriolo
--+-++--+
>>>>>>
>>>>>> 3 rows selected (82.66 seconds)
>>>>>>
>>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>>> in (1, 5, 10);
>>>>>>
>>>>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  |
>>>>>> dummy.randomised  | dummy.random_string |
>>>>>> dummy.small_vc  | dummy.padding  |
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> | 1 | 0| 0| 63
>>>>>> | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>>>>>> xx |
>>>>>>
>>>>>> | 5 | 0| 4| 31
>>>>>> | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>>>>> xx |
>>>>>>
>>>>>> | 10| 99   | 999  | 188
>>>>>> | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>>>>> xx |
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> 3 rows selected (76.835 seconds)
>>>>>>
>>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>>> in (1, 5, 10);
>>>>>>
>>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  |
>>>>>> dummy.randomised  | dummy.random_string |
>>>>>> dummy.small_vc  | dummy.padding  |
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-----+-++--+
>>>>>>
>>>>>> | 1 | 0| 0| 63
>>>>>> | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>>>>>> xx |
>>>>>>
>>>>>> | 5 | 0| 4| 31
>>>>>> | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>>>>> xx |
>>>>>>
>>>>>> | 10| 99   | 999  | 188
>>>>>> | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>>>>> xx |
>>>>>>
>>>>>>
>>>>>> +---+--+--+---+-+-++--+
>>>>>>
>>>>>> 3 rows selected (80.718 seconds)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Three runs returning the same rows in 80 seconds.
>>>>>>
>>>>>>
>>>>>>
>>>>>> It is possible that My Spark engine with Hive is 1.3.1 which is out
>>>>>> of date and that causes this lag.
>>>>>>
>>>>>>
>>>>>>
>>>>>> There are certain queries that one cannot do with Spark. Besides it
>>>>>> does not recognize CHAR fields which is a pain.
>>>>>>
>>>>>>
>>>>>>
>>>>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Koert Kuipers
+-++--+
>>>>>>>
>>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  |
>>>>>>> dummy.randomised  | dummy.random_string 
>>>>>>> |
>>>>>>> dummy.small_vc  | dummy.padding  |
>>>>>>>
>>>>>>>
>>>>>>> +---+--+--+---+-+-++--+
>>>>>>>
>>>>>>> | 1 | 0| 0|
>>>>>>> 63| rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi
>>>>>>> |  1  | xx |
>>>>>>>
>>>>>>> | 5 | 0| 4|
>>>>>>> 31| vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA
>>>>>>> |  5  | xx |
>>>>>>>
>>>>>>> | 10| 99   | 999  |
>>>>>>> 188   | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe
>>>>>>> | 10  | xx |
>>>>>>>
>>>>>>>
>>>>>>> +---+--+--+---+-+-++--+
>>>>>>>
>>>>>>> 3 rows selected (82.66 seconds)
>>>>>>>
>>>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>>>> in (1, 5, 10);
>>>>>>>
>>>>>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>>>>>
>>>>>>>
>>>>>>> +---+--+--+---+-+-++--+
>>>>>>>
>>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  |
>>>>>>> dummy.randomised  | dummy.random_string 
>>>>>>> |
>>>>>>> dummy.small_vc  | dummy.padding  |
>>>>>>>
>>>>>>>
>>>>>>> +---+--+--+---+-+-++--+
>>>>>>>
>>>>>>> | 1 | 0| 0|
>>>>>>> 63| rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi
>>>>>>> |  1  | xx |
>>>>>>>
>>>>>>> | 5 | 0| 4|
>>>>>>> 31| vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA
>>>>>>> |  5  | xx |
>>>>>>>
>>>>>>> | 10| 99   | 999  |
>>>>>>> 188   | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe
>>>>>>> | 10  | xx |
>>>>>>>
>>>>>>>
>>>>>>> +---+--+--+---+-+-++--+
>>>>>>>
>>>>>>> 3 rows selected (76.835 seconds)
>>>>>>>
>>>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>>>> in (1, 5, 10);
>>>>>>>
>>>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>>>>
>>>>>>>
>>>>>>> +---+--+--+---+-+-++--+
>>>>>>>
>>>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  |
>>>>>>> dummy.randomised  | dummy.random_string 
>>>>>>> |
>>>>>>> dummy.small_vc  | dummy.padding  |
>>>>>>>
>>>>>>>
>>>>>>> +---+--+--+---+-+-++--+
>>>>>>>
>>>>

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Mich Talebzadeh
OK thanks. These are my new ENV settings based upon the availability of 
resources

 

export SPARK_EXECUTOR_CORES=12 ##, Number of cores for the workers (Default: 1).

export SPARK_EXECUTOR_MEMORY=5G ## , Memory per Worker (e.g. 1000M, 2G) 
(Default: 1G)

export SPARK_DRIVER_MEMORY=2G ## , Memory for Master (e.g. 1000M, 2G) (Default: 
512 Mb)

 

These are the new runs after these settings:

 

Spark on Hive (3 consecutive runs)

 

 

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 47.987 seconds, Fetched 3 row(s)

 

Around 48 seconds

 

Hive on Spark 1.3.1

 

0: jdbc:hive2://rhes564:10010/default>  select * from dummy where id in (1, 5, 
10);

INFO  :

Query Hive on Spark job[2] stages:

INFO  : 2

INFO  :

Status: Running (Hive on Spark job[2])

INFO  : Job Progress Format

CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]

INFO  : 2016-02-03 16:20:50,315 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:53,369 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:56,478 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:58,530 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:01,570 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:04,680 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:07,767 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:10,877 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:13,941 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:17,019 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:20,090 Stage-2_0: 3(+15)/18

INFO  : 2016-02-03 16:21:21,138 Stage-2_0: 6(+12)/18

INFO  : 2016-02-03 16:21:22,145 Stage-2_0: 10(+8)/18

INFO  : 2016-02-03 16:21:23,150 Stage-2_0: 14(+4)/18

INFO  : 2016-02-03 16:21:24,154 Stage-2_0: 17(+1)/18

INFO  : 2016-02-03 16:21:26,161 Stage-2_0: 18/18 Finished

INFO  : Status: Finished successfully in 36.88 seconds

+---+--+--+---+-+-++--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | 
dummy.random_string | dummy.small_vc  | dummy.padding  |

+---+--+--+---+-+-++--+

| 1 | 0| 0| 63| 
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  | 
xx |

| 5 | 0| 4| 31| 
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  | 
xx |

| 10| 99   | 999  | 188   | 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  | 
xx |

+---+--+--+---+-+-++--+

3 rows selected (37.161 seconds)

 

Around 37 seconds

 

Interesting results

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
Sent: 03 February 2016 12:47
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Stephen Sprague
3 map = 52%,  reduce = 0%,
> Cumulative CPU 40.7 sec
>
> INFO  : 2016-02-03 21:26:14,512 Stage-3 map = 56%,  reduce = 0%,
> Cumulative CPU 42.69 sec
>
> INFO  : 2016-02-03 21:26:17,607 Stage-3 map = 60%,  reduce = 0%,
> Cumulative CPU 44.69 sec
>
> INFO  : 2016-02-03 21:26:20,722 Stage-3 map = 64%,  reduce = 0%,
> Cumulative CPU 46.83 sec
>
> INFO  : 2016-02-03 21:26:22,787 Stage-3 map = 100%,  reduce = 0%,
> Cumulative CPU 48.46 sec
>
> INFO  : 2016-02-03 21:26:29,030 Stage-3 map = 100%,  reduce = 100%,
> Cumulative CPU 50.01 sec
>
> INFO  : MapReduce Total cumulative CPU time: 50 seconds 10 msec
>
> INFO  : Ended Job = job_1454534517374_0002
>
> ++-+-+--+
>
> | t.calendar_month_desc  | c.channel_desc  | totalsales  |
>
> ++-+-+--+
>
> ++-+-+--+
>
> 150 rows selected (85.67 seconds)
>
>
>
> *3)**Spark on Hive engine completes in 267 sec*
>
> spark-sql> SELECT t.calendar_month_desc, c.channel_desc,
> SUM(s.amount_sold) AS TotalSales
>
>  > FROM sales s, times t, channels c
>
>  > WHERE s.time_id = t.time_id
>
>  > AND   s.channel_id = c.channel_id
>
>  > GROUP BY t.calendar_month_desc, c.channel_desc
>
>  > ;
>
> Time taken: 267.138 seconds, Fetched 150 row(s)
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 03 February 2016 16:21
> *To:* user@hive.apache.org
> *Subject:* RE: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> OK thanks. These are my new ENV settings based upon the availability of
> resources
>
>
>
> export SPARK_EXECUTOR_CORES=12 ##, Number of cores for the workers
> (Default: 1).
>
> export SPARK_EXECUTOR_MEMORY=5G ## , Memory per Worker (e.g. 1000M, 2G)
> (Default: 1G)
>
> export SPARK_DRIVER_MEMORY=2G ## , Memory for Master (e.g. 1000M, 2G)
> (Default: 512 Mb)
>
>
>
> These are the new runs after these settings:
>
>
>
> *Spark on Hive (3 consecutive runs)*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 47.987 seconds, Fetched 3 row(s)
>
>
>
> Around 48 seconds
>
>
>
> *Hive on Spark 1.3.1*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default>  select * from dummy where id in
> (1, 5, 10);
>
> INFO  :
>
> Query Hive on Spark job[2] stages:
>
> INFO  : 2
>
> INFO  :
>
> Status: Running (Hive on Spark job[2])
>
> INFO  : Job Progress Format
>
> CurrentTime StageId_Stag

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
   | dummy.small_vc  |
> dummy.padding  |
>
>
> +---+--+--+---+-+-++--+
>
> | 1 | 0| 0| 63|
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
> xx |
>
> | 5 | 0| 4| 31|
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
> xx |
>
> | 10| 99   | 999  | 188   |
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
> xx |
>
>
> +---+--+--+---+-+-++--+
>
> 3 rows selected (76.835 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1,
> 5, 10);
>
> INFO  : Status: Finished successfully in 80.54 seconds
>
>
> +---+--+--+---+-+-++--+
>
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
> | dummy.random_string | dummy.small_vc  |
> dummy.padding  |
>
>
> +---+--+--+---+-+-++--+
>
> | 1 | 0| 0| 63|
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
> xx |
>
> | 5 | 0| 4| 31|
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
> xx |
>
> | 10| 99   | 999  | 188   |
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
> xx |
>
>
> +---+--+--+---+-+-++--+
>
> 3 rows selected (80.718 seconds)
>
>
>
> Three runs returning the same rows in 80 seconds.
>
>
>
> It is possible that My Spark engine with Hive is 1.3.1 which is out of
> date and that causes this lag.
>
>
>
> There are certain queries that one cannot do with Spark. Besides it does
> not recognize CHAR fields which is a pain.
>
>
>
> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>
>  > SELECT t.calendar_month_desc, c.channel_desc,
> SUM(s.amount_sold) AS TotalSales
>
>  > FROM sales s, times t, channels c
>
>  > WHERE s.time_id = t.time_id
>
>  > AND   s.channel_id = c.channel_id
>
>  > GROUP BY t.calendar_month_desc, c.channel_desc
>
>  > ;
>
> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>
> .
>
> You are likely trying to use an unsupported Hive feature.";
>
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
I think the diff is not only about which does optimization but more on
feature parity. Hive on Spark offers all functional features that Hive
offers and these features play out faster. However, Spark SQL is far from
offering this parity as far as I know.

On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> My understanding is that with Hive on Spark engine, one gets the Hive
> optimizer and Spark query engine
>
>
>
> With spark using Hive metastore, Spark does both the optimization and
> query engine. The only value add is that one can access the underlying Hive
> tables from spark-sql etc
>
>
>
>
>
> Is this assessment correct?
>
>
>
>
>
>
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
>


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
y in 80.54 seconds

+---+--+--+---+-+-++--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | 
dummy.random_string | dummy.small_vc  | dummy.padding  |

+---+--+--+---+-+-++--+

| 1 | 0| 0| 63| 
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  | 
xx |

| 5 | 0| 4| 31| 
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  | 
xx |

| 10| 99   | 999  | 188   | 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  | 
xx |

+---+--+--+---+-+-++--+

3 rows selected (80.718 seconds)

 

Three runs returning the same rows in 80 seconds. 

 

It is possible that My Spark engine with Hive is 1.3.1 which is out of date and 
that causes this lag. 

 

There are certain queries that one cannot do with Spark. Besides it does not 
recognize CHAR fields which is a pain.

 

spark-sql> CREATE TEMPORARY TABLE tmp AS

 > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS 
TotalSales

 > FROM sales s, times t, channels c

 > WHERE s.time_id = t.time_id

 > AND   s.channel_id = c.channel_id

 > GROUP BY t.calendar_month_desc, c.channel_desc

 > ;

Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7

.

You are likely trying to use an unsupported Hive feature.";
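
One workaround (a sketch only, assuming a HiveContext named sqlContext as in spark-shell): run the SELECT itself and register the result as an in-memory temporary table instead of using CREATE TEMPORARY TABLE ... AS, which spark-sql rejects here:

    val tmp = sqlContext.sql(
      """SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
        |FROM sales s, times t, channels c
        |WHERE s.time_id = t.time_id
        |AND   s.channel_id = c.channel_id
        |GROUP BY t.calendar_month_desc, c.channel_desc""".stripMargin)
    tmp.registerTempTable("tmp")
    tmp.cache()
    sqlContext.sql("SELECT * FROM tmp ORDER BY TotalSales DESC").show()

registerTempTable keeps the result visible to later SQL in the same session without touching the metastore.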

 

 

 

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
Sent: 02 February 2016 23:12
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

I think the diff is not only about which does optimization but more on feature 
parity. Hive on Spark offers all functional features that Hive offers and these 
features play out faster. However, Spark SQL is far from offering this parity 
as far as I know.

 

On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Hi,

 

My understanding is that with Hive on Spark engine, one gets the Hive optimizer 
and Spark query engine

 

With spark using Hive metastore, Spark does both the optimization and query 
engine. The only value add is that one can access the underlying Hive tables 
from spark-sql etc

 

 

Is this assessment correct?

 

 

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Philip Lee
From my experience, spark sql has its own optimizer to support Hive query
and metastore. As of Spark 1.5.2, its optimizer is named Catalyst.
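
That plan is easy to inspect from spark-shell, where sqlContext is predefined; a small sketch (the table name is illustrative):

    // Prints the parsed, analyzed, optimized (Catalyst) and physical plans.
    val df = sqlContext.sql("SELECT id, count(*) FROM dummy GROUP BY id")
    df.explain(true)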
On Feb 3, 2016 at 12:12 AM, "Xuefu Zhang" wrote:

> I think the diff is not only about which does optimization but more on
> feature parity. Hive on Spark offers all functional features that Hive
> offers and these features play out faster. However, Spark SQL is far from
> offering this parity as far as I know.
>
> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>>
>>
>> My understanding is that with Hive on Spark engine, one gets the Hive
>> optimizer and Spark query engine
>>
>>
>>
>> With spark using Hive metastore, Spark does both the optimization and
>> query engine. The only value add is that one can access the underlying Hive
>> tables from spark-sql etc
>>
>>
>>
>>
>>
>> Is this assessment correct?
>>
>>
>>
>>
>>
>>
>>
>> Thanks
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>> one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>>
>
>


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi,

 

Are you referring to spark-shell with Scala, Python and others? 

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Koert Kuipers [mailto:ko...@tresata.com] 
Sent: 03 February 2016 00:09
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

 

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com 
<mailto:xzh...@cloudera.com> > wrote:

When comparing the performance, you need to do it apple vs apple. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel. 
However, you didn't provide information on how much Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or Spark SQL is indeed faster. You should be able to see the resource usage in 
YARN resource manage URL.

--Xuefu

 

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Thanks Jeff.

 

Obviously Hive is much more feature rich compared to Spark. Having said that in 
certain areas for example where the SQL feature is available in Spark, Spark 
seems to deliver faster.

 

This may be:

 

1.Spark does both the optimisation and execution seamlessly

2.Hive on Spark has to invoke YARN that adds another layer to the process

 

Now I did some simple tests on a 100Million rows ORC table available through 
Hive to both.

 

Spark 1.5.2 on Hive 1.2.1 Metastore

 

 

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.805 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.358 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.563 seconds, Fetched 3 row(s)

 

So three runs returning three rows just over 50 seconds

 

Hive 1.2.1 on spark 1.3.1 execution engine

 

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);

INFO  :

Query Hive on Spark job[4] stages:

INFO  : 4

INFO  :

Status: Running (Hive on Spark job[4])

INFO  : Status: Finished successfully in 82.49 seconds

+---+--+--+---+-+-++--+

| dum

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi Jeff,

 

In below

 

…. You should be able to see the resource usage in YARN resource manage URL.

 

Just to be clear we are talking about Port 8088/cluster?

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Koert Kuipers [mailto:ko...@tresata.com] 
Sent: 03 February 2016 00:09
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

 

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com 
<mailto:xzh...@cloudera.com> > wrote:

When comparing the performance, you need to do it apple vs apple. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel. 
However, you didn't provide information on how much Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or Spark SQL is indeed faster. You should be able to see the resource usage in 
YARN resource manage URL.

--Xuefu

 

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Thanks Jeff.

 

Obviously Hive is much more feature rich compared to Spark. Having said that in 
certain areas for example where the SQL feature is available in Spark, Spark 
seems to deliver faster.

 

This may be:

 

1.Spark does both the optimisation and execution seamlessly

2.Hive on Spark has to invoke YARN that adds another layer to the process

 

Now I did some simple tests on a 100Million rows ORC table available through 
Hive to both.

 

Spark 1.5.2 on Hive 1.2.1 Metastore

 

 

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.805 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.358 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.563 seconds, Fetched 3 row(s)

 

So three runs returning three rows just over 50 seconds

 

Hive 1.2.1 on spark 1.3.1 execution engine

 

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);

INFO  :

Query Hive on Spark job[4] stages:

INFO  : 4

INFO  :

Status: Running (Hive on Spark j

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
 |
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>> xx |
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> 3 rows selected (82.66 seconds)
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>> (1, 5, 10);
>>
>> INFO  : Status: Finished successfully in 76.67 seconds
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>> | dummy.random_string | dummy.small_vc  |
>> dummy.padding  |
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | 1 | 0| 0| 63|
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>> xx |
>>
>> | 5 | 0| 4| 31|
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>> xx |
>>
>> | 10| 99   | 999  | 188   |
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>> xx |
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> 3 rows selected (76.835 seconds)
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>> (1, 5, 10);
>>
>> INFO  : Status: Finished successfully in 80.54 seconds
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>> | dummy.random_string | dummy.small_vc  |
>> dummy.padding  |
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | 1 | 0| 0| 63|
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>> xx |
>>
>> | 5 | 0| 4| 31|
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>> xx |
>>
>> | 10| 99   | 999  | 188   |
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>> xx |
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> 3 rows selected (80.718 seconds)
>>
>>
>>
>> Three runs returning the same rows in 80 seconds.
>>
>>
>>
>> It is possible that My Spark engine with Hive is 1.3.1 which is out of
>> date and that causes this lag.
>>
>>
>>
>> There are certain queries that one cannot do with Spark. Besides it does
>> not recognize CHAR fields which is a pain.
>>
>>
>>
>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>
>>  > SELECT t.calendar_month_desc, c.channel_desc,
>> SUM(s.amount_sold) AS TotalSales
>>
>>  > FROM sales s, times t, channels c
>>
>>  > WHERE s.time_id = t.time_id
>>
>>  > AND   s.channel_id = c.channel_id
>>
>>  > GROUP BY t.calendar_month_desc, c.channel_desc
>>
>>  > ;
>>
>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>
>> .
>>
>> You are likely trying to use an unsupported Hive feature.";
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>&

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Ryan Harris
https://github.com/myui/hivemall

As long as you are comfortable with Java UDFs, the sky is really the
limit... it's not for everyone, and Spark does have many advantages, but they are
two tools that can complement each other in numerous ways.

I don't know that there is necessarily a universal "better" for how to use
Spark as an execution engine (or whether Spark is necessarily the *best* execution
engine for any given Hive job).

The reality is that once you start factoring in the numerous tuning parameters
of the systems and jobs, there probably isn't a clear answer. For some queries,
the Catalyst optimizer may do a better job... is it going to do a better job
with ORC-based data? Less likely, IMO.
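
For illustration, a minimal Hive UDF along these lines might look like the
following sketch in Scala (the class name, behaviour, jar name and function
name are hypothetical):

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// A trivial Hive UDF sketch: trim and lower-case a string.
// Hive finds evaluate() by reflection, so extending UDF is all that is needed.
class CleanString extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) null
    else new Text(input.toString.trim.toLowerCase)
  }
}

Packaged into a jar, it would be registered with something like
ADD JAR clean-string.jar; CREATE TEMPORARY FUNCTION clean_string AS 'CleanString';
after which it can be called from any Hive query regardless of the execution engine.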

From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Tuesday, February 02, 2016 9:50 PM
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

Yeah, but have you ever seen somebody write a real analytical program in Hive?
How? Where are the basic abstractions to wrap up a large number of operations
(joins, groupBys) into a single function call? Where are the tools to write
nice unit tests for that?
For example, in Spark I can write a DataFrame => DataFrame function that internally does
many joins, groupBys and complex operations, all unit tested and perfectly
re-usable. And in Hive? Copy-pasting SQL queries around? That's just dangerous.
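
For illustration, such a reusable DataFrame => DataFrame function might look
like the following sketch (table and column names are hypothetical, loosely
echoing the sales/channels example earlier in the thread):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// A reusable, unit-testable transformation: join sales to channels and aggregate.
def salesByChannel(sales: DataFrame, channels: DataFrame): DataFrame =
  sales
    .join(channels, sales("channel_id") === channels("channel_id"))
    .groupBy(channels("channel_desc"))
    .agg(sum(sales("amount_sold")).as("total_sales"))

A function like this can be exercised in a unit test against two small
in-memory DataFrames and reused across jobs, which is the point being made
about copy-pasted SQL.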

On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
Hive has numerous extension points, you are not boxed in by a long shot.


On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
When comparing the performance, you need to do it apple vs apple. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel. 
However, you didn't provide information on how much Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or Spark SQL is indeed faster. You should be able to see the resource usage in 
YARN resource manage URL.
--Xuefu

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
Thanks Jeff.

Obviously Hive is much more feature rich compared to Spark. Having said that in 
certain areas for example where the SQL feature is available in Spark, Spark 
seems to deliver faster.

This may be:


1.Spark does both the optimisation and execution seamlessly

2.Hive on Spark has to invoke YARN that adds another layer to the process

Now I did some simple tests on a 100Million rows ORC table available through 
Hive to both.

Spark 1.5.2 on Hive 1.2.1 Metastore


spark-sql> select * from dummy where id in (1, 5, 10);
1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx
5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx
10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx
Time taken: 50.805 seconds, Fetched 3 row(s)
spark-sql> select * from dummy where id in (1, 5, 10);
1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx
5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx
10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx
Time taken: 50.358 seconds, Fetched 3 row(s)
spark-sql> select * from dummy where id in (1, 5, 10);
1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx
5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx
10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx
Time taken: 50.563 seconds, Fetched 3 row(s)

So three runs returning three rows just over 50 seconds

Hive 1.2.1 on spark 1.3.1 execution engine

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);
INFO  :
Query Hive on Spark job[4] stages:
INFO  : 4
INFO  :
Status: Running (Hive on Spark job[4])
INFO  : Status: Finished successfully in 82.49 seconds
+---+--+--+---+-+-++--+
| dummy.id  | dummy.clustered  | dummy.sc

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Jörn Franke
-+--+---+-+-++--+
>>>>> 
>>>>> 3 rows selected (76.835 seconds)
>>>>> 
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in 
>>>>> (1, 5, 10);
>>>>> 
>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>> 
>>>>> +---+--+--+---+-+-++--+
>>>>> 
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |   
>>>>>   dummy.random_string | dummy.small_vc  | 
>>>>> dummy.padding  |
>>>>> 
>>>>> +---+--+--+---+-+-++--+
>>>>> 
>>>>> | 1 | 0| 0| 63| 
>>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  | 
>>>>> xx |
>>>>> 
>>>>> | 5 | 0| 4| 31| 
>>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  | 
>>>>> xx |
>>>>> 
>>>>> | 10| 99   | 999  | 188   | 
>>>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000  | 
>>>>> xx |
>>>>> 
>>>>> +---+--+--+---+-+-++--+
>>>>> 
>>>>> 3 rows selected (80.718 seconds)
>>>>> 
>>>>>  
>>>>> 
>>>>> Three runs returning the same rows in 80 seconds.
>>>>> 
>>>>>  
>>>>> 
>>>>> It is possible that My Spark engine with Hive is 1.3.1 which is out of 
>>>>> date and that causes this lag.
>>>>> 
>>>>>  
>>>>> 
>>>>> There are certain queries that one cannot do with Spark. Besides it does 
>>>>> not recognize CHAR fields which is a pain.
>>>>> 
>>>>>  
>>>>> 
>>>>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>>>> 
>>>>>  > SELECT t.calendar_month_desc, c.channel_desc, 
>>>>> SUM(s.amount_sold) AS TotalSales
>>>>> 
>>>>>  > FROM sales s, times t, channels c
>>>>> 
>>>>>  > WHERE s.time_id = t.time_id
>>>>> 
>>>>>  > AND   s.channel_id = c.channel_id
>>>>> 
>>>>>  > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>> 
>>>>>  > ;
>>>>> 
>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>>> 
>>>>> .
>>>>> 
>>>>> You are likely trying to use an unsupported Hive feature.";
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>> 
>>>>>  
>>>>> 
>>>>> LinkedIn  
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> 
>>>>>  
>>>>> 
>>>>> Sybase ASE 15 Gold Medal Award 2008
>>>>> 
>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>> 
>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>> 
>>>>> Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 
>>>>> 15", ISBN 978-0-9563693-0-7.
>>>>> 
>>>>> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
>>>>> 978-0-9759693-0-4
>>>>> 
>>>>> Publications due shortly:
>>>>> 
>>>>> Complex Event Processing in Heterogeneous Environments, ISBN: 
>>>>> 978-0-9563693-3-8
>>>>> 
>>>>> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, 
>>>>> volume one out shortly
>>>>> 
>>>>>  
>>>>> 
>>>>> http://talebzadehmich.wordpress.com
>>>>> 
>>>>>  
>>>>> 
>>>>> NOTE: The information in this email is proprietary and confidential. This 
>>>>> message is for the designated recipient only, if you are not the intended 
>>>>> recipient, you should destroy it immediately. Any information in this 
>>>>> message shall not be understood as given or endorsed by Peridale 
>>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so 
>>>>> stated. It is the responsibility of the recipient to ensure that this 
>>>>> email is virus free, therefore neither Peridale Technology Ltd, its 
>>>>> subsidiaries nor their employees accept any responsibility.
>>>>> 
>>>>>  
>>>>> 
>>>>> From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
>>>>> Sent: 02 February 2016 23:12
>>>>> To: user@hive.apache.org
>>>>> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>> 
>>>>>  
>>>>> 
>>>>> I think the diff is not only about which does optimization but more on 
>>>>> feature parity. Hive on Spark offers all functional features that Hive 
>>>>> offers and these features play out faster. However, Spark SQL is far from 
>>>>> offering this parity as far as I know.
>>>>> 
>>>>>  
>>>>> 
>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> 
>>>>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>>  
>>>>> 
>>>>> My understanding is that with Hive on Spark engine, one gets the Hive 
>>>>> optimizer and Spark query engine
>>>>> 
>>>>>  
>>>>> 
>>>>> With spark using Hive metastore, Spark does both the optimization and 
>>>>> query engine. The only value add is that one can access the underlying 
>>>>> Hive tables from spark-sql etc
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Is this assessment correct?
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Thanks
>>>>> 
>>>>>  
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>> 
>>>>>  
>>>>> 
>>>>> LinkedIn  
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> 
>>>>>  
>>>>> 
>>>>> Sybase ASE 15 Gold Medal Award 2008
>>>>> 
>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>> 
>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>> 
>>>>> Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 
>>>>> 15", ISBN 978-0-9563693-0-7.
>>>>> 
>>>>> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
>>>>> 978-0-9759693-0-4
>>>>> 
>>>>> Publications due shortly:
>>>>> 
>>>>> Complex Event Processing in Heterogeneous Environments, ISBN: 
>>>>> 978-0-9563693-3-8
>>>>> 
>>>>> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, 
>>>>> volume one out shortly
>>>>> 
>>>>>  
>>>>> 
>>>>> http://talebzadehmich.wordpress.com
>>>>> 
>>>>>  
>>>>> 
>>>>> NOTE: The information in this email is proprietary and confidential. This 
>>>>> message is for the designated recipient only, if you are not the intended 
>>>>> recipient, you should destroy it immediately. Any information in this 
>>>>> message shall not be understood as given or endorsed by Peridale 
>>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so 
>>>>> stated. It is the responsibility of the recipient to ensure that this 
>>>>> email is virus free, therefore neither Peridale Technology Ltd, its 
>>>>> subsidiaries nor their employees accept any responsibility.
>>>>> 
>> 
>> 
>> -- 
>> Sorry this was sent from mobile. Will do less grammar and spell check than 
>> usual.
> 


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
Yes. The ability to start with SQL but, when needed, expand into more full-blown
programming languages, machine learning, etc. is a huge plus. After
all, this is a cluster, and just querying or extracting data to move it off
the cluster into some other analytics tool is going to be very inefficient
and defeats, to some extent, the purpose of having a cluster. So you want to
have the capability to do more than queries and ETL, and Spark is that
ticket. Hive simply is not; well, not for anything somewhat complex anyhow.
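
For illustration, that pattern might look like the following sketch against the
Spark 1.x HiveContext (the application, table and column names are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("sql-then-scala"))
val hc = new HiveContext(sc)

// Start with plain SQL against an existing Hive table...
val clicks = hc.sql("SELECT user_id, page_id FROM web_clicks WHERE ds = '2016-02-02'")

// ...then stay on the cluster and keep going in a full programming language,
// instead of extracting the data into an external analytics tool.
val pagesPerUser = clicks.groupBy("user_id").count()
pagesPerUser.show(10)

The same DataFrame could then feed MLlib or any other Scala code, which is the
capability beyond queries and ETL referred to above.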


On Tue, Feb 2, 2016 at 8:06 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Hi,
>
>
>
> Are you referring to spark-shell with Scala, Python and others?
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* 03 February 2016 00:09
>
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
---+-++--+
>>>>
>>>> | 1 | 0| 0| 63|
>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>>>> xx |
>>>>
>>>> | 5 | 0| 4| 31|
>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>>> xx |
>>>>
>>>> | 10| 99   | 999  | 188   |
>>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>>> xx |
>>>>
>>>>
>>>> +---+--+--+---+-+-++--+
>>>>
>>>> 3 rows selected (80.718 seconds)
>>>>
>>>>
>>>>
>>>> Three runs returning the same rows in 80 seconds.
>>>>
>>>>
>>>>
>>>> It is possible that My Spark engine with Hive is 1.3.1 which is out of
>>>> date and that causes this lag.
>>>>
>>>>
>>>>
>>>> There are certain queries that one cannot do with Spark. Besides it
>>>> does not recognize CHAR fields which is a pain.
>>>>
>>>>
>>>>
>>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>>
>>>>  > SELECT t.calendar_month_desc, c.channel_desc,
>>>> SUM(s.amount_sold) AS TotalSales
>>>>
>>>>  > FROM sales s, times t, channels c
>>>>
>>>>  > WHERE s.time_id = t.time_id
>>>>
>>>>  > AND   s.channel_id = c.channel_id
>>>>
>>>>  > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>
>>>>  > ;
>>>>
>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>>
>>>> .
>>>>
>>>> You are likely trying to use an unsupported Hive feature.";
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>
>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>
>>>>
>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>
>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>
>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>> 978-0-9759693-0-4*
>>>>
>>>> *Publications due shortly:*
>>>>
>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>> 978-0-9563693-3-8
>>>>
>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, 
>>>> volume
>>>> one out shortly
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> NOTE: The information in this email is proprietary and confidential.
>>>> This message is for the designated recipient only, if you are not the
>>>> intended recipient, you should destroy it immediately. Any information in
>>>> this message shall not be understood as given or endorsed by Peridale
>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>>> stated. It is the responsibility of the recipient to ensure that this email
>>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>>> nor their employees accept any responsibility.
>>>>
>>>>
>>>>
>>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
>>>> *Sent:* 02 February 2016 23:12
>>>> *To:* user@hive.apache.org
>>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>
>>>>
>>>>
>>>> I think the diff is not only about which does optimization but more on
>>>> feature parity. Hive on Spark offers all functional features that Hive
>>>> offers and these features play out faster. However, Spark SQL is far from
>>>> offering this parity as far as I know.
>>>>
>>>>
>>>>
>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> My understanding is that with Hive on Spark engine, one gets the Hive
>>>> optimizer and Spark query engine
>>>>
>>>>
>>>>
>>>> With spark using Hive metastore, Spark does both the optimization and
>>>> query engine. The only value add is that one can access the underlying Hive
>>>> tables from spark-sql etc
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Is this assessment correct?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>
>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>
>>>>
>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>
>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>
>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>> 978-0-9759693-0-4*
>>>>
>>>> *Publications due shortly:*
>>>>
>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>> 978-0-9563693-3-8
>>>>
>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, 
>>>> volume
>>>> one out shortly
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> NOTE: The information in this email is proprietary and confidential.
>>>> This message is for the designated recipient only, if you are not the
>>>> intended recipient, you should destroy it immediately. Any information in
>>>> this message shall not be understood as given or endorsed by Peridale
>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>>> stated. It is the responsibility of the recipient to ensure that this email
>>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>>> nor their employees accept any responsibility.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
;> | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>>>> xx |
>>>>>
>>>>>
>>>>> +---+--+--+---+-+-++--+
>>>>>
>>>>> 3 rows selected (76.835 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>>>> (1, 5, 10);
>>>>>
>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>>
>>>>>
>>>>> +---+--+--+---+-+-++--+
>>>>>
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>>>>> | dummy.random_string | dummy.small_vc  |
>>>>> dummy.padding  |
>>>>>
>>>>>
>>>>> +---+--+--+---+-+-++--+
>>>>>
>>>>> | 1 | 0| 0| 63
>>>>> | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>>>>> xx |
>>>>>
>>>>> | 5 | 0| 4| 31
>>>>> | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>>>> xx |
>>>>>
>>>>> | 10| 99   | 999  | 188
>>>>> | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>>>> xx |
>>>>>
>>>>>
>>>>> +---+--+--+---+-+-++--+
>>>>>
>>>>> 3 rows selected (80.718 seconds)
>>>>>
>>>>>
>>>>>
>>>>> Three runs returning the same rows in 80 seconds.
>>>>>
>>>>>
>>>>>
>>>>> It is possible that My Spark engine with Hive is 1.3.1 which is out of
>>>>> date and that causes this lag.
>>>>>
>>>>>
>>>>>
>>>>> There are certain queries that one cannot do with Spark. Besides it
>>>>> does not recognize CHAR fields which is a pain.
>>>>>
>>>>>
>>>>>
>>>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>>>
>>>>>  > SELECT t.calendar_month_desc, c.channel_desc,
>>>>> SUM(s.amount_sold) AS TotalSales
>>>>>
>>>>>  > FROM sales s, times t, channels c
>>>>>
>>>>>  > WHERE s.time_id = t.time_id
>>>>>
>>>>>  > AND   s.channel_id = c.channel_id
>>>>>
>>>>>  > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>>
>>>>>  > ;
>>>>>
>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>>>
>>>>> .
>>>>>
>>>>> You are likely trying to use an unsupported Hive feature.";
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>>
>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>>
>>>>>
>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>>
>>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>>
>>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>>> 978-0-9759693-0-4*
>>>>>
>>>>> *Publications due shortly:*
>>>>>
>>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>>> 978-0-9563693-3-8
>>>>>
>>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, 
>>>>> volume
>>>>> one out shortly
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> NOTE: The information in this email is proprietary and confidential.
>>>>> This message is for the designated recipient only, if you are not the
>>>>> intended recipient, you should destroy it immediately. Any information in
>>>>> this message shall not be understood as given or endorsed by Peridale
>>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>>>> stated. It is the responsibility of the recipient to ensure that this 
>>>>> email
>>>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>>>> nor their employees accept any responsibility.
>>>>>
>>>>>
>>>>>
>>>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
>>>>> *Sent:* 02 February 2016 23:12
>>>>> *To:* user@hive.apache.org
>>>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>>
>>>>>
>>>>>
>>>>> I think the diff is not only about which does optimization but more on
>>>>> feature parity. Hive on Spark offers all functional features that Hive
>>>>> offers and these features play out faster. However, Spark SQL is far from
>>>>> offering this parity as far as I know.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> My understanding is that with Hive on Spark engine, one gets the Hive
>>>>> optimizer and Spark query engine
>>>>>
>>>>>
>>>>>
>>>>> With spark using Hive metastore, Spark does both the optimization and
>>>>> query engine. The only value add is that one can access the underlying 
>>>>> Hive
>>>>> tables from spark-sql etc
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Is this assessment correct?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>>
>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>>
>>>>>
>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>>
>>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>>
>>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>>> 978-0-9759693-0-4*
>>>>>
>>>>> *Publications due shortly:*
>>>>>
>>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>>> 978-0-9563693-3-8
>>>>>
>>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, 
>>>>> volume
>>>>> one out shortly
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> NOTE: The information in this email is proprietary and confidential.
>>>>> This message is for the designated recipient only, if you are not the
>>>>> intended recipient, you should destroy it immediately. Any information in
>>>>> this message shall not be understood as given or endorsed by Peridale
>>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>>>> stated. It is the responsibility of the recipient to ensure that this 
>>>>> email
>>>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>>>> nor their employees accept any responsibility.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
>>
>
>


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Edward Capriolo
t;>
>>> There are certain queries that one cannot do with Spark. Besides it does
>>> not recognize CHAR fields which is a pain.
>>>
>>>
>>>
>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>
>>>  > SELECT t.calendar_month_desc, c.channel_desc,
>>> SUM(s.amount_sold) AS TotalSales
>>>
>>>  > FROM sales s, times t, channels c
>>>
>>>  > WHERE s.time_id = t.time_id
>>>
>>>  > AND   s.channel_id = c.channel_id
>>>
>>>  > GROUP BY t.calendar_month_desc, c.channel_desc
>>>
>>>  > ;
>>>
>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>
>>> .
>>>
>>> You are likely trying to use an unsupported Hive feature.";
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>
>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>
>>>
>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>
>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>>> 15", ISBN 978-0-9563693-0-7*.
>>>
>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>> 978-0-9759693-0-4*
>>>
>>> *Publications due shortly:*
>>>
>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>> 978-0-9563693-3-8
>>>
>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>>> one out shortly
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> NOTE: The information in this email is proprietary and confidential.
>>> This message is for the designated recipient only, if you are not the
>>> intended recipient, you should destroy it immediately. Any information in
>>> this message shall not be understood as given or endorsed by Peridale
>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>> stated. It is the responsibility of the recipient to ensure that this email
>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>> nor their employees accept any responsibility.
>>>
>>>
>>>
>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
>>> *Sent:* 02 February 2016 23:12
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>
>>>
>>>
>>> I think the diff is not only about which does optimization but more on
>>> feature parity. Hive on Spark offers all functional features that Hive
>>> offers and these features play out faster. However, Spark SQL is far from
>>> offering this parity as far as I know.
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> My understanding is that with Hive on Spark engine, one gets the Hive
>>> optimizer and Spark query engine
>>>
>>>
>>>
>>> With spark using Hive metastore, Spark does both the optimization and
>>> query engine. The only value add is that one can access the underlying Hive
>>> tables from spark-sql etc
>>>
>>>
>>>
>>>
>>>
>>> Is this assessment correct?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>
>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>
>>>
>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>
>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>>> 15", ISBN 978-0-9563693-0-7*.
>>>
>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>> 978-0-9759693-0-4*
>>>
>>> *Publications due shortly:*
>>>
>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>> 978-0-9563693-3-8
>>>
>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>>> one out shortly
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> NOTE: The information in this email is proprietary and confidential.
>>> This message is for the designated recipient only, if you are not the
>>> intended recipient, you should destroy it immediately. Any information in
>>> this message shall not be understood as given or endorsed by Peridale
>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>> stated. It is the responsibility of the recipient to ensure that this email
>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>> nor their employees accept any responsibility.
>>>
>>>
>>>
>>>
>>>
>>
>>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
Yes, regardless what spark mode you're running in, from Spark AM webui, you
should be able to see how many task are concurrently running. I'm a little
surprised to see that your Hive configuration only allows 2 map tasks to
run in parallel. If your cluster has the capacity, you should parallelize
all the tasks to achieve optimal performance. Since I don't know your Spark
SQL configuration, I cannot tell how much parallelism you have over there.
Thus, I'm not sure if your comparison is valid.

--Xuefu

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Hi Jeff,
>
>
>
> In below
>
>
>
> …. You should be able to see the resource usage in YARN resource manage
> URL.
>
>
>
> Just to be clear we are talking about Port 8088/cluster?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* 03 February 2016 00:09
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>

Re: Hive on Spark task running time is too long

2016-01-11 Thread Xuefu Zhang
You should check the executor log to find out why it failed. There might be
more explanation there.

--Xuefu

On Sun, Jan 10, 2016 at 11:21 PM, Jone Zhang  wrote:

> *I have submitted an application many times.*
> *Most of the applications run correctly. See attach 1.*
> *But one of them breaks unexpectedly. See attach 2.1 and 2.2.*
>
> *Why is a task with such a small data size running for so long, and why can't
> I find any helpful information in the yarn logs?*
>
> *Part of the log information is as follows*
> 16/01/11 12:45:19 INFO storage.BlockManagerMasterEndpoint: Trying to
> remove executor 1 from BlockManagerMaster.
> 16/01/11 12:45:19 INFO storage.BlockManagerMasterEndpoint: Removing block
> manager BlockManagerId(1, 10.226.148.160, 44366)
> 16/01/11 12:45:19 INFO storage.BlockManagerMaster: Removed 1 successfully
> in removeExecutor
> 16/01/11 12:50:32 INFO storage.BlockManagerInfo: Removed
> broadcast_2_piece0 on 10.219.58.123:39594 in memory (size: 92.2 KB, free:
> 441.4 MB)
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 2 with
> no recent heartbeats: 604535 ms exceeds timeout 60 ms
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 2
> (already removed): Executor heartbeat timed out after 604535 ms
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 1 with
> no recent heartbeats: 609228 ms exceeds timeout 60 ms
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 1
> (already removed): Executor heartbeat timed out after 609228 ms
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 4 with
> no recent heartbeats: 615098 ms exceeds timeout 60 ms
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 4
> (already removed): Executor heartbeat timed out after 615098 ms
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 3 with
> no recent heartbeats: 616730 ms exceeds timeout 60 ms
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to
> kill executor(s) 2
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 3
> (already removed): Executor heartbeat timed out after 616730 ms
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to
> kill 2 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested
> to kill executor(s) .
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to
> kill executor(s) 1
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to
> kill 1 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested
> to kill executor(s) .
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to
> kill executor(s) 4
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to
> kill 4 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested
> to kill executor(s) .
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to
> kill executor(s) 3
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to
> kill 3 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested
> to kill executor(s) .
> 16/01/11 14:29:55 WARN client.RemoteDriver: Shutting down driver because
> RPC channel was closed.
> 16/01/11 14:29:55 INFO client.RemoteDriver: Shutting down remote driver.
> 16/01/11 14:29:55 INFO scheduler.DAGScheduler: Asked to cancel job 1
> 16/01/11 14:29:55 INFO client.RemoteDriver: Failed to run job
> 2fbbb881-988b-4454-ad9e-a20783aaf38e
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:503)
> at
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:371)
> at
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 16/01/11 14:29:55 INFO cluster.YarnClusterScheduler: Cancelling stage 2
> 16/01/11 14:29:55 INFO cluster.YarnClusterScheduler: Removed TaskSet 2.0,
> whose tasks have all completed, from pool
> 16/01/11 14:29:55 INFO cluster.YarnClusterScheduler: Stage 2 was cancelled
> 16/01/11 14:29:55 INFO scheduler.DAGScheduler: ShuffleMapStage 2
> (mapPartitionsToPair at MapTran.java:31) failed in 6278.824 s
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/metrics/json,null}
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/api,null}
> 

Re: Hive on Spark throw java.lang.NullPointerException

2015-12-18 Thread Xuefu Zhang
Could you create a JIRA with repro case?

Thanks,
Xuefu

On Thu, Dec 17, 2015 at 9:21 PM, Jone Zhang  wrote:

> *My query is *
> set hive.execution.engine=spark;
> select
>
> t3.pcid,channel,version,ip,hour,app_id,app_name,app_apk,app_version,app_type,dwl_tool,dwl_status,err_type,dwl_store,dwl_maxspeed,dwl_minspeed,dwl_avgspeed,last_time,dwl_num,
> (case when t4.cnt is null then 0 else 1 end) as is_evil
> from
> (select /*+mapjoin(t2)*/
> pcid,channel,version,ip,hour,
> (case when t2.app_id is null then t1.app_id else t2.app_id end) as app_id,
> t2.name as app_name,
> app_apk,
>
> app_version,app_type,dwl_tool,dwl_status,err_type,dwl_store,dwl_maxspeed,dwl_minspeed,dwl_avgspeed,last_time,dwl_num
> from
> t_ed_soft_downloadlog_molo t1 left outer join t_rd_soft_app_pkg_name t2 on
> (lower(t1.app_apk) = lower(t2.package_id) and t1.ds = 20151217 and t2.ds =
> 20151217)
> where
> t1.ds = 20151217) t3
> left outer join
> (
> select pcid,count(1) cnt  from t_ed_soft_evillog_molo where ds=20151217
>  group by pcid
> ) t4
> on t3.pcid=t4.pcid;
>
>
> *And the error log is *
> 2015-12-18 08:10:18,685 INFO  [main]: spark.SparkMapJoinOptimizer
> (SparkMapJoinOptimizer.java:process(79)) - Check if it can be converted to
> map join
> 2015-12-18 08:10:18,686 ERROR [main]: ql.Driver
> (SessionState.java:printError(966)) - FAILED: NullPointerException null
> java.lang.NullPointerException
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getConnectedParentMapJoinSize(SparkMapJoinOptimizer.java:312)
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getConnectedMapJoinSize(SparkMapJoinOptimizer.java:292)
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getMapJoinConversionInfo(SparkMapJoinOptimizer.java:271)
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.process(SparkMapJoinOptimizer.java:80)
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkJoinOptimizer.process(SparkJoinOptimizer.java:58)
> at
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:92)
> at
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:97)
> at
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:81)
> at
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:135)
> at
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:112)
> at
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:128)
> at
> org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:102)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10238)
> at
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:210)
> at
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:233)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:425)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
> at
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1123)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1171)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1060)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1050)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:208)
> at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:160)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:447)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357)
> at
> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:795)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:767)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:704)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>
>
> *Some properties on hive-site.xml is *
> 
>hive.ignore.mapjoin.hint
>false
> 
> 
> hive.auto.convert.join
> true
> 
> 
>hive.auto.convert.join.noconditionaltask
>true
> 
>
>
> *The relevant code for the error is *
> long mjSize = ctx.getMjOpSizes().get(op);
> *I think it should be checked whether or not * ctx.getMjOpSizes().get(op) *is
> null.*
>
> *Of course, any stricter logic is up to you to decide.*
>
>
> *Thanks.*
> *Best Wishes.*
>
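
For illustration, the defensive check being suggested might look like the
following sketch (written in Scala here; the actual SparkMapJoinOptimizer code
is Java, and mjOpSizes/op below simply stand in for ctx.getMjOpSizes() and op):

// Guard against a missing entry instead of unboxing a possibly-null Long.
def mapJoinOpSize(mjOpSizes: java.util.Map[AnyRef, java.lang.Long], op: AnyRef): Long =
  Option(mjOpSizes.get(op)).map(_.longValue).getOrElse(0L)

Whether falling back to 0, skipping the map-join conversion, or failing with a
clearer message is the right behaviour is exactly the stricter logic left to
the Hive developers to decide.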


RE: hive on spark

2015-12-18 Thread Mich Talebzadeh
Hi,
 
Your statement
 
“I read that this is due to something not being compiled against the correct 
hadoop version.
my main question what is the binary/jar/file that can cause this?”
 

 

I believe this is the file in $HIVE_HOME/lib called
spark-assembly-1.3.1-hadoop2.4.0.jar, which you need to build from the Spark 1.3.1
source code, excluding the Hive jars.

 

Something like below

 

./make-distribution.sh --name "hadoop2-without-hive" --tgz 
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

 

Then extract the resulting tarball and copy the spark-assembly jar over to $HIVE_HOME/lib

 

Example

 

hive> set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;  -- This is the 
precompiled binary installation for Spark 1.3.1

hive> set hive.execution.engine=spark;

hive> set spark.master=yarn-client;

hive> select count(1) from t;

Query ID = hduser_20151218212056_4e1faef5-93bd-4e18-9375-659220d67530

Total jobs = 1

Launching Job 1 out of 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=

In order to set a constant number of reducers:

  set mapreduce.job.reduces=

Starting Spark Job = 35c78523-4a36-45e5-95f1-01052985ff4b

 

Query Hive on Spark job[0] stages:

0

1

 

Status: Running (Hive on Spark job[0])

Job Progress Format

CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]

2015-12-18 21:21:36,852 Stage-0_0: 0/256Stage-1_0: 0/1

2015-12-18 21:21:39,900 Stage-0_0: 0/256Stage-1_0: 0/1

2015-12-18 21:21:41,914 Stage-0_0: 0(+2)/256Stage-1_0: 0/1

2015-12-18 21:21:44,933 Stage-0_0: 0(+2)/256Stage-1_0: 0/1

2015-12-18 21:21:45,941 Stage-0_0: 1(+2)/256Stage-1_0: 0/1

2015-12-18 21:21:46,952 Stage-0_0: 3(+2)/256Stage-1_0: 0/1

2015-12-18 21:21:47,963 Stage-0_0: 4(+2)/256Stage-1_0: 0/1

2015-12-18 21:21:48,969 Stage-0_0: 6(+2)/256Stage-1_0: 0/1

2015-12-18 21:21:49,977 Stage-0_0: 7(+2)/256Stage-1_0: 0/1

2015-12-18 21:21:50,991 Stage-0_0: 9(+2)/256Stage-1_0: 0/1

2015-12-18 21:21:52,001 Stage-0_0: 10(+2)/256   Stage-1_0: 0/1

2015-12-18 21:21:53,013 Stage-0_0: 11(+2)/256   Stage-1_0: 0/1

2015-12-18 21:21:54,022 Stage-0_0: 13(+2)/256   Stage-1_0: 0/1

2015-12-18 21:21:55,030 Stage-0_0: 15(+2)/256   Stage-1_0: 0/1

2015-12-18 21:21:56,038 Stage-0_0: 18(+2)/256   Stage-1_0: 0/1

2015-12-18 21:21:57,053 Stage-0_0: 52(+2)/256   Stage-1_0: 0/1

2015-12-18 21:21:58,058 Stage-0_0: 90(+2)/256   Stage-1_0: 0/1

2015-12-18 21:21:59,066 Stage-0_0: 129(+2)/256  Stage-1_0: 0/1

2015-12-18 21:22:00,075 Stage-0_0: 176(+2)/256  Stage-1_0: 0/1

2015-12-18 21:22:01,083 Stage-0_0: 224(+2)/256  Stage-1_0: 0/1

2015-12-18 21:22:02,111 Stage-0_0: 256/256 Finished Stage-1_0: 0(+1)/1

2015-12-18 21:22:03,117 Stage-0_0: 256/256 Finished Stage-1_0: 1/1 Finished

Status: Finished successfully in 62.46 seconds

OK

2074897

Time taken: 66.434 seconds, Fetched: 1 row(s)

 

 

HTH

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Ophir Etzion [mailto:op...@foursquare.com] 
Sent: 18 December 2015 20:46
To: user@hive.apache.org; u...@spark.apache.org
Subject: hive on spark

 

During spark-submit when running hive on spark I get:
 
Exception in thread "main" java.util.ServiceConfigurationError: 
org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.HftpFileSystem 
could not be instantiated
 
Caused by: java.lang.IllegalAccessError: tried to access method 
org.apache.hadoop.fs.DelegationTokenRenewer.<init>(Ljava/lang/Class;)V from 
class org.apache.hadoop.hdfs.HftpFileSystem
 
I managed to make hive on spark work on a staging cluster I 

Re: Hive on Spark - Error: Child process exited before connecting back

2015-12-17 Thread Xuefu Zhang
m/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>
>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>>> 15", ISBN 978-0-9563693-0-7*.
>>>
>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>> 978-0-9759693-0-4*
>>>
>>> *Publications due shortly:*
>>>
>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>> 978-0-9563693-3-8
>>>
>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>>> one out shortly
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> NOTE: The information in this email is proprietary and confidential.
>>> This message is for the designated recipient only, if you are not the
>>> intended recipient, you should destroy it immediately. Any information in
>>> this message shall not be understood as given or endorsed by Peridale
>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>> stated. It is the responsibility of the recipient to ensure that this email
>>> is virus free, therefore neither Peridale Ltd, its subsidiaries nor their
>>> employees accept any responsibility.
>>>
>>>
>>>
>>> *From:* Ophir Etzion [mailto:op...@foursquare.com]
>>> *Sent:* 15 December 2015 22:42
>>> *To:* user@hive.apache.org
>>> *Cc:* u...@spark.apache.org
>>> *Subject:* Re: Hive on Spark - Error: Child process exited before
>>> connecting back
>>>
>>>
>>>
>>> Hi,
>>>
>>> the versions are spark 1.3.0 and hive 1.1.0 as part of cloudera 5.4.3.
>>>
>>> I find it weird that it would work only on the version you mentioned as
>>> there is documentation (not good documentation but still..) on how to do it
>>> with cloudera that packages different versions.
>>>
>>> Thanks for the answer though.
>>>
>>> why would spark 1.5.2 specifically would not work with hive?
>>>
>>>
>>>
>>> Ophir
>>>
>>>
>>>
>>> On Tue, Dec 15, 2015 at 5:33 PM, Mich Talebzadeh <m...@peridale.co.uk>
>>> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> The only version that I have managed to run Hive using Spark engine is
>>> Spark 1.3.1 on Hive 1.2.1
>>>
>>>
>>>
>>> Can you confirm the version of Spark you are running?
>>>
>>>
>>>
>>> FYI, Spark 1.5.2 will not work with Hive.
>>>
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>> Mich Talebzadeh
>>>
>>>
>>>
>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>
>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>
>>>
>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>
>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>>> 15", ISBN 978-0-9563693-0-7*.
>>>
>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>> 978-0-9759693-0-4*
>>>
>>> *Publications due shortly:*
>>>
>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>> 978-0-9563693-3-8
>>>
>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>>> one out shortly
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> NOTE: The information in this email is proprietary and confidential.
>>> This message is for the designated recipient only, if you are not the
>>> intended recipient, you should destroy it immediately. Any information in
>>> this message shall not be understood as given or endorsed by Peridale
>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>> stated. It is the responsibility of the recipient to ensure that this email
>>> is virus free, therefore neither Peridale Ltd, its subsidiaries nor their
>>> employees accept any responsibility.
>>>
>>>
>>>
>>> *From:* Ophir Etzion [mailto:op...@foursquare.com]
>>> *Sent:* 15 December 2015 22:27
>>> *To:* u...@spark.apache.org; user@hive.apache.org
>>> *Subject:* Hive on Spark - Error: Child process exit

Re: Hive on Spark - Error: Child process exited before connecting back

2015-12-15 Thread Ophir Etzion
Hi,

the versions are spark 1.3.0 and hive 1.1.0 as part of cloudera 5.4.3.

I find it weird that it would work only on the version you mentioned, as
there is documentation (not good documentation, but still..) on how to do it
with the Cloudera distribution, which packages different versions.

Thanks for the answer though.

Why would Spark 1.5.2 specifically not work with Hive?

Ophir

On Tue, Dec 15, 2015 at 5:33 PM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> The only version that I have managed to run Hive using Spark engine is
> Spark 1.3.1 on Hive 1.2.1
>
>
>
> Can you confirm the version of Spark you are running?
>
>
>
> FYI, Spark 1.5.2 will not work with Hive.
>
>
>
> HTH
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Ophir Etzion [mailto:op...@foursquare.com]
> *Sent:* 15 December 2015 22:27
> *To:* u...@spark.apache.org; user@hive.apache.org
> *Subject:* Hive on Spark - Error: Child process exited before connecting
> back
>
>
>
> Hi,
>
>
>
> when trying to do Hive on Spark on CDH5.4.3 I get the following error when
> trying to run a simple query using spark.
>
> I've tried setting everything written here (
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
> as well as what the cdh recommends.
>
> any one encountered this as well? (searching for it didn't help much)
>
> the error:
>
> ERROR : Failed to execute spark task, with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
> client.)'
>
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
>
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)
>
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
>
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:120)
>
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
>
> at
> org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
>
> at
> org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1640)
>
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1399)
>
> at
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1183)
>
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
>
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1044)
>
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)
>
> at
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
>
> at
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>
> at
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
>
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.lang.RuntimeException:
> 

RE: Hive on Spark - Error: Child process exited before connecting back

2015-12-15 Thread Mich Talebzadeh
Hi,

 

The only version that I have managed to run Hive using Spark engine is Spark 
1.3.1 on Hive 1.2.1

 

Can you confirm the version of Spark you are running?

 

FYI, Spark 1.5.2 will not work with Hive.

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Ophir Etzion [mailto:op...@foursquare.com] 
Sent: 15 December 2015 22:27
To: u...@spark.apache.org; user@hive.apache.org
Subject: Hive on Spark - Error: Child process exited before connecting back

 

Hi,

 

when trying to do Hive on Spark on CDH5.4.3 I get the following error when 
trying to run a simple query using spark.

I've tried setting everything written here 
(https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
as well as what CDH recommends.

Has anyone encountered this as well? (Searching for it didn't help much.)

the error:

ERROR : Failed to execute spark task, with exception 
'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark 
client.)'

org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.

at 
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)

at 
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)

at 
org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:120)

at 
org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)

at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)

at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)

at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1640)

at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1399)

at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1183)

at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)

at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1044)

at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)

at 
org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)

at 
org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)

at 
org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)

at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)

at java.util.concurrent.FutureTask.run(FutureTask.java:262)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: Cancel client 
'2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited before 
connecting back

at com.google.common.base.Throwables.propagate(Throwables.java:156)

at 
org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:109)

at 
org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)

at 
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.(RemoteHiveSparkClient.java:91)

at 

RE: Hive on Spark - Error: Child process exited before connecting back

2015-12-15 Thread Mich Talebzadeh
To answer your point:

 

“why would spark 1.5.2 specifically would not work with hive?”

 

Because I tried Spark 1.5.2 and it did not work, and unfortunately the only 
version that seems to work (albeit it requires some messing around) is version 
1.3.1 of Spark.

 

Look at the threads on “Managed to make Hive run on Spark engine” in 
user@hive.apache.org <mailto:user@hive.apache.org> 

 

 

HTH,

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Ophir Etzion [mailto:op...@foursquare.com] 
Sent: 15 December 2015 22:42
To: user@hive.apache.org
Cc: u...@spark.apache.org
Subject: Re: Hive on Spark - Error: Child process exited before connecting back

 

Hi,

the versions are spark 1.3.0 and hive 1.1.0 as part of cloudera 5.4.3.

I find it weird that it would work only on the version you mentioned, as there 
is documentation (not good documentation, but still..) on how to do it with 
the Cloudera distribution, which packages different versions.

Thanks for the answer though.

Why would Spark 1.5.2 specifically not work with Hive?

 

Ophir

 

On Tue, Dec 15, 2015 at 5:33 PM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Hi,

 

The only version that I have managed to run Hive using Spark engine is Spark 
1.3.1 on Hive 1.2.1

 

Can you confirm the version of Spark you are running?

 

FYI, Spark 1.5.2 will not work with Hive.

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Ophir Etzion [mailto:op...@foursquare.com <mailto:op...@foursquare.com> ] 
Sent: 15 December 2015 22:27
To: u...@spark.apache.org <mailto:u...@spark.apache.org> ; user@hive.apache.org 
<mailto:user@hive.apache.org> 
Subject: Hive on Spark - Error: Child process exited before connecting back

 

Hi,

 

when trying to do Hive on Spark on CDH5.4.3 I get the following error when 
trying to run a simple query using spark.

I've tried setting everything written here 
(https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
as well as what CDH recommends.

Has anyone encountered this as well? (Searching for it didn't help much.)

the error:

ERROR : Failed to execute spark task, with exception 
'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark 
client.)'

org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.

at 
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)

at 
org.apache.hadoop.hive.q

Re: Hive on Spark - Error: Child process exited before connecting back

2015-12-15 Thread Xuefu Zhang
Ophir,

Can you provide your hive.log here? Also, have you checked your spark
application log?

When this happens, it usually means that Hive is not able to launch a
Spark application. In the case of Spark on YARN, this application is the
application master. If Hive fails to launch it, or the application master
fails before it can connect back, you will see such error messages. To get
more information, you should check the Spark application log.

--Xuefu
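
As a rough sketch of the knobs involved (property names per the Hive on Spark
documentation of this era; the values are illustrative only), the handshake
timeouts between the Hive client and the remote Spark driver can be raised per
session before re-running the query, and the YARN-side application master log
can be pulled with "yarn logs -applicationId <appId>" once the id is known:

set hive.execution.engine=spark;
-- handshake timeout between the Hive client and the remote Spark driver,
-- checked by both processes (illustrative value)
set hive.spark.client.server.connect.timeout=300000ms;
-- timeout for the remote Spark driver connecting back to the Hive client
-- (illustrative value)
set hive.spark.client.connect.timeout=30000ms;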

On Tue, Dec 15, 2015 at 2:26 PM, Ophir Etzion  wrote:

> Hi,
>
> when trying to do Hive on Spark on CDH5.4.3 I get the following error when
> trying to run a simple query using spark.
>
> I've tried setting everything written here (
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
> as well as what the cdh recommends.
>
> any one encountered this as well? (searching for it didn't help much)
>
> the error:
>
> ERROR : Failed to execute spark task, with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
> client.)'
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:120)
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1640)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1399)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1183)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1044)
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)
> at
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
> at
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> at
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException:
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cancel
> client '2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited
> before connecting back
> at com.google.common.base.Throwables.propagate(Throwables.java:156)
> at
> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:109)
> at
> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
> at
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.(RemoteHiveSparkClient.java:91)
> at
> org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:65)
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:55)
> ... 22 more
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.RuntimeException: Cancel client
> '2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited before
> connecting back
> at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> at
> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:99)
> ... 26 more
> Caused by: java.lang.RuntimeException: Cancel client
> '2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited before
> connecting back
> at
> org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:179)
> at
> org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:427)
> ... 1 more
>
> ERROR : Failed to execute spark task, with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
> client.)'
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:120)
> at
> 

Re: Hive on Spark application will be submited more times when the queue resources is not enough.

2015-12-09 Thread Jone Zhang
Hive version is 1.2.1
Spark version is 1.4.1
Hadoop version is 2.5.1

The application application_1448873753366_121062 mentioned in the above mail
will succeed.

But in some cases all of the applications will fail because the SparkContext
did not initialize after waiting for 15 ms.
See attachment (hive.spark.client.server.connect.timeout is set to 5min).

Thanks.
Best wishes.
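
For what it's worth, the "SparkContext did not initialize after waiting for
..." message in the attached AM log is governed by the Spark-on-YARN
application master's own wait time, which is separate from the Hive-side RSC
timeout already raised to 5min here. In a Hive on Spark session the spark.*
property can be passed through like any other spark setting; a minimal sketch,
assuming the Spark 1.4-era property name and an illustrative value:

set hive.execution.engine=spark;
-- how long the YARN AM waits for the user code to create the SparkContext
-- (illustrative value; the default in this Spark generation is much shorter)
set spark.yarn.am.waitTime=300s;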

2015-12-09 17:56 GMT+08:00 Jone Zhang :

> *Hi, Xuefu:*
>
> *See attachment 1*
> *When the queue resources is not enough.*
> *The application application_1448873753366_121022 will pending.*
> *Two minutes later, the application application_1448873753366_121055 will
> be submited and pending.*
> *And then application_1448873753366_121062.*
>
> *See attachment 2*
> *When the queue resources is free.*
> *The application  application_1448873753366_121062 begin to running.*
> *Application_1448873753366_121022 and application_1448873753366_121055
>  will failed fast.*
>
> *Logs of Application_1448873753366_121022 as follows(same as *
> *application_1448873753366_121055**):*
> Container: container_1448873753366_121022_03_01 on 10.226.136.122_8041
>
> 
> LogType: stderr
> LogLength: 4664
> Log Contents:
> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
> in the future
> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
> in the future
> 15/12/09 16:29:45 INFO yarn.ApplicationMaster: Registered signal handlers
> for [TERM, HUP, INT]
> 15/12/09 16:29:46 INFO yarn.ApplicationMaster: ApplicationAttemptId:
> appattempt_1448873753366_121022_03
> 15/12/09 16:29:47 INFO spark.SecurityManager: Changing view acls to: mqq
> 15/12/09 16:29:47 INFO spark.SecurityManager: Changing modify acls to: mqq
> 15/12/09 16:29:47 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions:
> Set(mqq); users with modify permissions: Set(mqq)
> 15/12/09 16:29:47 INFO yarn.ApplicationMaster: Starting the user
> application in a separate Thread
> 15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization
> 15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization ...
> 15/12/09 16:29:47 INFO client.RemoteDriver: Connecting to:
> 10.179.12.140:38842
> 15/12/09 16:29:48 WARN rpc.Rpc: Invalid log level null, reverting to
> default.
> 15/12/09 16:29:48 ERROR yarn.ApplicationMaster: User class threw
> exception: java.util.concurrent.ExecutionException:
> javax.security.sasl.SaslException: Client closed before SASL negotiation
> finished.
> java.util.concurrent.ExecutionException:
> javax.security.sasl.SaslException: Client closed before SASL negotiation
> finished.
> at
> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> at
> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:156)
> at
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483)
> Caused by: javax.security.sasl.SaslException: Client closed before SASL
> negotiation finished.
> at
> org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:449)
> at
> org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
> at
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
> at
> org.apache.hive.spark.client.rpc.KryoMessageCodec.channelInactive(KryoMessageCodec.java:127)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
> at
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
> at
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:769)
> at
> 

Re: Hive on Spark application will be submited more times when the queue resources is not enough.

2015-12-09 Thread Jone Zhang
>
> But in some cases all of the applications will fail which caused
> by SparkContext did not initialize after waiting for 15 ms.
> See attchment (hive.spark.client.server.connect.timeout is set to 5min).


*The error log is different from the one in the original mail*

Container: container_1448873753366_113453_01_01 on 10.247.169.134_8041

LogType: stderr
LogLength: 3302
Log Contents:
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
in the future
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
in the future
15/12/09 02:11:48 INFO yarn.ApplicationMaster: Registered signal handlers
for [TERM, HUP, INT]
15/12/09 02:11:48 INFO yarn.ApplicationMaster: ApplicationAttemptId:
appattempt_1448873753366_113453_01
15/12/09 02:11:49 INFO spark.SecurityManager: Changing view acls to: mqq
15/12/09 02:11:49 INFO spark.SecurityManager: Changing modify acls to: mqq
15/12/09 02:11:49 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(mqq); users with modify permissions: Set(mqq)
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Starting the user
application in a separate Thread
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context
initialization
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context
initialization ...
15/12/09 02:11:49 INFO client.RemoteDriver: Connecting to:
10.179.12.140:58013
15/12/09 02:11:49 ERROR yarn.ApplicationMaster: User class threw exception:
java.util.concurrent.ExecutionException: java.net.ConnectException:
Connection refused: /10.179.12.140:58013
java.util.concurrent.ExecutionException: java.net.ConnectException:
Connection refused: /10.179.12.140:58013
at
io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
at
org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:156)
at
org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483)
Caused by: java.net.ConnectException: Connection refused: /
10.179.12.140:58013
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
java.util.concurrent.ExecutionException: java.net.ConnectException:
Connection refused: /10.179.12.140:58013)
15/12/09 02:11:59 ERROR yarn.ApplicationMaster: SparkContext did not
initialize after waiting for 15 ms. Please check earlier log output for
errors. Failing the application.
15/12/09 02:11:59 INFO util.Utils: Shutdown hook called

2015-12-09 19:22 GMT+08:00 Jone Zhang :

> Hive version is 1.2.1
> Spark version is 1.4.1
> Hadoop version is 2.5.1
>
> The application_1448873753366_121062 will success in the above mail.
>
> But in some cases all of the applications will fail which caused by 
> SparkContext
> did not initialize after waiting for 15 ms.
> See attchment (hive.spark.client.server.connect.timeout is set to 5min).
>
> Thanks.
> Best wishes.
>
> 2015-12-09 17:56 GMT+08:00 Jone Zhang :
>
>> *Hi, Xuefu:*
>>
>> *See attachment 1*
>> *When the queue resources is not enough.*
>> *The application application_1448873753366_121022 will pending.*
>> *Two minutes later, the application application_1448873753366_121055 will
>> be submited and pending.*
>> *And then application_1448873753366_121062.*
>>
>> *See attachment 2*
>> *When the queue resources is free.*
>> *The application  application_1448873753366_121062 begin to running.*
>> *Application_1448873753366_121022 and application_1448873753366_121055
>>  will failed fast.*
>>
>> *Logs of Application_1448873753366_121022 as 

Re: Hive on Spark application will be submited more times when the queue resources is not enough.

2015-12-09 Thread Xuefu Zhang
Hi Jone,

Thanks for reporting the problem. When you say there is not enough resource,
do you mean that you cannot launch YARN application masters?

I feel that we should error out right away if the application cannot be
submitted. Any attempt to resubmit seems problematic. I'm not sure if
there is such control over this, but I think that's a good direction to
look at. I will check with our Spark expert on this.

Thanks,
Xuefu

On Wed, Dec 9, 2015 at 8:48 PM, Jone Zhang  wrote:

> *It seems that the submit number depend on stage of the query.*
> *This query include three stages.*
>
> If queue resources is still *not enough after submit threee applications,** 
> Hive
> client will close.*
> *"**Failed to execute spark task, with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
> client.)'*
> *FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.spark.SparkTask**"*
> *And this time, the port(eg **34682**)  kill in hive client(eg *
> *10.179.12.140**) use to **communicate with RSC **will  lost.*
>
> *The reources of queue is free **after awhile, the AM of three
> applications will fast fail beacause of "**15/12/10 12:28:43 INFO
> client.RemoteDriver: Connecting to:
> 10.179.12.140:34682...java.net.ConnectException: Connection refused:
> /10.179.12.140:34682 **"*
>
> *So, The application will fail if the queue resources if not **enough at
> point of the query be submited, even if the resources is free **after
> awhile.*
> *Do you have more idea about this question?*
>
> *Attch the query*
> set hive.execution.engine=spark;
> set spark.yarn.queue=tms;
> set spark.app.name=t_ad_tms_heartbeat_ok_3;
> insert overwrite table t_ad_tms_heartbeat_ok partition(ds=20151208)
> SELECT
> NVL(a.qimei, b.qimei) AS qimei,
> NVL(b.first_ip,a.user_ip) AS first_ip,
> NVL(a.user_ip, b.last_ip) AS last_ip,
> NVL(b.first_date, a.ds) AS first_date,
> NVL(a.ds, b.last_date) AS last_date,
> NVL(b.first_chid, a.chid) AS first_chid,
> NVL(a.chid, b.last_chid) AS last_chid,
> NVL(b.first_lc, a.lc) AS first_lc,
> NVL(a.lc, b.last_lc) AS last_lc,
> NVL(a.guid, b.guid) AS guid,
> NVL(a.sn, b.sn) AS sn,
> NVL(a.vn, b.vn) AS vn,
> NVL(a.vc, b.vc) AS vc,
> NVL(a.mo, b.mo) AS mo,
> NVL(a.rl, b.rl) AS rl,
> NVL(a.os, b.os) AS os,
> NVL(a.rv, b.rv) AS rv,
> NVL(a.qv, b.qv) AS qv,
> NVL(a.imei, b.imei) AS imei,
> NVL(a.romid, b.romid) AS romid,
> NVL(a.bn, b.bn) AS bn,
> NVL(a.account_type, b.account_type) AS
> account_type,
> NVL(a.account, b.account) AS account
> FROM
> (SELECT
> ds,user_ip,guid,sn,vn,vc,mo,rl,chid,lcid,os,rv,qv,imei,qimei,lc,romid,bn,account_type,account
> FROMt_od_tms_heartbeat_ok
> WHERE   ds = 20151208) a
> FULL OUTER JOIN
> (SELECT
> qimei,first_ip,last_ip,first_date,last_date,first_chid,last_chid,first_lc,last_lc,guid,sn,vn,vc,mo,rl,os,rv,qv,imei,romid,bn,account_type,account
> FROMt_ad_tms_heartbeat_ok
> WHERE   last_date > 20150611
> AND ds = 20151207) b
> ON   a.qimei=b.qimei;
>
> *Thanks.*
> *Best wishes.*
>
> 2015-12-09 19:51 GMT+08:00 Jone Zhang :
>
>> But in some cases all of the applications will fail which caused
>>> by SparkContext did not initialize after waiting for 15 ms.
>>> See attchment (hive.spark.client.server.connect.timeout is set to 5min).
>>
>>
>> *The error log is different  from original mail*
>>
>> Container: container_1448873753366_113453_01_01 on 10.247.169.134_8041
>>
>> 
>> LogType: stderr
>> LogLength: 3302
>> Log Contents:
>> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
>> in the future
>> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
>> in the future
>> 15/12/09 02:11:48 INFO yarn.ApplicationMaster: Registered signal handlers
>> for [TERM, HUP, INT]
>> 15/12/09 02:11:48 INFO yarn.ApplicationMaster: ApplicationAttemptId:
>> appattempt_1448873753366_113453_01
>> 15/12/09 02:11:49 INFO spark.SecurityManager: Changing view acls to: mqq
>> 15/12/09 02:11:49 INFO spark.SecurityManager: Changing modify acls to: mqq
>> 15/12/09 02:11:49 INFO spark.SecurityManager: SecurityManager:
>> authentication disabled; ui acls disabled; users with view permissions:
>> Set(mqq); users with modify permissions: Set(mqq)
>> 15/12/09 02:11:49 INFO yarn.ApplicationMaster: Starting the user
>> application in a separate Thread
>> 15/12/09 02:11:49 INFO 

Re: Hive on Spark application will be submited more times when the queue resources is not enough.

2015-12-09 Thread Jone Zhang
It seems that the number of submissions depends on the number of stages in the
query.
This query includes three stages.

If queue resources are still not enough after submitting three applications,
the Hive client will close with:
"Failed to execute spark task, with exception
'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
client.)'
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.spark.SparkTask"
And at this point, the port (e.g. 34682) opened in the Hive client (e.g.
10.179.12.140) and used to communicate with the RSC will be lost.

When the queue resources become free after a while, the AMs of the three
applications will fail fast because of "15/12/10 12:28:43 INFO
client.RemoteDriver: Connecting to:
10.179.12.140:34682...java.net.ConnectException: Connection
refused: /10.179.12.140:34682"

So the application will fail if the queue resources are not enough at the
point when the query is submitted, even if the resources become free after a
while.
Do you have more ideas about this question?

Attaching the query:
set hive.execution.engine=spark;
set spark.yarn.queue=tms;
set spark.app.name=t_ad_tms_heartbeat_ok_3;
insert overwrite table t_ad_tms_heartbeat_ok partition(ds=20151208)
SELECT
NVL(a.qimei, b.qimei) AS qimei,
NVL(b.first_ip,a.user_ip) AS first_ip,
NVL(a.user_ip, b.last_ip) AS last_ip,
NVL(b.first_date, a.ds) AS first_date,
NVL(a.ds, b.last_date) AS last_date,
NVL(b.first_chid, a.chid) AS first_chid,
NVL(a.chid, b.last_chid) AS last_chid,
NVL(b.first_lc, a.lc) AS first_lc,
NVL(a.lc, b.last_lc) AS last_lc,
NVL(a.guid, b.guid) AS guid,
NVL(a.sn, b.sn) AS sn,
NVL(a.vn, b.vn) AS vn,
NVL(a.vc, b.vc) AS vc,
NVL(a.mo, b.mo) AS mo,
NVL(a.rl, b.rl) AS rl,
NVL(a.os, b.os) AS os,
NVL(a.rv, b.rv) AS rv,
NVL(a.qv, b.qv) AS qv,
NVL(a.imei, b.imei) AS imei,
NVL(a.romid, b.romid) AS romid,
NVL(a.bn, b.bn) AS bn,
NVL(a.account_type, b.account_type) AS account_type,
NVL(a.account, b.account) AS account
FROM
(SELECT
ds,user_ip,guid,sn,vn,vc,mo,rl,chid,lcid,os,rv,qv,imei,qimei,lc,romid,bn,account_type,account
FROMt_od_tms_heartbeat_ok
WHERE   ds = 20151208) a
FULL OUTER JOIN
(SELECT
qimei,first_ip,last_ip,first_date,last_date,first_chid,last_chid,first_lc,last_lc,guid,sn,vn,vc,mo,rl,os,rv,qv,imei,romid,bn,account_type,account
FROMt_ad_tms_heartbeat_ok
WHERE   last_date > 20150611
AND ds = 20151207) b
ON   a.qimei=b.qimei;

*Thanks.*
*Best wishes.*

2015-12-09 19:51 GMT+08:00 Jone Zhang :

> But in some cases all of the applications will fail which caused
>> by SparkContext did not initialize after waiting for 15 ms.
>> See attchment (hive.spark.client.server.connect.timeout is set to 5min).
>
>
> *The error log is different  from original mail*
>
> Container: container_1448873753366_113453_01_01 on 10.247.169.134_8041
>
> 
> LogType: stderr
> LogLength: 3302
> Log Contents:
> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
> in the future
> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
> in the future
> 15/12/09 02:11:48 INFO yarn.ApplicationMaster: Registered signal handlers
> for [TERM, HUP, INT]
> 15/12/09 02:11:48 INFO yarn.ApplicationMaster: ApplicationAttemptId:
> appattempt_1448873753366_113453_01
> 15/12/09 02:11:49 INFO spark.SecurityManager: Changing view acls to: mqq
> 15/12/09 02:11:49 INFO spark.SecurityManager: Changing modify acls to: mqq
> 15/12/09 02:11:49 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions:
> Set(mqq); users with modify permissions: Set(mqq)
> 15/12/09 02:11:49 INFO yarn.ApplicationMaster: Starting the user
> application in a separate Thread
> 15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization
> 15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization ...
> 15/12/09 02:11:49 INFO client.RemoteDriver: Connecting to:
> 10.179.12.140:58013
> 15/12/09 02:11:49 ERROR yarn.ApplicationMaster: User class threw
> exception: java.util.concurrent.ExecutionException:
> java.net.ConnectException: Connection refused: /10.179.12.140:58013
> java.util.concurrent.ExecutionException: java.net.ConnectException:
> Connection refused: /10.179.12.140:58013
> at
> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> at
> 

RE: Hive on spark table caching

2015-12-02 Thread Mich Talebzadeh
Hi,

 

Which version of spark are you using please?

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Udit Mehta [mailto:ume...@groupon.com] 
Sent: 02 December 2015 23:01
To: user@hive.apache.org
Subject: Hive on spark table caching

 

Hi,

I have started using Hive on Spark recently and am exploring the benefits it 
offers. I was wondering if Hive on Spark has capabilities to cache tables like 
Spark SQL. Or does it do any form of implicit caching in the long-running job 
which it starts after running the first query? 

Thanks,

Udit



RE: Hive on spark table caching

2015-12-02 Thread Mich Talebzadeh
OK

 

How did you build your Spark 1.3? Was that from the source code or the 
pre-built package for Hadoop 2.6, please?

 

The one I have 

 

1.Spark version 1.5.2

2.Hive version 1.2.1

3.Hadoop version 2.6

 

Does not work with Hive on Spark 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Udit Mehta [mailto:ume...@groupon.com] 
Sent: 02 December 2015 23:43
To: user@hive.apache.org
Subject: Re: Hive on spark table caching

 

I'm using Spark 1.3 with Hive 1.2.1. I don't mind using a version of Spark higher 
than that, but I read somewhere that 1.3 is the version of Spark currently 
supported by Hive. Can I use Spark 1.4 or 1.5 with Hive 1.2.1?

 

On Wed, Dec 2, 2015 at 3:19 PM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Hi,

 

Which version of spark are you using please?

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
 

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com
 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Udit Mehta [mailto:ume...@groupon.com <mailto:ume...@groupon.com> ] 
Sent: 02 December 2015 23:01
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Hive on spark table caching

 

Hi,

I have started using Hive on Spark recently and am exploring the benefits it 
offers. I was wondering if Hive on Spark has capabilities to cache tables like 
Spark SQL. Or does it do any form of implicit caching in the long-running job 
which it starts after running the first query? 

Thanks,

Udit

 



Re: Hive on spark table caching

2015-12-02 Thread Udit Mehta
I'm using Spark 1.3 with Hive 1.2.1. I don't mind using a version of Spark
higher than that, but I read somewhere that 1.3 is the version of Spark
currently supported by Hive. Can I use Spark 1.4 or 1.5 with Hive 1.2.1?

On Wed, Dec 2, 2015 at 3:19 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> Which version of spark are you using please?
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> 
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
> 
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Udit Mehta [mailto:ume...@groupon.com]
> *Sent:* 02 December 2015 23:01
> *To:* user@hive.apache.org
> *Subject:* Hive on spark table caching
>
>
>
> Hi,
>
> I have started using Hive on Spark recently and am exploring the benefits
> it offers. I was wondering if Hive on Spark has capabilities to cache table
> like Spark SQL. Or does it do any form of implicit caching in the long
> running job which it starts after running the first query?
>
> Thanks,
>
> Udit
>


Re: Hive on spark table caching

2015-12-02 Thread Xuefu Zhang
Depending on the query, Hive on Spark does implicitly cache datasets (not
necessarily the input tables) for performance benefits. Such queries
include multi-insert, self-join, self-union, etc. However, no caching
happens across queries at this time, which may be improved in the future.

Thanks,
Xuefu
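
To make the multi-insert case concrete, here is a hypothetical query shape
(table and column names are invented for illustration): the source table is
scanned once and the scanned dataset is reused by both inserts within the same
Hive on Spark query, which is where the implicit caching described above
applies:

FROM web_logs
INSERT OVERWRITE TABLE hits_by_page
  SELECT page, count(*) GROUP BY page
INSERT OVERWRITE TABLE hits_by_user
  SELECT user_id, count(*) GROUP BY user_id;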

On Wed, Dec 2, 2015 at 3:00 PM, Udit Mehta  wrote:

> Hi,
>
> I have started using Hive on Spark recently and am exploring the benefits
> it offers. I was wondering if Hive on Spark has capabilities to cache table
> like Spark SQL. Or does it do any form of implicit caching in the long
> running job which it starts after running the first query?
>
> Thanks,
> Udit
>


Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-26 Thread Dasun Hegoda
>>>
>>>>   Expects one of [mr, tez, spark].
>>>>
>>>>   Chooses execution engine. Options are: mr (Map reduce, default)
>>>> or tez (hadoop 2 only)
>>>>
>>>> 
>>>>
>>>>   
>>>>
>>>>
>>>>
>>>> 
>>>>
>>>>  spark.eventLog.enabled
>>>>
>>>> *true*
>>>>
>>>> 
>>>>
>>>>Spark event log setting
>>>>
>>>> 
>>>>
>>>>   
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>
>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>
>>>>
>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>
>>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase
>>>> ASE 15", ISBN 978-0-9563693-0-7*.
>>>>
>>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>>> 978-0-9759693-0-4*
>>>>
>>>> *Publications due shortly:*
>>>>
>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>>> 978-0-9563693-3-8
>>>>
>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, 
>>>> volume
>>>> one out shortly
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> NOTE: The information in this email is proprietary and confidential.
>>>> This message is for the designated recipient only, if you are not the
>>>> intended recipient, you should destroy it immediately. Any information in
>>>> this message shall not be understood as given or endorsed by Peridale
>>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>>> stated. It is the responsibility of the recipient to ensure that this email
>>>> is virus free, therefore neither Peridale Ltd, its subsidiaries nor their
>>>> employees accept any responsibility.
>>>>
>>>>
>>>>
>>>> *From:* Dasun Hegoda [mailto:dasunheg...@gmail.com]
>>>> *Sent:* 23 November 2015 10:40
>>>>
>>>> *To:* user@hive.apache.org
>>>> *Subject:* Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu
>>>>
>>>>
>>>>
>>>> Thank you very much. This is very informative. Do you know how to set
>>>> these in hive-site.xml?
>>>>
>>>>
>>>>
>>>> hive> set spark.master=
>>>>
>>>> hive> set spark.eventLog.enabled=true;
>>>>
>>>> hive> set spark.eventLog.dir=
>>>>
>>>> hive> set spark.executor.memory=512m;
>>>>
>>>> hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
>>>>
>>>>
>>>>
>>>> If these set these in hive-site I think we will be able to get through
>>>>
>>>>
>>>>
>>>> On Mon, Nov 23, 2015 at 3:05 PM, Mich Talebzadeh <m...@peridale.co.uk>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I am looking at the set up here
>>>>
>>>>
>>>>
>>>>
>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>>>> .
>>>>
>>>>
>>>>
>>>> First this is about configuration of Hive to work with Spark. These are
>>>> my understanding
>>>>
>>>>
>>>>
>>>> 1.Hive uses Yarn as its resource manager regardless
>>>>
>>>> 2.Hive uses MapReduce as its execution engine by default
>>>>
>>>> 3.Changing the execution engine to that of Spark at the
>>>> configuration level. If you look at Hive configuration file ->
>>>>  $HIVE_HOME/conf/hive-site.xml, you will see that default is mr MapReduce
>>>>
>>>> 
>>>>
>>>> hive.execution.engine
>>>>
>>>> *mr*
>>>>
>>>> 
>>>>
>
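
For completeness, the spark.* values asked about above can also be placed in
hive-site.xml alongside hive.execution.engine, using the same property format
shown in this thread. A sketch with placeholder values only (the master URL,
event-log directory and executor memory are illustrative and must match the
actual cluster):

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.master</name>
  <!-- yarn-cluster / yarn-client for YARN, or spark://host:7077 for a
       standalone master; illustrative -->
  <value>yarn-cluster</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>hdfs:///tmp/spark-events</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>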

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-23 Thread Dasun Hegoda
:7077*
>
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>
> log4j:WARN Please initialize the log4j system properly.
>
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
>
> Using Spark's repl log4j profile:
> org/apache/spark/log4j-defaults-repl.properties
>
> To adjust logging level use sc.setLogLevel("INFO")
>
> Welcome to
>
>     __
>
>  / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>
>   /_/
>
>
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.7.0_25)
>
> Type in expressions to have them evaluated.
>
> Type :help for more information.
>
> 15/11/23 09:33:56 WARN Utils: Your hostname, rhes564 resolves to a
> loopback address: 127.0.0.1; using 50.140.197.217 instead (on interface
> eth0)
>
> 15/11/23 09:33:56 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
> another address
>
> 15/11/23 09:33:57 WARN MetricsSystem: Using default name DAGScheduler for
> source because spark.app.id is not set.
>
> Spark context available as sc.
>
> 15/11/23 09:34:00 WARN HiveConf: HiveConf of name
> hive.server2.thrift.http.min.worker.threads does not exist
>
> 15/11/23 09:34:00 WARN HiveConf: HiveConf of name
> hive.mapjoin.optimized.keys does not exist
>
> 15/11/23 09:34:00 WARN HiveConf: HiveConf of name
> hive.mapjoin.lazy.hashtable does not exist
>
> 15/11/23 09:34:00 WARN HiveConf: HiveConf of name
> hive.server2.thrift.http.max.worker.threads does not exist
>
> 15/11/23 09:34:00 WARN HiveConf: HiveConf of name
> hive.server2.logging.operation.verbose does not exist
>
> 15/11/23 09:34:00 WARN HiveConf: HiveConf of name
> hive.optimize.multigroupby.common.distincts does not exist
>
> *java.lang.RuntimeException: java.lang.RuntimeException: The root scratch
> dir: /tmp/hive on HDFS should be writable. Current permissions are:
> rwx--*
>
>
>
> That is where I am now and I have reported this spark user group but no
> luck yet.
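
A commonly cited workaround for that particular scratch-directory error
(assuming /tmp/hive really is the configured hive.exec.scratchdir and there are
no stricter security requirements) is simply to widen the HDFS permissions on
it, which can even be done from the Hive prompt itself:

-- blunt fix often suggested for the "root scratch dir ... should be writable"
-- error; a tighter mode than 777 may be preferable on a secured cluster
dfs -chmod -R 777 /tmp/hive;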
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Dasun Hegoda [mailto:dasunheg...@gmail.com]
> *Sent:* 23 November 2015 07:05
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu
>
>
>
> Anyone
>
>
>
> On Sat, Nov 21, 2015 at 1:32 PM, Dasun Hegoda <dasunheg...@gmail.com>
> wrote:
>
> Thank you very much but I would like to do the integration of these
> components myself rather than using a packaged distribution. I think I have
> come to right place. Can you please kindly tell me the configuration
> steps run Hive on Spark?
>
>
>
> At least someone please elaborate these steps.
>
>
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> .
>
>
>
> Because at the latter part of the guide configurations are set in the
> Hive runtime shell which is not permanent according to my knowledge.
>
>
>
> Please help me to get this done. Also I'm planning write a detailed guide
> with configuration steps to run Hive on Spark. So others can benefited from
> it and not troubled like me.
>
>
>
> Can someone please kindly tell me the configuration steps run Hive on
> Spark?
>
>
>
>
>
> On Sat, Nov 21, 2015 at 12:28

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-23 Thread Dasun Hegoda
$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at
org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:95)
at
org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:121)
at
org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:132)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
at
org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:124)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/11/23 05:56:38 INFO storage.DiskBlockManager: Shutdown hook called
15/11/23 05:56:38 INFO util.ShutdownHookManager: Shutdown hook called
15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
/tmp/spark-2413b536-c845-4964-a96d-973e5ec02593/httpd-311975ea-ac22-493d-8fd5-0f48b562a9a5
15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
/tmp/spark-8fefb39a-09b5-443c-b7b4-9c54bce6e245
15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
/tmp/spark-2413b536-c845-4964-a96d-973e5ec02593/userFiles-b593fc93-c23a-4a9e-aede-ed051f149fcb
15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
/tmp/spark-2413b536-c845-4964-a96d-973e5ec02593
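
The RejectedExecutionException above generally means the client gave up after failing to register with the standalone master at spark://master:7077. A quick way to separate a Spark-side problem from a Hive-side one (a diagnostic sketch, not from this thread; host, port and SPARK_HOME are assumptions):

# is the standalone master process up and serving its web UI?
jps | grep -i master
curl -s http://master:8080 | grep -i "spark master"

# can a plain Spark shell register against the same URL Hive is configured with?
$SPARK_HOME/bin/spark-shell --master spark://master:7077

If spark-shell cannot register either, the issue is in the standalone deployment itself (wrong host/port, or a Spark build that does not match the assembly jar on Hive's classpath) rather than in the Hive configuration.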

On Mon, Nov 23, 2015 at 4:19 PM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> As the example shows, these are all set in hive-site.xml
>
>
>
> <property>
>   <name>hive.execution.engine</name>
>   <value>spark</value>
>   <description>
>     Expects one of [mr, tez, spark].
>     Chooses execution engine. Options are: mr (Map reduce, default) or tez (hadoop 2 only)
>   </description>
> </property>
>
> <property>
>   <name>spark.eventLog.enabled</name>
>   <value>true</value>
>   <description>Spark event log setting</description>
> </property>
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
>
> *From:* Dasun Hegoda [mailto:dasunheg...@gmail.com]
> *Sent:* 23 November 2015 10:40
>
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu
>
>
>
> Thank you very much. This is very informative. Do you know how to set
> these in hive-site.xml?
>
>
>
> hive> set spark.master=
>
> hive> set spark.eventLog.enabled=true;
>
> hive> set spark.eventLog.dir=
>
> hive> set spark.executor.memory=512m;
>
> hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
>
>
>
> If we set these in hive-site.xml I think we will be able to get through
>
>
>
> On Mon, Nov 23, 2015 at 3:05 PM, Mich Talebzadeh <m...@pe

RE: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-23 Thread Mich Talebzadeh
As the example shows, these are all set in hive-site.xml

 



<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
  <description>
    Expects one of [mr, tez, spark].
    Chooses execution engine. Options are: mr (Map reduce, default) or tez (hadoop 2 only)
  </description>
</property>

<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
  <description>Spark event log setting</description>
</property>

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Dasun Hegoda [mailto:dasunheg...@gmail.com] 
Sent: 23 November 2015 10:40
To: user@hive.apache.org
Subject: Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

 

Thank you very much. This is very informative. Do you know how to set these in 
hive-site.xml?

 

hive> set spark.master=

hive> set spark.eventLog.enabled=true;

hive> set spark.eventLog.dir=

hive> set spark.executor.memory=512m; 

hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;

 

If we set these in hive-site.xml I think we will be able to get through
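
One way to make those five session-level settings permanent is to add them to $HIVE_HOME/conf/hive-site.xml as ordinary properties. A minimal sketch, assuming a standalone master and an HDFS event-log directory (the URLs below are placeholders, not values from this thread):

<property>
  <name>spark.master</name>
  <value>spark://master:7077</value> <!-- placeholder: your standalone master URL -->
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>hdfs://namenode:8020/spark-logs</value> <!-- placeholder: must exist and be writable -->
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>

Anything set this way can still be overridden per session with set ...; in the Hive shell.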

 

On Mon, Nov 23, 2015 at 3:05 PM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Hi,

 

I am looking at the set up here

 

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started.

 

First, this is about the configuration of Hive to work with Spark. This is my
understanding:

 

1. Hive uses YARN as its resource manager regardless.

2. Hive uses MapReduce as its execution engine by default.

3. Changing the execution engine to Spark is done at the configuration
level. If you look at the Hive configuration file ->
$HIVE_HOME/conf/hive-site.xml, you will see that the default is mr (MapReduce):



<property>
  <name>hive.execution.engine</name>
  <value>mr</value>
  <description>
    Expects one of [mr, tez].
    Chooses execution engine. Options are: mr (Map reduce, default) or tez (hadoop 2 only)
  </description>
</property>

 

4. If you change that to spark and restart Hive, you will force Hive to use
Spark as its engine. So the choice is either to do it at the configuration level
or at session level (i.e. set hive.execution.engine=spark;). For the rest of the
parameters you can do the same, i.e. in hive-site.xml or at session level.
Personally I would still want Hive to use the MR engine, so I will create
spark-defaults.conf as mentioned (see the sketch after this list).

5. I then start Spark standalone, which works fine:

hduser@rhes564::/usr/lib/spark> ./sbin/start-master.sh

starting org.apache.spark.deploy.master.Master, logging to 
/usr/lib/spark/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out

hduser@rhes564::/usr/lib/spark> more  
/usr/lib/spark/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out

Spark Command: /usr/java/latest/bin/java -cp 
/usr/lib/spark/sbin/../conf/:/usr/lib/spark/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/usr/lib/spark/lib/datanucleus-core-3.2.10.jar:/usr/lib/spark/lib/datanucleus-ap

i-jdo-3.2.6.jar:/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g 
-XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip rhes564 --port 
7077 --webui-port 8080



Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

15/11/21 21:41:58 INFO Master: Registered signal handlers for [TERM, HUP, INT]

15/11/21 21:41:58 WARN Utils: Your hostname, rhes564 resolves to a loopback 
address: 127.0.0.1; using 50.140.197.217 instead (on interface eth0)

15/11/21 21:41:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address

15/11/21 21:41:59 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable

15/11/21 21:41:59 INFO SecurityManager: Changing view acls to: hduser

15/11/21 21:41:59 INFO SecurityManager: Changing modify acls to: hduser

15/11/21 
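
As a sketch of the spark-defaults.conf route mentioned in point 4 above (assumed values; the file lives under $SPARK_HOME/conf and is read by spark-submit, which is how Hive on Spark typically launches its remote driver):

# $SPARK_HOME/conf/spark-defaults.conf -- placeholder values, adjust for your cluster
spark.master                spark://master:7077
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs://namenode:8020/spark-logs
spark.executor.memory       512m
spark.serializer            org.apache.spark.serializer.KryoSerializer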

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-23 Thread Dasun Hegoda
licy.rejectedExecution(ThreadPoolExecutor.java:2048)
> at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
> at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
> at
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
> at
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
> at
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at
> org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:95)
> at
> org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:121)
> at
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:132)
> at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
> at
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:124)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 15/11/23 05:56:38 INFO storage.DiskBlockManager: Shutdown hook called
> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Shutdown hook called
> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-2413b536-c845-4964-a96d-973e5ec02593/httpd-311975ea-ac22-493d-8fd5-0f48b562a9a5
> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-8fefb39a-09b5-443c-b7b4-9c54bce6e245
> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-2413b536-c845-4964-a96d-973e5ec02593/userFiles-b593fc93-c23a-4a9e-aede-ed051f149fcb
> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-2413b536-c845-4964-a96d-973e5ec02593
>
> On Mon, Nov 23, 2015 at 4:19 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
>> As the example shows, these are all set in hive-site.xml
>>
>>
>>
>> <property>
>>   <name>hive.execution.engine</name>
>>   <value>spark</value>
>>   <description>
>>     Expects one of [mr, tez, spark].
>>     Chooses execution engine. Options are: mr (Map reduce, default) or tez (hadoop 2 only)
>>   </description>
>> </property>
>>
>> <property>
>>   <name>spark.eventLog.enabled</name>
>>   <value>true</value>
>>   <description>Spark event log setting</description>
>> </property>
>>
>>
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>> one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>

RE: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-23 Thread Mich Talebzadeh
ot set.

Spark context available as sc.

15/11/23 09:34:00 WARN HiveConf: HiveConf of name 
hive.server2.thrift.http.min.worker.threads does not exist

15/11/23 09:34:00 WARN HiveConf: HiveConf of name hive.mapjoin.optimized.keys 
does not exist

15/11/23 09:34:00 WARN HiveConf: HiveConf of name hive.mapjoin.lazy.hashtable 
does not exist

15/11/23 09:34:00 WARN HiveConf: HiveConf of name 
hive.server2.thrift.http.max.worker.threads does not exist

15/11/23 09:34:00 WARN HiveConf: HiveConf of name 
hive.server2.logging.operation.verbose does not exist

15/11/23 09:34:00 WARN HiveConf: HiveConf of name 
hive.optimize.multigroupby.common.distincts does not exist

java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
/tmp/hive on HDFS should be writable. Current permissions are: rwx--

 

That is where I am now and I have reported this to the Spark user group but no luck
yet.
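
That last error is an HDFS permissions problem on Hive's scratch directory rather than anything engine-specific. A commonly suggested fix (an assumption, not something confirmed later in this thread) is to widen the permissions as an HDFS superuser, or to point the scratch dir somewhere the Hive user owns:

# run as a user with HDFS superuser rights
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod -R 777 /tmp/hive

# alternative: set hive.exec.scratchdir in hive-site.xml to a directory owned by the Hive user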

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 


 

From: Dasun Hegoda [mailto:dasunheg...@gmail.com] 
Sent: 23 November 2015 07:05
To: user@hive.apache.org
Subject: Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

 

Anyone

 

On Sat, Nov 21, 2015 at 1:32 PM, Dasun Hegoda <dasunheg...@gmail.com 
<mailto:dasunheg...@gmail.com> > wrote:

Thank you very much, but I would like to do the integration of these components
myself rather than using a packaged distribution. I think I have come to the right
place. Can you please kindly tell me the configuration steps to run Hive on Spark?

At least, could someone please elaborate on these steps?

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started.

Because in the latter part of the guide, configurations are set in the Hive
runtime shell, which is not permanent as far as I know.

Please help me to get this done. Also, I'm planning to write a detailed guide with
configuration steps to run Hive on Spark, so others can benefit from it and
not be troubled like me.

Can someone please kindly tell me the configuration steps to run Hive on Spark?

 

 

On Sat, Nov 21, 2015 at 12:28 PM, Sai Gopalakrishnan 
<sai.gopalakrish...@aspiresys.com <mailto:sai.gopalakrish...@aspiresys.com> > 
wrote:

Hi everyone,

 

Thank you for your responses. I think Mich's suggestion is a great one, will go 
with it. As Alan suggested, using compactor in Hive should help out with 
managing the delta files.

 

@Dasun, pardon me for deviating from the topic. Regarding configuration, you 
could try a packaged distribution (Hortonworks , Cloudera or MapR) like  Jörn 
Franke said. I use Hortonworks, its open-source and compatible with Linux and 
Windows, provides detailed documentation for installation and can be installed 
in less than a day provided you're all set with the hardware. 
http://hortonworks.com/hdp/downloads/ 



 

 

Regards,

Sai

 


  _  


From: Dasun Hegoda <dasunheg...@gmail.com <mailto:dasunheg...@gmail.com> >
Sent: Saturday, November 21, 2015 8:00 AM
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu 

 

Hi Mich, Hi Sai, Hi Jorn,

Thank you very much for the information. I think we are deviating from the 
original question. Hive on Spark on Ubuntu. Can you please kindly tell me the 
configuration steps?

 

 

 

On Fri, Nov 20, 2015 at 11:10 PM, Jörn Franke <jornfra...@gmail.com 
<mailto:jornfra...@gmail.com> > wrote:

I think the most r

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-23 Thread Dasun Hegoda
 to
>> master spark://master:7077...
>> 15/11/23 05:56:38 ERROR util.SparkUncaughtExceptionHandler: Uncaught
>> exception in thread Thread[appclient-registration-retry-thread,5,main]
>> java.util.concurrent.RejectedExecutionException: Task
>> java.util.concurrent.FutureTask@236f0e3a rejected from
>> java.util.concurrent.ThreadPoolExecutor@500f1402[Running, pool size = 1,
>> active threads = 0, queued tasks = 0, completed tasks = 1]
>> at
>> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>> at
>> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>> at
>> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>> at
>> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
>> at
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
>> at
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>> at
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:95)
>> at
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:121)
>> at
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:132)
>> at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
>> at
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:124)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>> at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>> at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> 15/11/23 05:56:38 INFO storage.DiskBlockManager: Shutdown hook called
>> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Shutdown hook called
>> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
>> /tmp/spark-2413b536-c845-4964-a96d-973e5ec02593/httpd-311975ea-ac22-493d-8fd5-0f48b562a9a5
>> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
>> /tmp/spark-8fefb39a-09b5-443c-b7b4-9c54bce6e245
>> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
>> /tmp/spark-2413b536-c845-4964-a96d-973e5ec02593/userFiles-b593fc93-c23a-4a9e-aede-ed051f149fcb
>> 15/11/23 05:56:38 INFO util.ShutdownHookManager: Deleting directory
>> /tmp/spark-2413b536-c845-4964-a96d-973e5ec02593
>>
>> On Mon, Nov 23, 2015 at 4:19 PM, Mich Talebzadeh <m...@peridale.co.uk>
>> wrote:
>>
>>> As the example shows, these are all set in hive-site.xml
>>>
>>>
>>>
>>> <property>
>>>   <name>hive.execution.engine</name>
>>>   <value>spark</value>
>>>   <description>
>>>     Expects one of [mr, tez, spark].
>>>     Chooses execution engine. Options are: mr (Map reduce, default) or tez (hadoop 2 only)
>>>   </description>
>>> </property>
>>>
>>> <property>
>>>   <name>spark.eventLog.enabled</name>
>>>   <value>true</value>
>>>   <description>Spark event log setting</description>
>>> </property>
>>>
>>>
>>>
>>>
>>>
>>> Mich Talebzadeh
>>>
>>>
>>>
>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>
>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>
>>>
&g

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-22 Thread Dasun Hegoda
Anyone

On Sat, Nov 21, 2015 at 1:32 PM, Dasun Hegoda <dasunheg...@gmail.com> wrote:

> Thank you very much, but I would like to do the integration of these
> components myself rather than using a packaged distribution. I think I have
> come to the right place. Can you please kindly tell me the configuration
> steps to run Hive on Spark?
>
> At least, could someone please elaborate on these steps?
>
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> .
>
> Because in the latter part of the guide, configurations are set in the
> Hive runtime shell, which is not permanent as far as I know.
>
> Please help me to get this done. Also, I'm planning to write a detailed guide
> with configuration steps to run Hive on Spark, so others can benefit from
> it and not be troubled like me.
>
> Can someone please kindly tell me the configuration steps to run Hive on
> Spark?
>
>
> On Sat, Nov 21, 2015 at 12:28 PM, Sai Gopalakrishnan <
> sai.gopalakrish...@aspiresys.com> wrote:
>
>> Hi everyone,
>>
>>
>> Thank you for your responses. I think Mich's suggestion is a great one,
>> will go with it. As Alan suggested, using compactor in Hive should help out
>> with managing the delta files.
>>
>>
>> @Dasun, pardon me for deviating from the topic. Regarding configuration,
>> you could try a packaged distribution (Hortonworks , Cloudera or MapR)
>> like  Jörn Franke said. I use Hortonworks, its open-source and compatible
>> with Linux and Windows, provides detailed documentation for installation
>> and can be installed in less than a day provided you're all set with the
>> hardware. http://hortonworks.com/hdp/downloads/
>>
>>
>> Regards,
>>
>> Sai
>>
>> --
>> *From:* Dasun Hegoda <dasunheg...@gmail.com>
>> *Sent:* Saturday, November 21, 2015 8:00 AM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu
>>
>> Hi Mich, Hi Sai, Hi Jorn,
>>
>> Thank you very much for the information. I think we are deviating from
>> the original question. Hive on Spark on Ubuntu. Can you please kindly tell
>> me the configuration steps?
>>
>>
>>
>> On Fri, Nov 20, 2015 at 11:10 PM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> I think the most recent versions of cloudera or Hortonworks should
>>> include all these components - try their Sandboxes.
>>>
>>> On 20 Nov 2015, at 12:54, Dasun Hegoda <dasunheg...@gmail.com> wrote:
>>>
>>> Where can I get a Hadoop distribution containing these technologies?
>>> Link?
>>>
>>> On Fri, Nov 20, 2015 at 5:22 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> I recommend to use a Hadoop distribution containing these technologies.
>>>> I think you get also other useful tools for your scenario, such as Auditing
>>>> using sentry or ranger.
>>>>
>>>> On 20 Nov 2015, at 10:48, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>>
>>>> Well
>>>>
>>>>
>>>>
>>>> “I'm planning to deploy Hive on Spark but I can't find the
>>>> installation steps. I tried to read the official '[Hive on Spark][1]' guide
>>>> but it has problems. As an example it says under 'Configuring Yarn'
>>>> `yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler`
>>>> but does not imply where should I do it. Also as per the guide
>>>> configurations are set in the Hive runtime shell which is not permanent
>>>> according to my knowledge.”
>>>>
>>>>
>>>>
>>>> You can do that in yarn-site.xml file which is normally under
>>>> $HADOOP_HOME/etc/hadoop.
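
In concrete terms that means adding the scheduler property to $HADOOP_HOME/etc/hadoop/yarn-site.xml and restarting the ResourceManager; a sketch of the snippet that wiki step refers to:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>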
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>
>>>> A Winning Strategy: Running the m

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-21 Thread Dasun Hegoda
Thank you very much, but I would like to do the integration of these
components myself rather than using a packaged distribution. I think I have
come to the right place. Can you please kindly tell me the configuration steps
to run Hive on Spark?

At least, could someone please elaborate on these steps?
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
.

Because in the latter part of the guide, configurations are set in the Hive
runtime shell, which is not permanent as far as I know.

Please help me to get this done. Also, I'm planning to write a detailed guide
with configuration steps to run Hive on Spark, so others can benefit from
it and not be troubled like me.

Can someone please kindly tell me the configuration steps to run Hive on Spark?
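
For reference, the wiki's steps condense to roughly the following shell session (a sketch under the assumption of Hive 1.2.x with a pre-built Spark 1.5.x, matching the spark-assembly jar seen earlier in the thread; paths and the table name are placeholders):

# 1. make the Spark assembly visible to Hive (one common approach is a symlink into Hive's lib dir)
export SPARK_HOME=/usr/lib/spark
ln -s $SPARK_HOME/lib/spark-assembly-1.5.2-hadoop2.6.0.jar $HIVE_HOME/lib/

# 2. make the engine choice and the spark.* properties permanent in $HIVE_HOME/conf/hive-site.xml
#    (hive.execution.engine=spark plus spark.master, spark.eventLog.*, etc., as shown elsewhere in the thread)

# 3. verify: the query should launch Spark stages instead of MapReduce jobs
hive -e "set hive.execution.engine=spark; select count(*) from some_table;"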


On Sat, Nov 21, 2015 at 12:28 PM, Sai Gopalakrishnan <
sai.gopalakrish...@aspiresys.com> wrote:

> Hi everyone,
>
>
> Thank you for your responses. I think Mich's suggestion is a great one,
> will go with it. As Alan suggested, using compactor in Hive should help out
> with managing the delta files.
>
>
> @Dasun, pardon me for deviating from the topic. Regarding configuration,
> you could try a packaged distribution (Hortonworks , Cloudera or MapR)
> like  Jörn Franke said. I use Hortonworks, its open-source and compatible
> with Linux and Windows, provides detailed documentation for installation
> and can be installed in less than a day provided you're all set with the
> hardware. http://hortonworks.com/hdp/downloads/
>
>
> Regards,
>
> Sai
>
> --
> *From:* Dasun Hegoda <dasunheg...@gmail.com>
> *Sent:* Saturday, November 21, 2015 8:00 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu
>
> Hi Mich, Hi Sai, Hi Jorn,
>
> Thank you very much for the information. I think we are deviating from the
> original question. Hive on Spark on Ubuntu. Can you please kindly tell me
> the configuration steps?
>
>
>
> On Fri, Nov 20, 2015 at 11:10 PM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
>> I think the most recent versions of cloudera or Hortonworks should
>> include all these components - try their Sandboxes.
>>
>> On 20 Nov 2015, at 12:54, Dasun Hegoda <dasunheg...@gmail.com> wrote:
>>
>> Where can I get a Hadoop distribution containing these technologies?
>> Link?
>>
>> On Fri, Nov 20, 2015 at 5:22 PM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> I recommend to use a Hadoop distribution containing these technologies.
>>> I think you get also other useful tools for your scenario, such as Auditing
>>> using sentry or ranger.
>>>
>>> On 20 Nov 2015, at 10:48, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>
>>> Well
>>>
>>>
>>>
>>> “I'm planning to deploy Hive on Spark but I can't find the installation
>>> steps. I tried to read the official '[Hive on Spark][1]' guide but it has
>>> problems. As an example it says under 'Configuring Yarn'
>>> `yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler`
>>> but does not imply where should I do it. Also as per the guide
>>> configurations are set in the Hive runtime shell which is not permanent
>>> according to my knowledge.”
>>>
>>>
>>>
>>> You can do that in yarn-site.xml file which is normally under
>>> $HADOOP_HOME/etc/hadoop.
>>>
>>>
>>>
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Mich Talebzadeh
>>>
>>>
>>>
>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>
>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>
>>>
>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>
>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>>> 15", ISBN 978-0-9563693-0-7*.
>>>
>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>> 978-0-9759693-0-4*
>>>
>>> *Publications due shortly:*
>>>
>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>&g

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-20 Thread Dasun Hegoda
With respect to your steps,

1. It's a MySQL database
2. Yes
3. Not sure whether I need it. Do you think I need it? If so, why?
4. Sqoop will get data from MySQL to Hadoop
5. Correct
6. I want to use Hive on Spark for real-time data processing on Hadoop


Daily/periodic changes from the RDBMS to Hive will be done through Oozie and
Sqoop. As per my research I can write a periodic Sqoop/Pig job to be
executed by Oozie. Hope it will work.
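
A periodic load along those lines can be a saved Sqoop incremental job that an Oozie coordinator (or even cron) executes on a schedule; a sketch, where the connection string, table and key column are assumptions:

# define the job once; the scheduler then runs: sqoop job --exec orders_sync
sqoop job --create orders_sync -- import \
  --connect jdbc:mysql://mysql-host/sales \
  --username etl --password-file /user/etl/.mysql.pwd \
  --table orders \
  --hive-import --hive-table sales.orders \
  --incremental append --check-column order_id --last-value 0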

All I want to do is run Hive on Spark on Ubuntu. Can you please kindly tell
me the configuration steps?

