Re: Should we consider Spark3 support for Hive on Spark

2022-08-24 Thread Jan Fili
Yes exactly.

This is what is recommended, because Hive on Spark attracts little interest these days.
However, there is nothing preventing you from doing it.

It is important to me because I sit here doing grassroots work on marrying Hive onto
Kafka Streams.

Owen O'Malley  wrote on Wed., 24 Aug 2022, 18:51:

> Hive on Spark is not recommended. The recommended path is to use either
> Tez or LLAP. If you already are using Spark 3, it would be far easier to
> use Spark SQL.
>
> .. Owen
>
> On Wed, Aug 24, 2022 at 3:46 AM Fred Bai 
> wrote:
>
>> Hi everyone:
>>
>> Do we have any support for Hive on Spark? I need Hive on Spark, but my
>> Spark version is 3.X.
>>
>> I found that Hive is incompatible with Spark 3, and I had to modify a lot of
>> code to make it compatible.
>>
>> Has Hive on Spark been deprecated?
>>
>> Also, Hive on Spark is very slow when jobs execute.
>>
>


Re: Should we consider Spark3 support for Hive on Spark

2022-08-24 Thread Owen O'Malley
Hive on Spark is not recommended. The recommended path is to use either Tez
or LLAP. If you already are using Spark 3, it would be far easier to use
Spark SQL.
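
For anyone taking that path, here is a minimal sketch of what the Spark SQL
route can look like (assumptions: Spark 3 is already pointed at the Hive
metastore, e.g. hive-site.xml is on Spark's conf path; the database/table
names and the Thrift Server host/port below are placeholders):

# run the same HiveQL directly through the Spark SQL CLI
spark-sql -e "SELECT date_key, COUNT(*) FROM some_db.some_table GROUP BY date_key"

# or go through the Spark Thrift Server with beeline
beeline -u "jdbc:hive2://sts-host:10001/default" -e "SELECT COUNT(*) FROM some_db.some_table"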

.. Owen

On Wed, Aug 24, 2022 at 3:46 AM Fred Bai  wrote:

> Hi everyone:
>
> Do we have any support for Hive on Spark? I need Hive on Spark, but my
> Spark version is 3.X.
>
> I found that Hive is incompatible with Spark 3, and I had to modify a lot of
> code to make it compatible.
>
> Has Hive on Spark been deprecated?
>
> Also, Hive on Spark is very slow when jobs execute.
>


Re: Should we consider Spark3 support for Hive on Spark

2022-08-24 Thread hernan saab via user
Do you honestly believe that a non-Apache community dev can just fork Hive,
modify the code, and make it work with any version of Spark? Is that what you
are suggesting? Please let us know if that is the case.


Sent from Yahoo Mail for iPad


On Wednesday, August 24, 2022, 6:13 AM, Jan Fili  wrote:

Can always fork to get things going ;)

*sorry for spam*

On Wed., 24 Aug 2022 at 06:34, hernan saab via user  wrote:
>
>
> Hey Fred,
>
> Contrary to what you may perceive from the Hive docs, what you are trying to
> do is not plug-and-play.
> Only Apache committers can do what you are trying to do.
> Use canned solutions such as confluence or AWS EMR and save yourself weeks of
> wasted effort.
>
> Hernán
> On Tuesday, August 23, 2022 at 08:46:30 PM PDT, Fred Bai 
>  wrote:
>
>
> Hi everyone:
>
> Do we have any support for Hive on Spark? I need Hive on Spark, but my
> Spark version is 3.X.
>
> I found that Hive is incompatible with Spark 3, and I had to modify a lot of
> code to make it compatible.
>
> Has Hive on Spark been deprecated?
>
> Also, Hive on Spark is very slow when jobs execute.





Re: Should we consider Spark3 support for Hive on Spark

2022-08-24 Thread Jan Fili
Can always fork to get things going ;)

*sorry for spam*

On Wed., 24 Aug 2022 at 06:34, hernan saab via user  wrote:
>
>
> Hey Fred,
>
> Contrary to what you may perceive from the Hive docs, what you are trying to
> do is not plug-and-play.
> Only Apache committers can do what you are trying to do.
> Use canned solutions such as confluence or AWS EMR and save yourself weeks of
> wasted effort.
>
> Hernán
> On Tuesday, August 23, 2022 at 08:46:30 PM PDT, Fred Bai 
>  wrote:
>
>
> Hi everyone:
>
> Do we have any support for Hive on Spark? I need Hive on Spark, but my
> Spark version is 3.X.
>
> I found that Hive is incompatible with Spark 3, and I had to modify a lot of
> code to make it compatible.
>
> Has Hive on Spark been deprecated?
>
> Also, Hive on Spark is very slow when jobs execute.


Re: Should we consider Spark3 support for Hive on Spark

2022-08-23 Thread hernan saab via user
 
Hey Fred,
Contrary to what you may perceive from the Hive docs, what you are trying to do
is not plug-and-play. Only Apache committers can do what you are trying to do.
Use canned solutions such as confluence or AWS EMR and save yourself weeks of
wasted effort.

Hernán

On Tuesday, August 23, 2022 at 08:46:30 PM PDT, Fred Bai wrote:

Hi everyone:

Do we have any support for Hive on Spark? I need Hive on Spark, but my Spark
version is 3.X.

I found that Hive is incompatible with Spark 3, and I had to modify a lot of
code to make it compatible.

Has Hive on Spark been deprecated?

Also, Hive on Spark is very slow when jobs execute.

Should we consider Spark3 support for Hive on Spark

2022-08-23 Thread Fred Bai
Hi everyone:

Do we have any support for Hive on Spark? I need Hive on Spark, but my
Spark version is 3.X.

I found that Hive is incompatible with Spark 3, and I had to modify a lot of
code to make it compatible.

Has Hive on Spark been deprecated?

Also, Hive on Spark is very slow when jobs execute.


Re: Time to Remove Hive-on-Spark

2022-04-12 Thread Peter Vary
+1 from my side too.

I have created a PR against the current branch.
It still needs some work, and as many reviews as possible, because it is quite
big and I might have made some mistakes:
https://issues.apache.org/jira/browse/HIVE-26134
https://github.com/apache/hive/pull/3201

Thanks,
Peter

On Thu, 10 Feb 2022 at 17:43, Zoltan Haindrich  wrote:

> Hey,
>
> I think there is no real interest in this feature; we don't have
> users/contributors backing it - the last development was around October 2018,
> and there have been ~2 bugfix commits ever
> since... we should stop carrying dead weight... another 2 weeks went by
> since Stamatis reminded us that after 1.5 years(!) nothing has
> changed.
>
> +1 on removing it
>
> cheers,
> Zoltan
>
> you may inspect some of the recent changes with:
> git log -c `find . -type f -path '**/spark/**'|grep -v xml|grep -v
> properties|grep -v q.out`
>
>
> On 1/28/22 2:32 PM, Stamatis Zampetakis wrote:
> > Hi team,
> >
> > Almost one year has passed since the last exchange in this discussion and
> > if I am not wrong there has been no effort to revive Hive-on-Spark. To be
> > more precise, I don't think I have seen any Spark related JIRA for quite
> > some time now and although I don't want to rush into conclusions, there
> > does not seem to be any community member involved in maintaining or adding
> > new features in this part of the code.
> >
> > Keeping dead code in the repository does not do any good to the project and
> > puts a non-negligible burden to future maintainers.
> >
> > Clearly, we cannot make a new Hive release where a major feature is
> > completely untested so either someone commits to re-enable/fix the
> > respective tests soon or we move forward the work started by David and drop
> > support for Hive-on-Spark.
> >
> > I would like to ask the community if there is anyone who can take up this
> > maintenance task and enable/fix Spark related tests in the next month or so?
> >
> > Best,
> > Stamatis
> >
> > On Sat, Feb 27, 2021 at 4:17 AM Edward Capriolo 
> > wrote:
> >
> >> I do not know how it works for most of the world. But in Cloudera, where
> >> the Tez options were never popular, hive-on-spark represents a solid way
> >> to get things done with lower latency for small datasets.
> >>
> >> As for the Spark adoption: you know, a while ago I came up with some ways
> >> to make Hive more Spark-like. One of them was that I found a way to make
> >> "compile" a Hive keyword so folks could build UDFs on the fly. It was such
> >> an uphill climb. Folks found a way to make it disabled by default for
> >> security. Then later, when things moved from the CLI to Beeline, it was
> >> like the ONLY thing that I found not ported. It was extremely frustrating.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Jul 27, 2020 at 3:19 PM David  wrote:
> >>
> >>> Hello  Xuefu,
> >>>
> >>> I am not part of the Cloudera Hive product team,  though I volunteer to
> >>> work on small projects from time to time.  Perhaps someone from that team
> >>> can chime in with some of their thoughts, but personally, I think that in
> >>> the long run, there will be more of a merge between Hive-on-Spark and other
> >>> Spark-native offerings.  I'm not sure what the differentiation will be
> >>> going forward.  With that said, are there any developers on this mailing
> >>> list who are willing to take on the maintenance effort of keeping HoS
> >>> moving forward?
> >>>
> >>> http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/
> >>>
> >>>
> >>
> https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts.html
> >>>
> >>>
> >>> Thanks.
> >>>
> >>> On Thu, Jul 23, 2020 at 12:35 PM Xuefu Zhang  wrote:
> >>>
> >>>> Previous reasoning seemed to suggest a lack of user adoption. Now we are
> >>>> concerned about ongoing maintenance effort. Both are valid considerations.
> >>>> However, I think we should have ways to find out the answers. Therefore, I
> >>>> suggest the following be carried out:
> >>>>
> >>>> 1. Send out the proposal (removing Hive on Spark) to users including
> >>>

Re: Time to Remove Hive-on-Spark

2022-02-10 Thread Zoltan Haindrich

Hey,

I think there is no real interest in this feature; we don't have users/contributors backing it - the last development was around October 2018, and there have been ~2 bugfix commits ever
since... we should stop carrying dead weight... another 2 weeks went by since Stamatis reminded us that after 1.5 years(!) nothing has changed.


+1 on removing it

cheers,
Zoltan

you may inspect some of the recent changes with:
git log -c `find . -type f -path '**/spark/**'|grep -v xml|grep -v 
properties|grep -v q.out`


On 1/28/22 2:32 PM, Stamatis Zampetakis wrote:

Hi team,

Almost one year has passed since the last exchange in this discussion and
if I am not wrong there has been no effort to revive Hive-on-Spark. To be
more precise, I don't think I have seen any Spark related JIRA for quite
some time now and although I don't want to rush into conclusions, there
does not seem to be any community member involved in maintaining or adding
new features in this part of the code.

Keeping dead code in the repository does not do any good to the project and
puts a non-negligible burden to future maintainers.

Clearly, we cannot make a new Hive release where a major feature is
completely untested so either someone commits to re-enable/fix the
respective tests soon or we move forward the work started by David and drop
support for Hive-on-Spark.

I would like to ask the community if there is anyone who can take up this
maintenance task and enable/fix Spark related tests in the next month or so?

Best,
Stamatis

On Sat, Feb 27, 2021 at 4:17 AM Edward Capriolo 
wrote:


I do not know how it works for most of the world. But in Cloudera, where the
Tez options were never popular, hive-on-spark represents a solid way to get
things done with lower latency for small datasets.

As for the Spark adoption: you know, a while ago I came up with some ways to
make Hive more Spark-like. One of them was that I found a way to make "compile"
a Hive keyword so folks could build UDFs on the fly. It was such an
uphill climb. Folks found a way to make it disabled by default for security.
Then later, when things moved from the CLI to Beeline, it was like the ONLY thing
that I found not ported. It was extremely frustrating.






On Mon, Jul 27, 2020 at 3:19 PM David  wrote:


Hello  Xuefu,

I am not part of the Cloudera Hive product team,  though I volunteer to
work on small projects from time to time.  Perhaps someone from that team
can chime in with some of their thoughts, but personally, I think that in
the long run, there will be more of a merge between Hive-on-Spark and other
Spark-native offerings.  I'm not sure what the differentiation will be
going forward.  With that said, are there any developers on this mailing
list who are willing to take on the maintenance effort of keeping HoS
moving forward?

http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/



https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts.html



Thanks.

On Thu, Jul 23, 2020 at 12:35 PM Xuefu Zhang  wrote:


Previous reasoning seemed to suggest a lack of user adoption. Now we are
concerned about ongoing maintenance effort. Both are valid considerations.
However, I think we should have ways to find out the answers. Therefore, I
suggest the following be carried out:

1. Send out the proposal (removing Hive on Spark) to users including
user@hive.apache.org and get their feedback.
2. Ask if any developers on this mailing list are willing to take on the
maintenance effort.

I'm concerned about user impact because I can still see issues being
reported on HoS from time to time. I'm more concerned about the future of
Hive if we narrow Hive neutrality on execution engines, which will possibly
force more Hive users to migrate to other alternatives such as Spark SQL,
which is already eroding Hive's user base.

Being open and neutral used to be Hive's most admired strengths.

Thanks,
Xuefu


On Wed, Jul 22, 2020 at 8:46 AM Alan Gates  wrote:


An important point here is I don't believe David is proposing to remove
Hive on Spark from the 2 or 3 lines, but only from trunk.  Continuing to
support it in existing 2 and 3 lines makes sense, but since no one has
maintained it on trunk for some time and it does not work with many of the
newer features it should be removed from trunk.

Alan.

On Tue, Jul 21, 2020 at 4:10 PM Chao Sun  wrote:


Thanks David. FWIW Uber is still running Hive on Spark (2.3.4) on a very
large scale in production right now and I don't think we have any plan to
change it soon.



On Tue, Jul 21, 2020 at 11:28 AM David  wrote:


Hello,

Thanks for the feedback.

Just a quick recap: I did propose this @dev and I received unanimous +1's
from the community.  After a couple months, I created the PR.

Certainly open to discussion, but there hasn't been any discussion thus far
because there have been no objectio

Re: Time to Remove Hive-on-Spark

2022-01-28 Thread Stamatis Zampetakis
Hi team,

Almost one year has passed since the last exchange in this discussion and
if I am not wrong there has been no effort to revive Hive-on-Spark. To be
more precise, I don't think I have seen any Spark related JIRA for quite
some time now and although I don't want to rush into conclusions, there
does not seem to be any community member involved in maintaining or adding
new features in this part of the code.

Keeping dead code in the repository does not do any good to the project and
puts a non-negligible burden to future maintainers.

Clearly, we cannot make a new Hive release where a major feature is
completely untested so either someone commits to re-enable/fix the
respective tests soon or we move forward the work started by David and drop
support for Hive-on-Spark.

I would like to ask the community if there is anyone who can take up this
maintenance task and enable/fix Spark related tests in the next month or so?

Best,
Stamatis

On Sat, Feb 27, 2021 at 4:17 AM Edward Capriolo 
wrote:

> I do not know how it works for most of the world. But in Cloudera, where the
> Tez options were never popular, hive-on-spark represents a solid way to get
> things done with lower latency for small datasets.
>
> As for the Spark adoption: you know, a while ago I came up with some ways to
> make Hive more Spark-like. One of them was that I found a way to make "compile"
> a Hive keyword so folks could build UDFs on the fly. It was such an
> uphill climb. Folks found a way to make it disabled by default for security.
> Then later, when things moved from the CLI to Beeline, it was like the ONLY thing
> that I found not ported. It was extremely frustrating.
>
>
>
>
>
>
> On Mon, Jul 27, 2020 at 3:19 PM David  wrote:
>
> > Hello  Xuefu,
> >
> > I am not part of the Cloudera Hive product team,  though I volunteer to
> > work on small projects from time to time.  Perhaps someone from that team
> > can chime in with some of their thoughts, but personally, I think that in
> > the long run, there will be more of a merge between Hive-on-Spark and other
> > Spark-native offerings.  I'm not sure what the differentiation will be
> > going forward.  With that said, are there any developers on this mailing
> > list who are willing to take on the maintenance effort of keeping HoS
> > moving forward?
> >
> > http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/
> >
> >
> https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts.html
> >
> >
> > Thanks.
> >
> > On Thu, Jul 23, 2020 at 12:35 PM Xuefu Zhang  wrote:
> >
> > > Previous reasoning seemed to suggest a lack of user adoption. Now we are
> > > concerned about ongoing maintenance effort. Both are valid considerations.
> > > However, I think we should have ways to find out the answers. Therefore, I
> > > suggest the following be carried out:
> > >
> > > 1. Send out the proposal (removing Hive on Spark) to users including
> > > user@hive.apache.org and get their feedback.
> > > 2. Ask if any developers on this mailing list are willing to take on the
> > > maintenance effort.
> > >
> > > I'm concerned about user impact because I can still see issues being
> > > reported on HoS from time to time. I'm more concerned about the future of
> > > Hive if we narrow Hive neutrality on execution engines, which will possibly
> > > force more Hive users to migrate to other alternatives such as Spark SQL,
> > > which is already eroding Hive's user base.
> > >
> > > Being open and neutral used to be Hive's most admired strengths.
> > >
> > > Thanks,
> > > Xuefu
> > >
> > >
> > > On Wed, Jul 22, 2020 at 8:46 AM Alan Gates  wrote:
> > >
> > > > An important point here is I don't believe David is proposing to remove
> > > > Hive on Spark from the 2 or 3 lines, but only from trunk.  Continuing to
> > > > support it in existing 2 and 3 lines makes sense, but since no one has
> > > > maintained it on trunk for some time and it does not work with many of
> > > > the newer features it should be removed from trunk.
> > > >
> > > > Alan.
> > > >
> > > > On Tue, Jul 21, 2020 at 4:10 PM Chao Sun  wrote:
> > > >
> > > > > Thanks David. FWIW Uber is still running Hive on Spark (2.3.4) on a
> > > > > very large scale in production right now and I don't think we hav

hive on spark submit to yarn pools?

2021-11-03 Thread igyu
Hive on Spark + Sentry.

jdbc:hive2://hiveser:1/;user=ajxtj;password=123456;hive.server2.proxy.user=jztwk

 pro.put("hiveconf:spark.yarn.queue","root.jzyc");


I am using the YARN pool root.jzyc,
but only the hive and ajxtj users are allowed to use that pool.

So I want the jztwk proxy user to submit to root.jzyc,
but right now the application submits to root.jzyc as the hive user.
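
A rough sketch of one way to try this (assumptions: the HiveServer2 host/port
below are placeholders, hive.server2.enable.doAs=true on the server so that
queries run as the proxy user, and spark.yarn.queue is allowed by the server's
configuration whitelist):

# connect as ajxtj, impersonate jztwk, and point Hive on Spark at root.jzyc
beeline -n ajxtj -p 123456 \
  -u "jdbc:hive2://hiveser:10000/default;hive.server2.proxy.user=jztwk?spark.yarn.queue=root.jzyc"

# or set the queue after connecting (a new Spark session may be needed to pick it up)
set spark.yarn.queue=root.jzyc;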



igyu


Re: Removing Hive-on-Spark

2020-07-27 Thread David
Hello Stephen,

Thanks for your interest.  Can you please elaborate a bit more on your
question?

Thanks.

On Mon, Jul 27, 2020 at 4:11 PM Stephen Boesch  wrote:

> Why would it be this way instead of the other way around?
>
> On Mon, 27 Jul 2020 at 12:27, David  wrote:
>
>> Hello Hive Users.
>>
>> I am interested in gathering some feedback on the adoption of
>> Hive-on-Spark.
>>
>> Does anyone care to volunteer their usage information and would you be
>> open to removing it in favor of Hive-on-Tez in subsequent releases of Hive?
>>
>> If you are on MapReduce still, would you be open to migrating to Tez?
>>
>> Thanks.
>>
>


Re: Removing Hive-on-Spark

2020-07-27 Thread Stephen Boesch
Why would it be this way instead of the other way around?

On Mon, 27 Jul 2020 at 12:27, David  wrote:

> Hello Hive Users.
>
> I am interested in gathering some feedback on the adoption of
> Hive-on-Spark.
>
> Does anyone care to volunteer their usage information and would you be
> open to removing it in favor of Hive-on-Tez in subsequent releases of Hive?
>
> If you are on MapReduce still, would you be open to migrating to Tez?
>
> Thanks.
>


Removing Hive-on-Spark

2020-07-27 Thread David
Hello Hive Users.

I am interested in gathering some feedback on the adoption of Hive-on-Spark.

Does anyone care to volunteer their usage information and would you be open
to removing it in favor of Hive-on-Tez in subsequent releases of Hive?

If you are on MapReduce still, would you be open to migrating to Tez?

Thanks.


About the Hive on Spark 3.x upgrade plan

2020-05-14 Thread 王嘉廉
Hello,
May I ask about the Hive on Spark 3.x upgrade plan? 
I found that the newest dependent Spark version on the master branch is 2.4.5.


Thanks,
--- wjl






Re: Running Hive on Spark

2019-03-13 Thread Rajesh Balamohan
"Hive on Spark" uses Spark purely as execution engine. It would not get the
benefits of codegen and other optimizations of Spark.

If it is mainly for testing, OOTB parameters should work without issues.

However, Tez has lot better edge than Hive on Spark.

Some of the areas where Hive on Spark needs to catch up are,

* No support for auto reduce parallelism.
* Not full dynamic partition pruning is supported.
* Fetchers can start only when all mappers are complete. This can be a huge
painpoint in lot of cases.
* Have to specify CombinedInputFormat for tackling small files, but that
has issues in splitting.

~Rajesh.B

On Tue, Mar 12, 2019 at 2:25 PM Daniel Mateus Pires 
wrote:

> Hi Rajesh,
>
> I'm trying to further my understanding of the various interactions and
> set-ups for Hive + Spark
>
> My understanding so far is that running queries against the
> SparkThriftServer uses the SparkSQL engine whereas the HiveServer2 + Hive +
> Spark execution engine uses Hive primitives and only uses Spark for the
> actual computations
>
> I get your question about "why would I do that?" But my goal right now is
> to understand "what does it mean if I do that"
>
> Best regards
> Daniel
>
> On Tue 12 Mar 2019, 02:21 Rajesh Balamohan,  wrote:
>
>> Not sure why you are using SparkThriftServer. OOTB HiveServer2 would be
>> good enough for this.
>>
>> Is there any specific reason for moving from tez to spark as execution
>> engine?
>>
>> ~Rajesh.B
>>
>> On Mon, Mar 11, 2019 at 9:45 PM Daniel Mateus Pires 
>> wrote:
>>
>>> Hi there,
>>>
>>> I would like to run Hive using Spark as the execution engine and I'm
>>> pretty confused with the set up.
>>>
>>> For reference I'm using AWS EMR.
>>>
>>> First, I'm confused at the difference between running Hive with Spark as
>>> its execution engine sending queries to Hive using HiveServer2 (Thrift),
>>> and using the SparkThriftServer (I thought it was built on top of
>>> HiveServer2) ? Could I read more about the differences somewhere ?
>>>
>>> I followed the following docs:
>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>>> and after changing the execution engine from the EMR default (tez) to
>>> spark, I can see the difference on the HiveServer2 UI at port 10002 where
>>> now the steps show "spark" as the execution engine.
>>>
>>> However I've set up the following config to get the Spark History Server
>>> displaying queries coming through JDBC and I can see queries sent to the
>>> SparkThriftServer (port 10001) but not to the HiveServer2 with execution
>>> engine of Spark (port 1)
>>>
>>> set spark.eventLog.enabled=true;
>>> set spark.master=localhost:18080;
>>> set spark.eventLog.dir=hdfs:///var/log/spark/apps;
>>> set spark.executor.memory=512m;
>>> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
>>>
>>> Thanks!
>>>
>>


Re: Running Hive on Spark

2019-03-12 Thread Daniel Mateus Pires
Hi Rajesh,

I'm trying to further my understanding of the various interactions and
set-ups for Hive + Spark

My understanding so far is that running queries against the
SparkThriftServer uses the SparkSQL engine whereas the HiveServer2 + Hive +
Spark execution engine uses Hive primitives and only uses Spark for the
actual computations

I get your question about "why would I do that?" But my goal right now is
to understand "what does it mean if I do that"

Best regards
Daniel

On Tue 12 Mar 2019, 02:21 Rajesh Balamohan,  wrote:

> Not sure why you are using SparkThriftServer. OOTB HiveServer2 would be
> good enough for this.
>
> Is there any specific reason for moving from tez to spark as execution
> engine?
>
> ~Rajesh.B
>
> On Mon, Mar 11, 2019 at 9:45 PM Daniel Mateus Pires 
> wrote:
>
>> Hi there,
>>
>> I would like to run Hive using Spark as the execution engine and I'm
>> pretty confused with the set up.
>>
>> For reference I'm using AWS EMR.
>>
>> First, I'm confused at the difference between running Hive with Spark as
>> its execution engine sending queries to Hive using HiveServer2 (Thrift),
>> and using the SparkThriftServer (I thought it was built on top of
>> HiveServer2) ? Could I read more about the differences somewhere ?
>>
>> I followed the following docs:
>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>> and after changing the execution engine from the EMR default (tez) to
>> spark, I can see the difference on the HiveServer2 UI at port 10002 where
>> now the steps show "spark" as the execution engine.
>>
>> However I've set up the following config to get the Spark History Server
>> displaying queries coming through JDBC and I can see queries sent to the
>> SparkThriftServer (port 10001) but not to the HiveServer2 with execution
>> engine of Spark (port 1)
>>
>> set spark.eventLog.enabled=true;
>> set spark.master=localhost:18080;
>> set spark.eventLog.dir=hdfs:///var/log/spark/apps;
>> set spark.executor.memory=512m;
>> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
>>
>> Thanks!
>>
>


Re: Running Hive on Spark

2019-03-11 Thread Rajesh Balamohan
Not sure why you are using SparkThriftServer. OOTB HiveServer2 would be
good enough for this.

Is there any specific reason for moving from tez to spark as execution
engine?

~Rajesh.B

On Mon, Mar 11, 2019 at 9:45 PM Daniel Mateus Pires 
wrote:

> Hi there,
>
> I would like to run Hive using Spark as the execution engine and I'm
> pretty confused with the set up.
>
> For reference I'm using AWS EMR.
>
> First, I'm confused at the difference between running Hive with Spark as
> its execution engine sending queries to Hive using HiveServer2 (Thrift),
> and using the SparkThriftServer (I thought it was built on top of
> HiveServer2) ? Could I read more about the differences somewhere ?
>
> I followed the following docs:
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> and after changing the execution engine from the EMR default (tez) to
> spark, I can see the difference on the HiveServer2 UI at port 10002 where
> now the steps show "spark" as the execution engine.
>
> However I've set up the following config to get the Spark History Server
> displaying queries coming through JDBC and I can see queries sent to the
> SparkThriftServer (port 10001) but not to the HiveServer2 with execution
> engine of Spark (port 1)
>
> set spark.eventLog.enabled=true;
> set spark.master=localhost:18080;
> set spark.eventLog.dir=hdfs:///var/log/spark/apps;
> set spark.executor.memory=512m;
> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
>
> Thanks!
>


Running Hive on Spark

2019-03-11 Thread Daniel Mateus Pires
Hi there,

I would like to run Hive using Spark as the execution engine and I'm pretty
confused with the set up.

For reference I'm using AWS EMR.

First, I'm confused about the difference between running Hive with Spark as
its execution engine, sending queries to Hive using HiveServer2 (Thrift),
and using the SparkThriftServer (I thought it was built on top of
HiveServer2)? Could I read more about the differences somewhere?

I followed the following docs:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
and after changing the execution engine from the EMR default (tez) to
spark, I can see the difference on the HiveServer2 UI at port 10002 where
now the steps show "spark" as the execution engine.

However I've set up the following config to get the Spark History Server
displaying queries coming through JDBC and I can see queries sent to the
SparkThriftServer (port 10001) but not to the HiveServer2 with execution
engine of Spark (port 1)

set spark.eventLog.enabled=true;
set spark.master=localhost:18080;
set spark.eventLog.dir=hdfs:///var/log/spark/apps;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;

Thanks!


Re: Does Hive on Spark work with Spark 2.3.0?

2018-06-19 Thread Sachin janani
Yes, I built it the same way as you suggested, but no luck.


Regards,
Sachin Janani

On Tue, Jun 19, 2018 at 7:13 PM, Sahil Takiar 
wrote:

> You should be building Spark without Hive. For Spark 2.3.0, the command is:
>
> ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided
> "
>
> If you check the distribution after running the command, it shouldn't
> contain any Hive jars.
>
> On Tue, Jun 19, 2018 at 7:18 AM, Sachin janani  > wrote:
>
>> It shows following exception :
>>
>>
>>
>>
>> java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
>>     at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:205)
>>     at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
>>     at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
>>     at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
>>
>>
>> After looking at jira SPARK-13446
>> <https://issues.apache.org/jira/browse/SPARK-13446> it seems that it is
>> fixed but as per the source code it is not. So I resolved by changing spark
>> code and rebuilding the spark binaries again but now it shows new error
>> NoSuchMethodError. As per my preliminary investigation it seems that Spark
>> is build with Hive 1.2.1 which is causing this issues. Can you please let
>> me know if i am missing anything?
>>
>>
>> Regards,
>> Sachin Janani
>>
>> On Tue, Jun 19, 2018 at 5:38 PM, Sahil Takiar 
>> wrote:
>>
>>> I updated the doc to reflect that Hive 3.0.0 works with Spark 2.3.0.
>>> What issues are you seeing?
>>>
>>> On Tue, Jun 19, 2018 at 7:03 AM, Sachin janani <
>>> sachin.janani...@gmail.com> wrote:
>>>
>>>> This is the same link which I followed. As per this link for
>>>> spark-2.3.0 we need to use hive master instead of hive 3.0.0. Also we
>>>> need to custom build spark without hive dependencies but after trying
>>>> all this it shows some compatibility issues.
>>>>
>>>>
>>>> Regards,
>>>> Sachin Janani
>>>>
>>>> On Tue, Jun 19, 2018 at 5:02 PM, Sahil Takiar 
>>>> wrote:
>>>> > Yes, Hive 3.0.0 works with Spark 2.3.0 - this section of the wiki has
>>>> > details on which Hive releases support which Spark versions.
>>>> >
>>>> > On Tue, Jun 19, 2018 at 5:59 AM, Sachin janani <
>>>> sachin.janani...@gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> Hi,
>>>> >> I am trying to run hive on spark by following the steps mentioned
>>>> >> here-
>>>> >> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spa
>>>> rk%3A+Getting+Started
>>>> >> , but getting many compatibility issues like NoSuchMethodError,
>>>> >> NoSuchFieldException etc. So just need to know if it works and
>>>> whether
>>>> >> someone tried it out,
>>>> >>
>>>> >>
>>>> >> Thanks and Regards,
>>>> >> --
>>>> >> Sachin Janani
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Sahil Takiar
>>>> > Software Engineer
>>>> > takiar.sa...@gmail.com | (510) 673-0309
>>>>
>>>>
>>>>
>>>> --
>>>> Sachin Janani
>>>>
>>>
>>>
>>>
>>> --
>>> Sahil Takiar
>>> Software Engineer
>>> takiar.sa...@gmail.com | (510) 673-0309
>>>
>>
>>
>>
>> --
>> *Sachin Janani*
>>
>>
>
>
>
> --
> Sahil Takiar
> Software Engineer
> takiar.sa...@gmail.com | (510) 673-0309
>



-- 
*Sachin Janani*


Re: Does Hive on Spark work with Spark 2.3.0?

2018-06-19 Thread Sahil Takiar
You should be building Spark without Hive. For Spark 2.3.0, the command is:

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided
"

If you check the distribution after running the command, it shouldn't
contain any Hive jars.
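
A quick way to sanity-check that (a sketch; the exact tarball name depends on
the --name flag and the Spark version used in the build):

# list the contents of the generated distribution and look for Hive artifacts
tar -tzf spark-2.3.0-bin-hadoop2-without-hive.tgz | grep -i hive
# an empty result means no Hive jars made it into the distribution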

On Tue, Jun 19, 2018 at 7:18 AM, Sachin janani 
wrote:

> It shows following exception :
>
>
>
>
> java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
>     at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:205)
>     at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
>     at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
>     at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
>
>
> After looking at JIRA SPARK-13446
> <https://issues.apache.org/jira/browse/SPARK-13446> it seems that it is
> fixed, but as per the source code it is not. So I resolved it by changing the Spark
> code and rebuilding the Spark binaries, but now it shows a new error,
> NoSuchMethodError. As per my preliminary investigation, it seems that Spark
> is built with Hive 1.2.1, which is causing these issues. Can you please let
> me know if I am missing anything?
>
>
> Regards,
> Sachin Janani
>
> On Tue, Jun 19, 2018 at 5:38 PM, Sahil Takiar 
> wrote:
>
>> I updated the doc to reflect that Hive 3.0.0 works with Spark 2.3.0. What
>> issues are you seeing?
>>
>> On Tue, Jun 19, 2018 at 7:03 AM, Sachin janani <
>> sachin.janani...@gmail.com> wrote:
>>
>>> This is the same link which I followed. As per this link for
>>> spark-2.3.0 we need to use hive master instead of hive 3.0.0. Also we
>>> need to custom build spark without hive dependencies but after trying
>>> all this it shows some compatibility issues.
>>>
>>>
>>> Regards,
>>> Sachin Janani
>>>
>>> On Tue, Jun 19, 2018 at 5:02 PM, Sahil Takiar 
>>> wrote:
>>> > Yes, Hive 3.0.0 works with Spark 2.3.0 - this section of the wiki has
>>> > details on which Hive releases support which Spark versions.
>>> >
>>> > On Tue, Jun 19, 2018 at 5:59 AM, Sachin janani <
>>> sachin.janani...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hi,
>>> >> I am trying to run hive on spark by following the steps mentioned
>>> >> here-
>>> >> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spa
>>> rk%3A+Getting+Started
>>> >> , but getting many compatibility issues like NoSuchMethodError,
>>> >> NoSuchFieldException etc. So just need to know if it works and whether
>>> >> someone tried it out,
>>> >>
>>> >>
>>> >> Thanks and Regards,
>>> >> --
>>> >> Sachin Janani
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Sahil Takiar
>>> > Software Engineer
>>> > takiar.sa...@gmail.com | (510) 673-0309
>>>
>>>
>>>
>>> --
>>> Sachin Janani
>>>
>>
>>
>>
>> --
>> Sahil Takiar
>> Software Engineer
>> takiar.sa...@gmail.com | (510) 673-0309
>>
>
>
>
> --
> *Sachin Janani*
>
>



-- 
Sahil Takiar
Software Engineer
takiar.sa...@gmail.com | (510) 673-0309


Re: Does Hive on Spark work with Spark 2.3.0?

2018-06-19 Thread Sachin janani
It shows following exception :




java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
    at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:205)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)


After looking at JIRA SPARK-13446
<https://issues.apache.org/jira/browse/SPARK-13446> it seems that it is
fixed, but as per the source code it is not. So I resolved it by changing the Spark
code and rebuilding the Spark binaries, but now it shows a new error,
NoSuchMethodError. As per my preliminary investigation, it seems that Spark
is built with Hive 1.2.1, which is causing these issues. Can you please let
me know if I am missing anything?


Regards,
Sachin Janani

On Tue, Jun 19, 2018 at 5:38 PM, Sahil Takiar 
wrote:

> I updated the doc to reflect that Hive 3.0.0 works with Spark 2.3.0. What
> issues are you seeing?
>
> On Tue, Jun 19, 2018 at 7:03 AM, Sachin janani  > wrote:
>
>> This is the same link which I followed. As per this link for
>> spark-2.3.0 we need to use hive master instead of hive 3.0.0. Also we
>> need to custom build spark without hive dependencies but after trying
>> all this it shows some compatibility issues.
>>
>>
>> Regards,
>> Sachin Janani
>>
>> On Tue, Jun 19, 2018 at 5:02 PM, Sahil Takiar 
>> wrote:
>> > Yes, Hive 3.0.0 works with Spark 2.3.0 - this section of the wiki has
>> > details on which Hive releases support which Spark versions.
>> >
>> > On Tue, Jun 19, 2018 at 5:59 AM, Sachin janani <
>> sachin.janani...@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >> I am trying to run hive on spark by following the steps mentioned
>> >> here-
>> >> https://cwiki.apache.org/confluence/display/Hive/Hive+on+
>> Spark%3A+Getting+Started
>> >> , but getting many compatibility issues like NoSuchMethodError,
>> >> NoSuchFieldException etc. So just need to know if it works and whether
>> >> someone tried it out,
>> >>
>> >>
>> >> Thanks and Regards,
>> >> --
>> >> Sachin Janani
>> >
>> >
>> >
>> >
>> > --
>> > Sahil Takiar
>> > Software Engineer
>> > takiar.sa...@gmail.com | (510) 673-0309
>>
>>
>>
>> --
>> Sachin Janani
>>
>
>
>
> --
> Sahil Takiar
> Software Engineer
> takiar.sa...@gmail.com | (510) 673-0309
>



-- 
*Sachin Janani*


Re: Does Hive on Spark work with Spark 2.3.0?

2018-06-19 Thread Sahil Takiar
I updated the doc to reflect that Hive 3.0.0 works with Spark 2.3.0. What
issues are you seeing?

On Tue, Jun 19, 2018 at 7:03 AM, Sachin janani 
wrote:

> This is the same link that I followed. As per this link, for
> Spark 2.3.0 we need to use Hive master instead of Hive 3.0.0. We also
> need to custom-build Spark without Hive dependencies, but after trying
> all of this it still shows some compatibility issues.
>
>
> Regards,
> Sachin Janani
>
> On Tue, Jun 19, 2018 at 5:02 PM, Sahil Takiar 
> wrote:
> > Yes, Hive 3.0.0 works with Spark 2.3.0 - this section of the wiki has
> > details on which Hive releases support which Spark versions.
> >
> > On Tue, Jun 19, 2018 at 5:59 AM, Sachin janani <
> sachin.janani...@gmail.com>
> > wrote:
> >>
> >> Hi,
> >> I am trying to run hive on spark by following the steps mentioned
> >> here-
> >> https://cwiki.apache.org/confluence/display/Hive/Hive+
> on+Spark%3A+Getting+Started
> >> , but getting many compatibility issues like NoSuchMethodError,
> >> NoSuchFieldException etc. So just need to know if it works and whether
> >> someone tried it out,
> >>
> >>
> >> Thanks and Regards,
> >> --
> >> Sachin Janani
> >
> >
> >
> >
> > --
> > Sahil Takiar
> > Software Engineer
> > takiar.sa...@gmail.com | (510) 673-0309
>
>
>
> --
> Sachin Janani
>



-- 
Sahil Takiar
Software Engineer
takiar.sa...@gmail.com | (510) 673-0309


Re: Does Hive on Spark work with Spark 2.3.0?

2018-06-19 Thread Sachin janani
This is the same link that I followed. As per this link, for
Spark 2.3.0 we need to use Hive master instead of Hive 3.0.0. We also
need to custom-build Spark without Hive dependencies, but after trying
all of this it still shows some compatibility issues.


Regards,
Sachin Janani

On Tue, Jun 19, 2018 at 5:02 PM, Sahil Takiar  wrote:
> Yes, Hive 3.0.0 works with Spark 2.3.0 - this section of the wiki has
> details on which Hive releases support which Spark versions.
>
> On Tue, Jun 19, 2018 at 5:59 AM, Sachin janani 
> wrote:
>>
>> Hi,
>> I am trying to run hive on spark by following the steps mentioned
>> here-
>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>> , but getting many compatibility issues like NoSuchMethodError,
>> NoSuchFieldException etc. So just need to know if it works and whether
>> someone tried it out,
>>
>>
>> Thanks and Regards,
>> --
>> Sachin Janani
>
>
>
>
> --
> Sahil Takiar
> Software Engineer
> takiar.sa...@gmail.com | (510) 673-0309



-- 
Sachin Janani


Re: Does Hive on Spark work with Spark 2.3.0?

2018-06-19 Thread Sahil Takiar
Yes, Hive 3.0.0 works with Spark 2.3.0 - this
<https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-VersionCompatibility>
section of the wiki has details on which Hive releases support which Spark
versions.

On Tue, Jun 19, 2018 at 5:59 AM, Sachin janani 
wrote:

> Hi,
> I am trying to run Hive on Spark by following the steps mentioned here:
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> but I am getting many compatibility issues like NoSuchMethodError,
> NoSuchFieldException, etc. So I just need to know whether it works and whether
> someone has tried it out.
>
>
> Thanks and Regards,
> --
> Sachin Janani
>



-- 
Sahil Takiar
Software Engineer
takiar.sa...@gmail.com | (510) 673-0309


Does Hive on Spark work with Spark 2.3.0?

2018-06-19 Thread Sachin janani
Hi,
I am trying to run Hive on Spark by following the steps mentioned here:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
but I am getting many compatibility issues like NoSuchMethodError,
NoSuchFieldException, etc. So I just need to know whether it works and whether
someone has tried it out.


Thanks and Regards,
-- 
Sachin Janani


Re: hive on spark - why is it so hard?

2017-10-02 Thread Jörn Franke
You should try with TEZ+LLAP.

Additionally, you will need to compare different configurations.

Finally, a generic comparison is meaningless on its own:
you should use the queries, data, and file formats that your users will actually be using later.

> On 2. Oct 2017, at 03:06, Stephen Sprague  wrote:
> 
> so...  i made some progress after much copying of jar files around (as 
> alluded to by Gopal previously on this thread).
> 
> 
> following the instructions here: 
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> 
> and doing this as instructed will leave off about a dozen or so jar files 
> that spark'll need:
>   ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz 
> "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
> 
> i ended copying the missing jars to $SPARK_HOME/jars but i would have 
> preferred to just add a path(s) to the spark class path but i did not find 
> any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH but 
> i don't see the analagous var in spark - i don't think it inherits the hive 
> classpath.
> 
> anyway a simple query is now working under Hive On Spark so i think i might 
> be over the hump.  Now its a matter of comparing the performance with Tez.
> 
> Cheers,
> Stephen.
> 
> 
>> On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague  wrote:
>> ok.. getting further.  seems now i have to deploy hive to all nodes in the 
>> cluster - don't think i had to do that before but not a big deal to do it 
>> now.
>> 
>> for me:
>> HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
>> SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>> 
>> on all three nodes now.
>> 
>> i started spark master on the namenode and i started spark slaves (2) on two 
>> datanodes of the cluster. 
>> 
>> so far so good.
>> 
>> now i run my usual test command.
>> 
>> $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set 
>> hive.execution.engine=spark; select date_key, count(*) from 
>> fe_inventory.merged_properties_hist group by 1 order by 1;'
>> 
>> i get a little further now and find the stderr from the Spark Web UI 
>> interface (nice) and it reports this:
>> 
>> 17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to 
>> spark://Worker@172.19.79.127:40145
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:483)
>>  at 
>> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>>  at 
>> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>> Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>>  at 
>> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:47)
>>  at 
>> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:134)
>>  at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>>  ... 6 more
>> 
>> 
>> searching around the internet i find this is probably a compatibility issue.
>> 
>> i know. i know. no surprise here.  
>> 
>> so i guess i just got to the point where everybody else is... build spark 
>> w/o hive. 
>> 
>> lemme see what happens next.
>> 
>> 
>> 
>> 
>> 
>>> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague  wrote:
>>> thanks.  I haven't had a chance to dig into this again today but i do 
>>> appreciate the pointer.  I'll keep you posted.
>>> 
>>>> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar  
>>>> wrote:
>>>> You can try increasing the value of hive.spark.client.connect.timeout. 
>>>> Would also suggest taking a look at the HoS Remote Driver logs. The driver 
>>>> gets launched in a YARN container (assuming you are running Spark in 
>>>> yarn-client mode), so you just have to find the logs for that container.
>>>> 
>>>> --Sahil
>>>> 
>>>>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague  
>>>>> wrote:
>>>>> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's 
>>>>> where i'm at now.
>>>>> 
>>>>> 2017-09-26T21:10:38,8

Re: hive on spark - why is it so hard?

2017-10-01 Thread Stephen Sprague
so...  i made some progress after much copying of jar files around (as
alluded to by Gopal previously on this thread).


following the instructions here:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

and doing this as instructed will leave off about a dozen or so jar files
that spark'll need:
  ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

i ended copying the missing jars to $SPARK_HOME/jars but i would have
preferred to just add a path(s) to the spark class path but i did not find
any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH
but i don't see the analagous var in spark - i don't think it inherits the
hive classpath.
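
One alternative to copying jars around, assuming the missing jars come from the
"hadoop-provided" build, is Spark's SPARK_DIST_CLASSPATH hook, which appends an
external classpath to every Spark process (a sketch; the Hive lib path is only
an example):

# in $SPARK_HOME/conf/spark-env.sh on each node
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# extra entries can be appended the same way, e.g.
# export SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/apache-hive-2.3.0-bin/lib/*"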

anyway a simple query is now working under Hive On Spark so i think i might
be over the hump.  Now its a matter of comparing the performance with Tez.

Cheers,
Stephen.


On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague  wrote:

> ok.. getting further.  seems now i have to deploy hive to all nodes in the
> cluster - don't think i had to do that before but not a big deal to do it
> now.
>
> for me:
> HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
> SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>
> on all three nodes now.
>
> i started spark master on the namenode and i started spark slaves (2) on
> two datanodes of the cluster.
>
> so far so good.
>
> now i run my usual test command.
>
> $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
> hive.execution.engine=spark; select date_key, count(*) from
> fe_inventory.merged_properties_hist group by 1 order by 1;'
>
> i get a little further now and find the stderr from the Spark Web UI
> interface (nice) and it reports this:
>
> 17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to 
> spark://Worker@172.19.79.127:40145
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>   at 
> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:47)
>   at 
> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:134)
>   at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>   ... 6 more
>
>
>
> searching around the internet i find this is probably a compatibility
> issue.
>
> i know. i know. no surprise here.
>
> so i guess i just got to the point where everybody else is... build spark
> w/o hive.
>
> lemme see what happens next.
>
>
>
>
>
> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague 
> wrote:
>
>> thanks.  I haven't had a chance to dig into this again today but i do
>> appreciate the pointer.  I'll keep you posted.
>>
>> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar 
>> wrote:
>>
>>> You can try increasing the value of hive.spark.client.connect.timeout.
>>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>>> gets launched in a YARN container (assuming you are running Spark in
>>> yarn-client mode), so you just have to find the logs for that container.
>>>
>>> --Sahil
>>>
>>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague 
>>> wrote:
>>>
>>>> i _seem_ to be getting closer.  Maybe its just wishful thinking.
>>>> Here's where i'm at now.
>>>>
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>>>> CreateSubmissionResponse:
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>>>   "action" : "CreateSubmissionResponse",
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>>>   "message" : "Driver successfully submitted as 
>>>> driver-20170926211038-0003",
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImp

Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
ion.*
>>> java.util.concurrent.ExecutionException: 
>>> java.util.concurrent.TimeoutException:
>>> Timed out waiting for client connection.
>>> at 
>>> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>>> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>>> at 
>>> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
>>> [hive-exec-2.3.0.jar:2.3.0]
>>> at 
>>> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
>>> [hive-exec-2.3.0.jar:2.3.0]
>>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.c
>>> reateRemoteClient(RemoteHiveSparkClient.java:101)
>>> [hive-exec-2.3.0.jar:2.3.0]
>>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<
>>> init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
>>> at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
>>> createHiveSparkClient(HiveSparkClientFactory.java:73)
>>> [hive-exec-2.3.0.jar:2.3.0]
>>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImp
>>> l.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
>>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionMan
>>> agerImpl.getSession(SparkSessionManagerImpl.java:115)
>>> [hive-exec-2.3.0.jar:2.3.0]
>>> at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSpark
>>> Session(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
>>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
>>> [hive-exec-2.3.0.jar:2.3.0]
>>>
>>>
>>> i'll dig some more tomorrow.
>>>
>>> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague 
>>> wrote:
>>>
>>>> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep
>>>> you posted on my progress.
>>>>
>>>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan <
>>>> gop...@apache.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>>>>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException:
>>>>> Failed to create spark client.
>>>>>
>>>>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>>>>> build.
>>>>>
>>>>> Build Hive first, use that version to build Spark, use that Spark
>>>>> version to rebuild Hive.
>>>>>
>>>>> I have to do this to make it work because Spark contains Hive jars and
>>>>> Hive contains Spark jars in the class-path.
>>>>>
>>>>> And specifically I have to edit the pom.xml files, instead of passing
>>>>> in params with -Dspark.version, because the installed pom files don't get
>>>>> replacements from the build args.
>>>>>
>>>>> Cheers,
>>>>> Gopal
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Sahil Takiar
>> Software Engineer at Cloudera
>> takiar.sa...@gmail.com | (510) 673-0309
>>
>
>


Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
thanks.  I haven't had a chance to dig into this again today but i do
appreciate the pointer.  I'll keep you posted.

On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar 
wrote:

> You can try increasing the value of hive.spark.client.connect.timeout.
> Would also suggest taking a look at the HoS Remote Driver logs. The driver
> gets launched in a YARN container (assuming you are running Spark in
> yarn-client mode), so you just have to find the logs for that container.
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague 
> wrote:
>
>> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
>> where i'm at now.
>>
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>> CreateSubmissionResponse:
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "action" : "CreateSubmissionResponse",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "message" : "Driver successfully submitted as driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "serverSparkVersion" : "2.2.0",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "submissionId" : "driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "success" : true
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.1
>> 9.73.136:8020 from dwr: closed
>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Clien
>> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
>> from dwr: stopped, remaining connections 0
>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e
>> main] client.SparkClientImpl: Timed out waiting for client to connect.
>> *Possible reasons include network issues, errors in remote driver or the
>> cluster has no available resources, etc.*
>> *Please check YARN or Spark driver's logs for further information.*
>> java.util.concurrent.ExecutionException: 
>> java.util.concurrent.TimeoutException:
>> Timed out waiting for client connection.
>> at 
>> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>> at 
>> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at 
>> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.c
>> reateRemoteClient(RemoteHiveSparkClient.java:101)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<
>> init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
>> createHiveSparkClient(HiveSparkClientFactory.java:73)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImp
>> l.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionMan
>> agerImpl.getSession(SparkSessionManagerImpl.java:115)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSpark
>> Session(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
>> [hive-exec-2.3.0.jar:2.3.0]
>>
>>
>> i'll dig some more tomorrow.
>>
>> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague 
>> wrote:
>>
>>> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep
>>> you posted on my progress.
>>>
>>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan >> > wrote:
>>>
>>>> Hi,
>>>>

Re: hive on spark - why is it so hard?

2017-09-27 Thread Sahil Takiar
You can try increasing the value of hive.spark.client.connect.timeout.
Would also suggest taking a look at the HoS Remote Driver logs. The driver
gets launched in a YARN container (assuming you are running Spark in
yarn-client mode), so you just have to find the logs for that container.
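
For illustration, a rough sketch of both suggestions; the timeout value, the query, and the
application id are placeholders rather than values from this thread:

  # bump the client connect timeout for one run and retry the failing query
  hive --hiveconf hive.execution.engine=spark \
       --hiveconf hive.spark.client.connect.timeout=30000ms \
       -e 'select count(*) from some_table;'

  # then pull the Remote Driver's YARN container logs for that application
  yarn logs -applicationId application_1506123456789_0001 | less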

--Sahil

On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague  wrote:

> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
> where i'm at now.
>
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
> CreateSubmissionResponse:
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "action" : "CreateSubmissionResponse",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "message" : "Driver successfully submitted as driver-20170926211038-0003",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "serverSparkVersion" : "2.2.0",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "submissionId" : "driver-20170926211038-0003",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "success" : true
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.
> 19.73.136:8020 from dwr: closed
> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
> Clien
> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
> from dwr: stopped, remaining connections 0
> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main]
> client.SparkClientImpl: Timed out waiting for client to connect.
> *Possible reasons include network issues, errors in remote driver or the
> cluster has no available resources, etc.*
> *Please check YARN or Spark driver's logs for further information.*
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.TimeoutException:
> Timed out waiting for client connection.
> at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
> at 
> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
> [hive-exec-2.3.0.jar:2.3.0]
> at 
> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.
> createRemoteClient(RemoteHiveSparkClient.java:101)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.
> RemoteHiveSparkClient.(RemoteHiveSparkClient.java:97)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
> createHiveSparkClient(HiveSparkClientFactory.java:73)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.session.
> SparkSessionImpl.open(SparkSessionImpl.java:62)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.session.
> SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.
> getSparkSession(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.optimizer.spark.
> SetSparkReducerParallelism.getSparkMemoryAndCores(
> SetSparkReducerParallelism.java:236) [hive-exec-2.3.0.jar:2.3.0]
>
>
> i'll dig some more tomorrow.
>
> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague 
> wrote:
>
>> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
>> posted on my progress.
>>
>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
>> wrote:
>>
>>> Hi,
>>>
>>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
>>> to create spark client.
>>>
>>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>>> build.
>>>
>>> Build Hive first, use that version to build Spark, use that Spark
>>> version to rebuild Hive.
>>>
>>> I have to do this to make it work because Spark contains Hive jars and
>>> Hive contains Spark jars in the class-path.
>>>
>>> And specifically I have to edit the pom.xml files, instead of passing in
>>> params with -Dspark.version, because the installed pom files don't get
>>> replacements from the build args.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>


-- 
Sahil Takiar
Software Engineer at Cloudera
takiar.sa...@gmail.com | (510) 673-0309


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
i _seem_ to be getting closer.  Maybe it's just wishful thinking.   Here's
where i'm at now.

2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
CreateSubmissionResponse:
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"action" : "CreateSubmissionResponse",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"message" : "Driver successfully submitted as driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"serverSparkVersion" : "2.2.0",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"submissionId" : "driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"success" : true
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: closed
2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Clien
t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: stopped, remaining connections 0
2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main]
client.SparkClientImpl: Timed out waiting for client to connect.
*Possible reasons include network issues, errors in remote driver or the
cluster has no available resources, etc.*
*Please check YARN or Spark driver's logs for further information.*
java.util.concurrent.ExecutionException:
java.util.concurrent.TimeoutException: Timed out waiting for client
connection.
at
io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
~[netty-all-4.0.29.Final.jar:4.0.29.Final]
at
org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.(RemoteHiveSparkClient.java:97)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
[hive-exec-2.3.0.jar:2.3.0]


i'll dig some more tomorrow.
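
One way to follow up on a submission id like driver-20170926211038-0003 is to ask the Spark
master for the driver's status; the master URL and REST port below are assumptions, not values
from this thread:

  # query the standalone master's REST endpoint for the submitted driver's state
  spark-submit --master spark://dwrdevnn1.sv2.trulia.com:6066 \
               --status driver-20170926211038-0003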

On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague  wrote:

> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
> posted on my progress.
>
> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
> wrote:
>
>> Hi,
>>
>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
>> to create spark client.
>>
>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>> build.
>>
>> Build Hive first, use that version to build Spark, use that Spark version
>> to rebuild Hive.
>>
>> I have to do this to make it work because Spark contains Hive jars and
>> Hive contains Spark jars in the class-path.
>>
>> And specifically I have to edit the pom.xml files, instead of passing in
>> params with -Dspark.version, because the installed pom files don't get
>> replacements from the build args.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
posted on my progress.

On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
wrote:

> Hi,
>
> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
> to create spark client.
>
> I get inexplicable errors with Hive-on-Spark unless I do a three step
> build.
>
> Build Hive first, use that version to build Spark, use that Spark version
> to rebuild Hive.
>
> I have to do this to make it work because Spark contains Hive jars and
> Hive contains Spark jars in the class-path.
>
> And specifically I have to edit the pom.xml files, instead of passing in
> params with -Dspark.version, because the installed pom files don't get
> replacements from the build args.
>
> Cheers,
> Gopal
>
>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
l(CalcitePlanner.java:286)
>> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze
>> (BaseSemanticAnalyzer.java:258)
>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
>> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java
>> :1316)
>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1236)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1226)
>> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriv
>> er.java:233)
>> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.
>> java:184)
>> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.
>> java:403)
>> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.
>> java:336)
>> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver
>> .java:787)
>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:483)
>> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>
>>
>> I bugs me that that class is in spark-core_2.11-2.2.0.jar yet so
>> seemingly out of reach. :(
>>
>>
>>
>> On Tue, Sep 26, 2017 at 2:44 PM, Sahil Takiar 
>> wrote:
>>
>>> Hey Stephen,
>>>
>>> Can you send the full stack trace for the NoClassDefFoundError? For Hive
>>> 2.3.0, we only support Spark 2.0.0. Hive may work with more recent versions
>>> of Spark, but we only test with Spark 2.0.0.
>>>
>>> --Sahil
>>>
>>> On Tue, Sep 26, 2017 at 2:35 PM, Stephen Sprague 
>>> wrote:
>>>
>>>> * i've installed hive 2.3 and spark 2.2
>>>>
>>>> * i've read this doc plenty of times -> https://cwiki.apache.org/confl
>>>> uence/display/Hive/Hive+on+Spark%3A+Getting+Started
>>>>
>>>> * i run this query:
>>>>
>>>>hive --hiveconf hive.root.logger=DEBUG,console -e 'set
>>>> hive.execution.engine=spark; select date_key, count(*) from
>>>> fe_inventory.merged_properties_hist group by 1 order by 1;'
>>>>
>>>>
>>>> * i get this error:
>>>>
>>>> *   Exception in thread "main" java.lang.NoClassDefFoundError:
>>>> org/apache/spark/scheduler/SparkListenerInterface*
>>>>
>>>>
>>>> * this class in:
>>>>   /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>>>>
>>>> * i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars
>>>>
>>>> * i have updated hive-site.xml to set spark.yarn.jars to it.
>>>>
>>>> * i see this is the console:
>>>>
>>>> 2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3
>>>> main] spark.HiveSparkClientFactory: load spark property from hive
>>>> configuration (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.co
>>>> m:8020/spark-2.2-jars/*).
>>>>
>>>> * i see this on the console
>>>>
>>>> 2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
>>>> main] client.SparkClientImpl: Running client driver with argv:
>>>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
>>>> /tmp/spark-submit.6105784757200912217.properties --class
>>>> org.apache.hive.spark.client.RemoteDriver
>>>> /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar --remote-host
>>>> dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
>>>> hive.spark.client.connect.timeout=1000 --conf
>>>> hive.spark.client.server.connect.timeout=9 --conf
>>>> hive.spark.client.channel.log.level=null --conf
>>>> hive.spark.client.rpc.max.size=52428800 --conf
>>>> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>>>> --conf hive.spark.client.rpc.server.address=null
>>>>
>>>> * i even print out CLASSPATH in this script:
>>>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit
>>>>
>>>> and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>>>> is in it.
>>>>
>>>> ​so i ask... what am i missing?
>>>>
>>>> thanks,
>>>> Stephen​
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Sahil Takiar
>>> Software Engineer at Cloudera
>>> takiar.sa...@gmail.com | (510) 673-0309
>>>
>>
>>
>
>
> --
> Sahil Takiar
> Software Engineer at Cloudera
> takiar.sa...@gmail.com | (510) 673-0309
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Gopal Vijayaraghavan
Hi,

> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark 
> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create 
> spark client.
 
I get inexplicable errors with Hive-on-Spark unless I do a three step build.

Build Hive first, use that version to build Spark, use that Spark version to 
rebuild Hive.

I have to do this to make it work because Spark contains Hive jars and Hive 
contains Spark jars in the class-path.

And specifically I have to edit the pom.xml files, instead of passing in params 
with -Dspark.version, because the installed pom files don't get replacements 
from the build args.
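
A rough sketch of that three-step sequence; the directory layout, the exact mvn flags, and the
pom properties to edit are assumptions for illustration rather than Gopal's exact commands:

  # 1) build Hive
  cd ~/src/hive && mvn clean install -DskipTests

  # 2) build Spark; first edit its pom.xml so the Hive version it references
  #    matches the Hive built in step 1 (the pom edit Gopal describes)
  cd ~/src/spark && ./build/mvn clean package -DskipTests

  # 3) edit Hive's pom.xml so spark.version matches the Spark built in step 2,
  #    then rebuild Hive against it
  cd ~/src/hive && mvn clean install -DskipTests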

Cheers,
Gopal




Re: hive on spark - why is it so hard?

2017-09-26 Thread Sahil Takiar
   at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>
>
> I bugs me that that class is in spark-core_2.11-2.2.0.jar yet so seemingly
> out of reach. :(
>
>
>
> On Tue, Sep 26, 2017 at 2:44 PM, Sahil Takiar 
> wrote:
>
>> Hey Stephen,
>>
>> Can you send the full stack trace for the NoClassDefFoundError? For Hive
>> 2.3.0, we only support Spark 2.0.0. Hive may work with more recent versions
>> of Spark, but we only test with Spark 2.0.0.
>>
>> --Sahil
>>
>> On Tue, Sep 26, 2017 at 2:35 PM, Stephen Sprague 
>> wrote:
>>
>>> * i've installed hive 2.3 and spark 2.2
>>>
>>> * i've read this doc plenty of times -> https://cwiki.apache.org/confl
>>> uence/display/Hive/Hive+on+Spark%3A+Getting+Started
>>>
>>> * i run this query:
>>>
>>>hive --hiveconf hive.root.logger=DEBUG,console -e 'set
>>> hive.execution.engine=spark; select date_key, count(*) from
>>> fe_inventory.merged_properties_hist group by 1 order by 1;'
>>>
>>>
>>> * i get this error:
>>>
>>> *   Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/spark/scheduler/SparkListenerInterface*
>>>
>>>
>>> * this class in:
>>>   /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>>>
>>> * i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars
>>>
>>> * i have updated hive-site.xml to set spark.yarn.jars to it.
>>>
>>> * i see this is the console:
>>>
>>> 2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3
>>> main] spark.HiveSparkClientFactory: load spark property from hive
>>> configuration (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.co
>>> m:8020/spark-2.2-jars/*).
>>>
>>> * i see this on the console
>>>
>>> 2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
>>> main] client.SparkClientImpl: Running client driver with argv:
>>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
>>> /tmp/spark-submit.6105784757200912217.properties --class
>>> org.apache.hive.spark.client.RemoteDriver 
>>> /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar
>>> --remote-host dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
>>> hive.spark.client.connect.timeout=1000 --conf
>>> hive.spark.client.server.connect.timeout=9 --conf
>>> hive.spark.client.channel.log.level=null --conf
>>> hive.spark.client.rpc.max.size=52428800 --conf
>>> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>>> --conf hive.spark.client.rpc.server.address=null
>>>
>>> * i even print out CLASSPATH in this script:
>>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit
>>>
>>> and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>>> is in it.
>>>
>>> ​so i ask... what am i missing?
>>>
>>> thanks,
>>> Stephen​
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Sahil Takiar
>> Software Engineer at Cloudera
>> takiar.sa...@gmail.com | (510) 673-0309
>>
>
>


-- 
Sahil Takiar
Software Engineer at Cloudera
takiar.sa...@gmail.com | (510) 673-0309


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
stack trace for the NoClassDefFoundError? For Hive
> 2.3.0, we only support Spark 2.0.0. Hive may work with more recent versions
> of Spark, but we only test with Spark 2.0.0.
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 2:35 PM, Stephen Sprague 
> wrote:
>
>> * i've installed hive 2.3 and spark 2.2
>>
>> * i've read this doc plenty of times -> https://cwiki.apache.org/confl
>> uence/display/Hive/Hive+on+Spark%3A+Getting+Started
>>
>> * i run this query:
>>
>>hive --hiveconf hive.root.logger=DEBUG,console -e 'set
>> hive.execution.engine=spark; select date_key, count(*) from
>> fe_inventory.merged_properties_hist group by 1 order by 1;'
>>
>>
>> * i get this error:
>>
>> *   Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/scheduler/SparkListenerInterface*
>>
>>
>> * this class in:
>>   /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>>
>> * i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars
>>
>> * i have updated hive-site.xml to set spark.yarn.jars to it.
>>
>> * i see this is the console:
>>
>> 2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3
>> main] spark.HiveSparkClientFactory: load spark property from hive
>> configuration (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.co
>> m:8020/spark-2.2-jars/*).
>>
>> * i see this on the console
>>
>> 2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
>> main] client.SparkClientImpl: Running client driver with argv:
>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
>> /tmp/spark-submit.6105784757200912217.properties --class
>> org.apache.hive.spark.client.RemoteDriver 
>> /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar
>> --remote-host dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
>> hive.spark.client.connect.timeout=1000 --conf
>> hive.spark.client.server.connect.timeout=9 --conf
>> hive.spark.client.channel.log.level=null --conf
>> hive.spark.client.rpc.max.size=52428800 --conf
>> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>> --conf hive.spark.client.rpc.server.address=null
>>
>> * i even print out CLASSPATH in this script:
>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit
>>
>> and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is
>> in it.
>>
>> ​so i ask... what am i missing?
>>
>> thanks,
>> Stephen​
>>
>>
>>
>>
>>
>>
>
>
> --
> Sahil Takiar
> Software Engineer at Cloudera
> takiar.sa...@gmail.com | (510) 673-0309
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Sahil Takiar
Hey Stephen,

Can you send the full stack trace for the NoClassDefFoundError? For Hive
2.3.0, we only support Spark 2.0.0. Hive may work with more recent versions
of Spark, but we only test with Spark 2.0.0.
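
For reference, one way to check which Spark version a given Hive release was built against is
to look at the spark.version property in the Hive source tree's root pom (a sketch; assumes a
source checkout of the matching Hive release):

  grep -m1 '<spark.version>' pom.xml
  # for Hive 2.3.0 this is expected to show 2.0.0, matching the statement above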

--Sahil

On Tue, Sep 26, 2017 at 2:35 PM, Stephen Sprague  wrote:

> * i've installed hive 2.3 and spark 2.2
>
> * i've read this doc plenty of times -> https://cwiki.apache.org/
> confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>
> * i run this query:
>
>hive --hiveconf hive.root.logger=DEBUG,console -e 'set
> hive.execution.engine=spark; select date_key, count(*) from
> fe_inventory.merged_properties_hist group by 1 order by 1;'
>
>
> * i get this error:
>
> *   Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/scheduler/SparkListenerInterface*
>
>
> * this class in:
>   /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>
> * i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars
>
> * i have updated hive-site.xml to set spark.yarn.jars to it.
>
> * i see this is the console:
>
> 2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3 main]
> spark.HiveSparkClientFactory: load spark property from hive configuration
> (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*
> ).
>
> * i see this on the console
>
> 2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
> client.SparkClientImpl: Running client driver with argv:
> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
> /tmp/spark-submit.6105784757200912217.properties --class
> org.apache.hive.spark.client.RemoteDriver 
> /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar
> --remote-host dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
> hive.spark.client.connect.timeout=1000 --conf 
> hive.spark.client.server.connect.timeout=9
> --conf hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
> --conf hive.spark.client.rpc.server.address=null
>
> * i even print out CLASSPATH in this script: /usr/lib/spark-2.2.0-bin-
> hadoop2.6/bin/spark-submit
>
> and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is
> in it.
>
> ​so i ask... what am i missing?
>
> thanks,
> Stephen​
>
>
>
>
>
>


-- 
Sahil Takiar
Software Engineer at Cloudera
takiar.sa...@gmail.com | (510) 673-0309


hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
* i've installed hive 2.3 and spark 2.2

* i've read this doc plenty of times ->
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

* i run this query:

   hive --hiveconf hive.root.logger=DEBUG,console -e 'set
hive.execution.engine=spark; select date_key, count(*) from
fe_inventory.merged_properties_hist group by 1 order by 1;'


* i get this error:

*   Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/scheduler/SparkListenerInterface*


* this class is in:
  /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar

* i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars

* i have updated hive-site.xml to set spark.yarn.jars to it.

* i see this in the console:

2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3 main]
spark.HiveSparkClientFactory: load spark property from hive configuration
(spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*).

* i see this on the console

2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
client.SparkClientImpl: Running client driver with argv:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
/tmp/spark-submit.6105784757200912217.properties --class
org.apache.hive.spark.client.RemoteDriver
/usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar --remote-host
dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
hive.spark.client.connect.timeout=1000 --conf
hive.spark.client.server.connect.timeout=9 --conf
hive.spark.client.channel.log.level=null --conf
hive.spark.client.rpc.max.size=52428800 --conf
hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
--conf hive.spark.client.rpc.server.address=null

* i even print out CLASSPATH in this script:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit

and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is in
it.

​so i ask... what am i missing?

thanks,
Stephen​
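
For reference, the staging and configuration steps listed above correspond roughly to the
following; paths are the ones mentioned in this message and should be treated as illustrative:

  # copy the Spark jars into HDFS so the YARN containers can find them
  hdfs dfs -mkdir -p /spark-2.2-jars
  hdfs dfs -put /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/*.jar /spark-2.2-jars/

  # point Hive at them (spark.yarn.jars can also go in hive-site.xml)
  hive --hiveconf hive.execution.engine=spark \
       --hiveconf 'spark.yarn.jars=hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*' \
       -e 'select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;'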


Re: Hive on Spark

2017-08-22 Thread Vihang Karajgaonkar
Xuefu is planning to give a talk on Hive-on-Spark @Uber at the user meetup
this week. We can check if he can share the presentation on this list for
folks who can't attend the meetup.

https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/


On Mon, Aug 21, 2017 at 11:44 PM, peter zhang 
wrote:

> Hi All,
> Has anybody used hive on spark in your production environment? How
> does it's the stability and performance compared with spark sql?
> Hope anybody can share your experience.
>
> Thanks in advance!
>


Hive on Spark

2017-08-21 Thread peter zhang
Hi All,
Has anybody used hive on spark in your production environment? How
is its stability and performance compared with spark sql?
Hope somebody can share their experience.

Thanks in advance!


Re: hive on spark - version question

2017-03-18 Thread yuxh
I met the same problem; it seems JavaSparkListener has been deleted in Spark 2.
But I have seen someone using hive 1.2.1 with spark 2 OK. I haven't tried it yet.




------ Original message ------
From: "Stephen Sprague"
Sent: 2017-03-18 (Sat) 2:33
To: "user@hive.apache.org"
Subject: Re: hive on spark - version question



:(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work with 
Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker to me, 
alas.


thanks in advance.


Cheers,

Stephen.



On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague  wrote:
hi guys,

wondering where we stand with Hive On Spark these days?


i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental versions) 
and running up against this class not found:

java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener



searching the Cyber i find this:
1. 
http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive


which pretty much describes my situation too and it references this:



2. https://issues.apache.org/jira/browse/SPARK-17563



which indicates a "won't fix" - but does reference this:



3. https://issues.apache.org/jira/browse/HIVE-14029


which looks to be fixed in hive 2.2 - which is not released yet.



so if i want to use spark 2.1.0 with hive am i out of luck - until hive 2.2?


thanks,

Stephen.

Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
yeah but... is the glass half-full or half-empty?  sure this might suck but
keep your head high, bro! Lots of it (hive) does work. :)


On Fri, Mar 17, 2017 at 2:25 PM, hernan saab 
wrote:

> Stephan,
>
> Thanks for the response.
>
> The one thing that I don't appreciate from those who promote and DOCUMENT
> spark on hive is that, seemingly, there is absolutely no evidence seen that
> says that hive on spark WORKS.
> As a matter of fact, after a lot of pain, I noticed it is not supported by
> just about anybody.
>
> If someone dares to document Hive on Spark (see link
> https://cwiki.apache.org/confluence/display/Hive/Hive+
> on+Spark%3A+Getting+Started)  why can't they have the decency to mention
> what specific combo of Hadoop/Spark/Hive versions used that works? Have a
> git repo included in a doc with all the right versions and libraries. Why
> not? We can start from there and progressively use newer libraries in case
> the doc becomes stale. I am not really asking much, I just want to know
> what the documenter used to claim that Hive on Spark works, that's it.
>
> Clearly, for most cases, this setup is broken and it misleads people to
> waste time on a broken setup.
>
> I love this tech. But I do notice that there is some mean spirited or very
> negligent actions made by the apache development community. Documenting
> hive on spark while knowing it won't work for most cases means apache
> developers don't give a crap about the time wasted by people like us.
>
>
>
>
> On Friday, March 17, 2017 1:14 PM, Edward Capriolo 
> wrote:
>
>
>
>
> On Fri, Mar 17, 2017 at 2:56 PM, hernan saab  > wrote:
>
> I have been in a similar world of pain. Basically, I tried to use an
> external Hive to have user access controls with a spark engine.
> At the end, I realized that it was a better idea to use apache tez instead
> of a spark engine for my particular case.
>
> But the journey is what I want to share with you.
> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop ,
> Parquet etc etc are not interchangeable as we would like to think. There
> are very limited combinations for very specific versions. This is why tools
> like Ambari can be useful. Ambari sets a path of combos of versions known
> to work and the dirty work is done under the UI.
>
> More often than not, when you try a version that few people tried, you
> will get error messages that will derailed you and cause you to waste a lot
> of time.
>
> In addition, this group, as well as many other apache big data user
> groups,  provides extremely poor support for users. The answers you usually
> get are not even hints to a solution. Their answers usually translate to
> "there is nothing I am willing to do about your problem. If I did, I should
> get paid" in many cryptic ways.
>
> If you ask your question to the Spark group they will take you to the Hive
> group and viceversa (I can almost guarantee it based on previous
> experiences)
>
> But in hindsight, people who work on this kinds of things typically make
> more money that the average developers. If you make more $$s it makes sense
> learning this stuff is supposed to be harder.
>
> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
> you are querying large files.
>
>
>
> On Friday, March 17, 2017 11:33 AM, Stephen Sprague 
> wrote:
>
>
> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
> with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
> to me, alas.
>
> thanks in advance.
>
> Cheers,
> Stephen.
>
> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague 
> wrote:
>
> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang. NoClassDefFoundError: org/apache/spark/ JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/ questions/41953688/setting-
> spark-as-default-execution- engine-for-hive
> <http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive>
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/ jira/browse/SPARK-17563
> <https://issues.apache.org/jira/browse/SPARK-17563>
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/ jira/browse/HIVE-14029
> <https://issues.apache.org/jira/browse/HIVE-14029>
>
> which looks to be fixed in hiv

Re: hive on spark - version question

2017-03-17 Thread hernan saab
Stephan,
Thanks for the response.
The one thing that I don't appreciate from those who promote and DOCUMENT spark 
on hive is that, seemingly, there is absolutely no evidence seen that says that 
hive on spark WORKS. As a matter of fact, after a lot of pain, I noticed it is 
not supported by just about anybody.
If someone dares to document Hive on Spark (see link 
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
  why can't they have the decency to mention what specific combo of 
Hadoop/Spark/Hive versions used that works? Have a git repo included in a doc 
with all the right versions and libraries. Why not? We can start from there and 
progressively use newer libraries in case the doc becomes stale. I am not 
really asking much, I just want to know what the documenter used to claim that 
Hive on Spark works, that's it.
Clearly, for most cases, this setup is broken and it misleads people to waste 
time on a broken setup.
I love this tech. But I do notice that there is some mean spirited or very 
negligent actions made by the apache development community. Documenting hive on 
spark while knowing it won't work for most cases means apache developers don't 
give a crap about the time wasted by people like us.

 

On Friday, March 17, 2017 1:14 PM, Edward Capriolo  
wrote:
 

 

On Fri, Mar 17, 2017 at 2:56 PM, hernan saab  
wrote:

I have been in a similar world of pain. Basically, I tried to use an external 
Hive to have user access controls with a spark engine.At the end, I realized 
that it was a better idea to use apache tez instead of a spark engine for my 
particular case.
But the journey is what I want to share with you.The big data apache tools and 
libraries such as Hive, Tez, Spark, Hadoop , Parquet etc etc are not 
interchangeable as we would like to think. There are very limited combinations 
for very specific versions. This is why tools like Ambari can be useful. Ambari 
sets a path of combos of versions known to work and the dirty work is done 
under the UI. 
More often than not, when you try a version that few people tried, you will get 
error messages that will derailed you and cause you to waste a lot of time.
In addition, this group, as well as many other apache big data user groups,  
provides extremely poor support for users. The answers you usually get are not 
even hints to a solution. Their answers usually translate to "there is nothing 
I am willing to do about your problem. If I did, I should get paid" in many 
cryptic ways.
If you ask your question to the Spark group they will take you to the Hive 
group and viceversa (I can almost guarantee it based on previous experiences)
But in hindsight, people who work on this kinds of things typically make more 
money that the average developers. If you make more $$s it makes sense learning 
this stuff is supposed to be harder.
Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if you 
are querying large files.
 

On Friday, March 17, 2017 11:33 AM, Stephen Sprague  
wrote:
 

 :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work 
with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker to 
me, alas.

thanks in advance.

Cheers,
Stephen.

On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague  wrote:

hi guys,
wondering where we stand with Hive On Spark these days?

i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental versions) 
and running up against this class not found:

java.lang. NoClassDefFoundError: org/apache/spark/ JavaSparkListener


searching the Cyber i find this:
    1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive

    which pretty much describes my situation too and it references this:


    2. https://issues.apache.org/jira/browse/SPARK-17563

    which indicates a "won't fix" - but does reference this:


    3. https://issues.apache.org/jira/browse/HIVE-14029

    which looks to be fixed in hive 2.2 - which is not released yet.


so if i want to use spark 2.1.0 with hive am i out of luck - until hive 2.2?

thanks,
Stephen.





   

Stephan,  
I understand some of your frustration.  Remember that many in open source are 
volunteering their time. This is why if you pay a vendor for support of some 
software you might pay 50K a year or $200.00 an hour. If I was your 
vendor/consultant I would have started the clock 10 minutes ago just to answer 
this email :). The only "pay" I ever got from Hive is that I can use it as a 
resume bullet point, and I wrote a book which pays me royalties.
As it relates specifically to your problem, when you see the trends you are 
seeing it probably means you are in a minority of the user base. Either your 
doing something no one else is doing, you are too cutting edge, or no one has 
an easy solution. Hive is making the move from the classic MapReduce, two 

Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
thanks for the comments and for sure all relevant. And yeah I feel the pain
just like the next guy but that's the part of the opensource "life style"
you subscribe to when using it.

The upside payoff has gotta be worth the downside risk - or else forget
about it right? Here in the Hive world in my experience anyway it's been
great.  Gotta roll with it, be courteous, be persistent and sometimes
things just work out.

Getting back to Spark and Tez, yes by all means i'm a big Tez user already so
i was hoping to see what Spark brought to the table and i didn't want to diddle
around with Spark < 2.0.   That's cool. I can live with that not being
nailed down yet. I'll just wait for hive 2.2 and rattle the cage again! ha!


All good!

Cheers,
Stephen.

On Fri, Mar 17, 2017 at 1:14 PM, Edward Capriolo 
wrote:

>
>
> On Fri, Mar 17, 2017 at 2:56 PM, hernan saab  > wrote:
>
>> I have been in a similar world of pain. Basically, I tried to use an
>> external Hive to have user access controls with a spark engine.
>> At the end, I realized that it was a better idea to use apache tez
>> instead of a spark engine for my particular case.
>>
>> But the journey is what I want to share with you.
>> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop
>> , Parquet etc etc are not interchangeable as we would like to think. There
>> are very limited combinations for very specific versions. This is why tools
>> like Ambari can be useful. Ambari sets a path of combos of versions known
>> to work and the dirty work is done under the UI.
>>
>> More often than not, when you try a version that few people tried, you
>> will get error messages that will derailed you and cause you to waste a lot
>> of time.
>>
>> In addition, this group, as well as many other apache big data user
>> groups,  provides extremely poor support for users. The answers you usually
>> get are not even hints to a solution. Their answers usually translate to
>> "there is nothing I am willing to do about your problem. If I did, I should
>> get paid" in many cryptic ways.
>>
>> If you ask your question to the Spark group they will take you to the
>> Hive group and viceversa (I can almost guarantee it based on previous
>> experiences)
>>
>> But in hindsight, people who work on this kinds of things typically make
>> more money that the average developers. If you make more $$s it makes sense
>> learning this stuff is supposed to be harder.
>>
>> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
>> you are querying large files.
>>
>>
>>
>> On Friday, March 17, 2017 11:33 AM, Stephen Sprague 
>> wrote:
>>
>>
>> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will
>> work with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal
>> breaker to me, alas.
>>
>> thanks in advance.
>>
>> Cheers,
>> Stephen.
>>
>> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague 
>> wrote:
>>
>> hi guys,
>> wondering where we stand with Hive On Spark these days?
>>
>> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
>> versions) and running up against this class not found:
>>
>> java.lang. NoClassDefFoundError: org/apache/spark/ JavaSparkListener
>>
>>
>> searching the Cyber i find this:
>> 1. http://stackoverflow.com/ questions/41953688/setting-
>> spark-as-default-execution- engine-for-hive
>> <http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive>
>>
>> which pretty much describes my situation too and it references this:
>>
>>
>> 2. https://issues.apache.org/ jira/browse/SPARK-17563
>> <https://issues.apache.org/jira/browse/SPARK-17563>
>>
>> which indicates a "won't fix" - but does reference this:
>>
>>
>> 3. https://issues.apache.org/ jira/browse/HIVE-14029
>> <https://issues.apache.org/jira/browse/HIVE-14029>
>>
>> which looks to be fixed in hive 2.2 - which is not released yet.
>>
>>
>> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
>> 2.2?
>>
>> thanks,
>> Stephen.
>>
>>
>>
>>
>>
> Stephan,
>
> I understand some of your frustration.  Remember that many in open source
> are volunteering their time. This is why if you pay a vendor for support of
> some software you might pay 50K a year or $200.00 an hour. If I was your
> vendor/consultant I would have starte

Re: hive on spark - version question

2017-03-17 Thread Edward Capriolo
On Fri, Mar 17, 2017 at 2:56 PM, hernan saab 
wrote:

> I have been in a similar world of pain. Basically, I tried to use an
> external Hive to have user access controls with a spark engine.
> At the end, I realized that it was a better idea to use apache tez instead
> of a spark engine for my particular case.
>
> But the journey is what I want to share with you.
> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop ,
> Parquet etc etc are not interchangeable as we would like to think. There
> are very limited combinations for very specific versions. This is why tools
> like Ambari can be useful. Ambari sets a path of combos of versions known
> to work and the dirty work is done under the UI.
>
> More often than not, when you try a version that few people tried, you
> will get error messages that will derailed you and cause you to waste a lot
> of time.
>
> In addition, this group, as well as many other apache big data user
> groups,  provides extremely poor support for users. The answers you usually
> get are not even hints to a solution. Their answers usually translate to
> "there is nothing I am willing to do about your problem. If I did, I should
> get paid" in many cryptic ways.
>
> If you ask your question to the Spark group they will take you to the Hive
> group and viceversa (I can almost guarantee it based on previous
> experiences)
>
> But in hindsight, people who work on this kinds of things typically make
> more money that the average developers. If you make more $$s it makes sense
> learning this stuff is supposed to be harder.
>
> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
> you are querying large files.
>
>
>
> On Friday, March 17, 2017 11:33 AM, Stephen Sprague 
> wrote:
>
>
> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
> with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
> to me, alas.
>
> thanks in advance.
>
> Cheers,
> Stephen.
>
> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague 
> wrote:
>
> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang. NoClassDefFoundError: org/apache/spark/ JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/ questions/41953688/setting-
> spark-as-default-execution- engine-for-hive
> <http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive>
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/ jira/browse/SPARK-17563
> <https://issues.apache.org/jira/browse/SPARK-17563>
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/ jira/browse/HIVE-14029
> <https://issues.apache.org/jira/browse/HIVE-14029>
>
> which looks to be fixed in hive 2.2 - which is not released yet.
>
>
> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
> 2.2?
>
> thanks,
> Stephen.
>
>
>
>
>
Stephan,

I understand some of your frustration.  Remember that many in open source
are volunteering their time. This is why if you pay a vendor for support of
some software you might pay 50K a year or $200.00 an hour. If I was your
vendor/consultant I would have started the clock 10 minutes ago just to
answer this email :). The only "pay" I ever got from Hive is that I can use
it as a resume bullet point, and I wrote a book which pays me royalties.

As it relates specifically to your problem, when you see the trends you are
seeing it probably means you are in a minority of the user base. Either
you're doing something no one else is doing, you are too cutting edge, or no
one has an easy solution. Hive is making the move from the classic
MapReduce; two other execution engines have been made, Tez and HiveOnSpark.
Because we are open source we allow people to "scratch an itch"; that is the
Apache way. From time to time it means something that was added stops being
viable because of lack of support.

I agree with your final assessment, which is that Tez is the most viable engine
for Hive. This is by no means a put-down of the HiveOnSpark work, and it
does not mean it will never be the most viable. By the same token, if the
versions fall out of sync and all that exists is complaints, the viability
speaks for itself.

Remember that keeping two fast moving things together is no easy chore. I
used to run Hive + cassandra. Seems easy, crap two versions of common CLI,
shade one version everything works, crap new hive release has different
versions of thrift, shade + patch, crap now one of the other dependencies
is incompatible fork + shade + patch. At some point you have to say to
yourself if I can not make critical mass of this solution such that I am
the only one doing/patching it then I give up and find some other way to do
it.


Re: hive on spark - version question

2017-03-17 Thread hernan saab
I have been in a similar world of pain. Basically, I tried to use an external
Hive to have user access controls with a spark engine. At the end, I realized
that it was a better idea to use apache tez instead of a spark engine for my
particular case.

But the journey is what I want to share with you. The big data apache tools and
libraries such as Hive, Tez, Spark, Hadoop, Parquet, etc. are not
interchangeable as we would like to think. There are very limited combinations
of very specific versions. This is why tools like Ambari can be useful. Ambari
sets a path of combos of versions known to work, and the dirty work is done
under the UI.

More often than not, when you try a version that few people have tried, you will
get error messages that will derail you and cause you to waste a lot of time.

In addition, this group, as well as many other apache big data user groups,
provides extremely poor support for users. The answers you usually get are not
even hints to a solution. Their answers usually translate to "there is nothing
I am willing to do about your problem. If I did, I should get paid" in many
cryptic ways.

If you ask your question to the Spark group they will take you to the Hive
group and vice versa (I can almost guarantee it based on previous experiences).

But in hindsight, people who work on these kinds of things typically make more
money than the average developer. If you make more $$s, it makes sense that
learning this stuff is supposed to be harder.

Conclusion: don't try it. Or try using Tez/Hive instead of Spark/Hive if you
are querying large files.
 

On Friday, March 17, 2017 11:33 AM, Stephen Sprague  
wrote:
 

 :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work 
with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker to 
me, alas.

thanks in advance.

Cheers,
Stephen.

On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague  wrote:

hi guys,
wondering where we stand with Hive On Spark these days?

i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental versions) 
and running up against this class not found:

java.lang. NoClassDefFoundError: org/apache/spark/ JavaSparkListener


searching the Cyber i find this:
    1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive

    which pretty much describes my situation too and it references this:


    2. https://issues.apache.org/jira/browse/SPARK-17563

    which indicates a "won't fix" - but does reference this:


    3. https://issues.apache.org/jira/browse/HIVE-14029

    which looks to be fixed in hive 2.2 - which is not released yet.


so if i want to use spark 2.1.0 with hive am i out of luck - until hive 2.2?

thanks,
Stephen.





   

Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
:(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
to me, alas.

thanks in advance.

Cheers,
Stephen.

On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague 
wrote:

> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/questions/41953688/setting-
> spark-as-default-execution-engine-for-hive
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/jira/browse/SPARK-17563
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/jira/browse/HIVE-14029
>
> which looks to be fixed in hive 2.2 - which is not released yet.
>
>
> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
> 2.2?
>
> thanks,
> Stephen.
>
>


hive on spark - version question

2017-03-13 Thread Stephen Sprague
hi guys,
wondering where we stand with Hive On Spark these days?

i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
versions) and running up against this class not found:

java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener


searching the Cyber i find this:
1.
http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive

which pretty much describes my situation too and it references this:


2. https://issues.apache.org/jira/browse/SPARK-17563

which indicates a "won't fix" - but does reference this:


3. https://issues.apache.org/jira/browse/HIVE-14029

which looks to be fixed in hive 2.2 - which is not released yet.


so if i want to use spark 2.1.0 with hive am i out of luck - until hive 2.2?

thanks,
Stephen.
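
A quick way to confirm the class-level mismatch described above is to look for the class
inside the Spark distribution's jars; the install path below is an assumption, adjust it to
the local layout:

  # HIVE-14029 tracks replacing JavaSparkListener, which Spark 2.x no longer ships
  unzip -l /usr/lib/spark-2.1.0-bin-hadoop2.6/jars/spark-core_2.11-2.1.0.jar \
      | grep JavaSparkListener || echo "JavaSparkListener not present in this Spark build"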


Re: Need inputs on configuring hive timeout + hive on spark : Job hasn't been submitted after 61s. Aborting it.

2017-02-18 Thread Ian Cook
Naresh,

The properties hive.spark.job.monitor.timeout and
hive.spark.client.server.connect.timeout in hive-site.xml control Hive on Spark
timeouts. Details at
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Spark
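
A sketch of how those might be set for a single run; the values are illustrative only (the 61s
in the error above appears to line up with the 60s default of the job monitor timeout), and
both properties can equally go in hive-site.xml:

  # placeholder script name and values; raise the monitor and connect timeouts for this run
  hive --hiveconf hive.spark.job.monitor.timeout=180s \
       --hiveconf hive.spark.client.server.connect.timeout=300000ms \
       -f my_queries.hql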

Ian Cook
Cloudera

On Thu, Feb 16, 2017 at 2:24 PM, naresh gundla 
wrote:

> Hello,
>
>
> i am facing this issue "Job hasn't been submitted after 61s. Aborting it."
> when i am running multiple hive queries.
>
> Details: (Hive on Spark)
> I am using spark dynamic allocation and external shuffle service (yarn)
>
> Assume one queries is using all of the resources in the cluster and when
> the new querie launched then it throws with this error in hive log
>
> 2017-02-16 06:12:59,166 INFO  [main]: status.SparkJobMonitor
> (RemoteSparkJobMonitor.java:startMonitor(67)) -* Job hasn't been
> submitted after 61s. Aborting it.*
> 2017-02-16 06:12:59,166 ERROR [main]: status.SparkJobMonitor
> (SessionState.java:printError(960)) - Status: SENT
> 2017-02-16 06:12:59,167 INFO  [main]: log.PerfLogger
> (PerfLogger.java:PerfLogEnd(148)) -  start=1487254318158 end=1487254379167 duration=61009
> from=org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor>
> 2017-02-16 06:12:59,183 ERROR [main]: ql.Driver
> (SessionState.java:printError(960)) - FAILED: Execution Error, return
> code 2 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
> 2017-02-16 06:12:59,184 INFO  [main]: log.PerfLogger
> (PerfLogger.java:PerfLogEnd(148)) -  start=1487254317999 end=1487254379184 duration=61185
> from=org.apache.hadoop.hive.ql.Driver>
> 2017-02-16 06:12:59,184 INFO  [main]: log.PerfLogger
> (PerfLogger.java:PerfLogBegin(121)) -  from=org.apache.hadoop.hive.ql.Driver>
> 2017-02-16 06:12:59,184 INFO  [main]: log.PerfLogger
> (PerfLogger.java:PerfLogEnd(148)) -  start=1487254379184 end=1487254379184 duration=0
> from=org.apache.hadoop.hive.ql.Driver>
> 2017-02-16 06:12:59,201 INFO  [main]: log.PerfLogger
> (PerfLogger.java:PerfLogBegin(121)) -  from=org.apache.hadoop.hive.ql.Driver>
> 2017-02-16 06:12:59,202 INFO  [main]: log.PerfLogger
> (PerfLogger.java:PerfLogEnd(148)) -  start=1487254379201 end=1487254379202 duration=1
> from=org.apache.hadoop.hive.ql.Driver>
>
> Is there any parameter to config , that the the query should wait until it
> get the requried resources and it should not fail.
>
>
> Thanks,
> Naresh
>


Need inputs on configuring hive timeout + hive on spark : Job hasn't been submitted after 61s. Aborting it.

2017-02-16 Thread naresh gundla
Hello,


i am facing this issue "Job hasn't been submitted after 61s. Aborting it."
when i am running multiple hive queries.

Details: (Hive on Spark)
I am using spark dynamic allocation and external shuffle service (yarn)
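
The settings referred to on the previous line typically look like the following; the values
are illustrative assumptions, not the poster's actual configuration:

  # dynamic allocation with the YARN external shuffle service
  # (can be set via hive-site.xml, spark-defaults.conf, or --hiveconf)
  hive --hiveconf spark.dynamicAllocation.enabled=true \
       --hiveconf spark.shuffle.service.enabled=true \
       --hiveconf spark.dynamicAllocation.minExecutors=1 \
       --hiveconf spark.dynamicAllocation.maxExecutors=20 \
       -f queries.hql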

Assume one query is using all of the resources in the cluster, and when
a new query is launched it throws this error in the hive log

2017-02-16 06:12:59,166 INFO  [main]: status.SparkJobMonitor
(RemoteSparkJobMonitor.java:startMonitor(67)) -* Job hasn't been submitted
after 61s. Aborting it.*
2017-02-16 06:12:59,166 ERROR [main]: status.SparkJobMonitor
(SessionState.java:printError(960)) - Status: SENT
2017-02-16 06:12:59,167 INFO  [main]: log.PerfLogger
(PerfLogger.java:PerfLogEnd(148)) - 
2017-02-16 06:12:59,183 ERROR [main]: ql.Driver
(SessionState.java:printError(960)) - FAILED: Execution Error, return code
2 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
2017-02-16 06:12:59,184 INFO  [main]: log.PerfLogger
(PerfLogger.java:PerfLogEnd(148)) - 
2017-02-16 06:12:59,184 INFO  [main]: log.PerfLogger
(PerfLogger.java:PerfLogBegin(121)) - 
2017-02-16 06:12:59,184 INFO  [main]: log.PerfLogger
(PerfLogger.java:PerfLogEnd(148)) - 
2017-02-16 06:12:59,201 INFO  [main]: log.PerfLogger
(PerfLogger.java:PerfLogBegin(121)) - 
2017-02-16 06:12:59,202 INFO  [main]: log.PerfLogger
(PerfLogger.java:PerfLogEnd(148)) - 

Is there any parameter to configure so that the query waits until it
gets the required resources instead of failing?


Thanks,
Naresh


hive on spark ,three tables(one is small, others are big),cannot go mapjoin

2017-01-03 Thread Maria
Hi all,
   I have a question.
My test HQL is:
"select tmp.src_ip,c.to_ip from (select a.src_ip,b.appid from small_tbl a join 
im b on a.src_ip=b.src_ip) tmp join email c on tmp.appid=c.appid"; im and 
email are big tables.
With hive.execution.engine=mr, the execution plan generates two mapjoin stages; 
with hive.execution.engine=spark, the execution plan generates one map join and 
one common join. That is to say,
"(select a.src_ip,b.appid from small_tbl a join im b on a.src_ip=b.src_ip)" goes 
through mapjoin, and its result "tmp" has only 10 rows, BUT "tmp join email" cannot 
go through mapjoin.
I debugged the code:


In hive-on-spark:
(1) (select a.src_ip,b.appid from small_tbl a join im b on a.src_ip=b.src_ip) 
->>>  MapWork.getMapredLocalWork() is OK; there is one 
MapRedLocalWork object.
(2) For the result of the previous stage, named 'tmp', joined with email, 
MapWork.getMapredLocalWork() is null.


Why can hive on spark not use mapjoin in this case? Thank you.
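
A hedged sketch of the settings that usually decide whether that second join
converts to a mapjoin. On the Spark engine the conversion is typically a
compile-time decision driven by the planner's size estimate for "tmp", not by
its actual 10 rows, so the estimate has to fall under the threshold below.
Parameter names are standard Hive settings; the values are illustrative:

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- combined small-table size, in bytes, under which a mapjoin is generated
set hive.auto.convert.join.noconditionaltask.size=10000000;
-- column statistics on im and small_tbl help the planner estimate tmp's size
set hive.stats.fetch.column.stats=true;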







Re: please give me the permission to update the wiki of hive on spark

2017-01-03 Thread Lefty Leverenz
Done.  Welcome to the Hive wiki team, Kelly, and happy new year!

-- Lefty


On Mon, Jan 2, 2017 at 5:40 PM, Zhang, Liyun  wrote:

> Hi
>
>   I want to update the Hive on Spark wiki
> <https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started>
> because of HIVE-8373. My Confluence
> <https://cwiki.apache.org/confluence/signup.action> username is kellyzly;
> please grant me the privilege to update the wiki.
>
>
>
>
>
> Best Regards
>
> Kelly Zhang/Zhang,Liyun
>
>
>


please give me the permission to update the wiki of hive on spark

2017-01-02 Thread Zhang, Liyun
Hi

  I want to update the Hive on Spark wiki
<https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started>
because of HIVE-8373. My
Confluence <https://cwiki.apache.org/confluence/signup.action> username is
kellyzly; please grant me the privilege to update the wiki.


Best Regards
Kelly Zhang/Zhang,Liyun



RE: When Hive on Spark will support Spark 2.0?

2016-12-07 Thread Joaquin Alzola
The version that will support Spark 2.0 is Hive 2.2.

It is not yet known when this is going to be released.

-Original Message-
From: baipeng [mailto:b...@meitu.com]
Sent: 07 December 2016 08:04
To: user@hive.apache.org
Subject: When Hive on Spark will support Spark 2.0?

Does anyone know when Hive will release a version that supports Spark 2.0? Right 
now Hive 2.1.0 only supports Spark 1.6.


When Hive on Spark will support Spark 2.0?

2016-12-07 Thread baipeng
Does anyone know when Hive will release a version that supports Spark 2.0? Right 
now Hive 2.1.0 only supports Spark 1.6.


RE: Hive on Spark not working

2016-11-29 Thread Joaquin Alzola
Being unable to integrate Hive with Spark separately, I just started the Thrift 
Server directly on Spark.
Now it is working as expected.
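
For readers following the same route, the workaround above amounts to running the
Thrift Server that ships with Spark and querying it over the HiveServer2 JDBC
protocol. A rough sketch; the host, port and master URL are placeholders taken
from this thread's setup:

$SPARK_HOME/sbin/start-thriftserver.sh --master spark://172.16.173.31:7077
beeline -u jdbc:hive2://localhost:10000 -e "select count(*) from employee"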

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: 29 November 2016 11:12
To: user 
Subject: Re: Hive on Spark not working

Hive on Spark engine only works with Spark 1.3.1.


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 29 November 2016 at 07:56, Furcy Pin 
mailto:furcy@flaminem.com>> wrote:
ClassNotFoundException generally means that jars are missing from your class 
path.

You probably need to link the spark jar to $HIVE_HOME/lib
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive

On Tue, Nov 29, 2016 at 2:03 AM, Joaquin Alzola 
mailto:joaquin.alz...@lebara.com>> wrote:
Hi Guys

No matter what I do, when I execute “select count(*) from employee” I get 
the following output in the logs:
It is quite funny because if I put hive.execution.engine=mr the output is 
correct. If I put hive.execution.engine=spark then I get the errors below.
If I run the query directly through spark-shell it works great.
+---+
|_c0|
+---+
|1005635|
+---+
So there has to be a problem from hive to spark.

It seems the RPC(??) connection is not set up. Can somebody guide me on what 
to look for?
spark.master=spark://172.16.173.31:7077<http://172.16.173.31:7077>
hive.execution.engine=spark
spark.executor.extraClassPath
/mnt/spark/lib/spark-1.6.2-yarn-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar

Hive2.0.1--> Spark 1.6.2 –> Hadoop – 2.6.5 --> Scala 2.10

2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher 
(RpcDispatcher.java:handleError(142)) - Received error 
message:io.netty.handler.codec.DecoderException: 
java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:358)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:230)
at 
io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.hiv

RE: Hive on Spark not working

2016-11-29 Thread Joaquin Alzola
Hi Mich,

I read in some older post that you made it work with the configuration I have 
as well:
Hive 2.0.1 --> Spark 1.6.2 --> Hadoop 2.6.5 --> Scala 2.10
Or did you only make it work with Hive 1.2.1 --> Spark 1.3.1 --> etc.?

BR

Joaquin

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: 29 November 2016 11:12
To: user 
Subject: Re: Hive on Spark not working

Hive on Spark engine only works with Spark 1.3.1.


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 29 November 2016 at 07:56, Furcy Pin 
mailto:furcy@flaminem.com>> wrote:
ClassNotFoundException generally means that jars are missing from your class 
path.

You probably need to link the spark jar to $HIVE_HOME/lib
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive

On Tue, Nov 29, 2016 at 2:03 AM, Joaquin Alzola 
mailto:joaquin.alz...@lebara.com>> wrote:
Hi Guys

No matter what I do, when I execute “select count(*) from employee” I get 
the following output in the logs:
It is quite funny because if I put hive.execution.engine=mr the output is 
correct. If I put hive.execution.engine=spark then I get the errors below.
If I run the query directly through spark-shell it works great.
+---+
|_c0|
+---+
|1005635|
+---+
So there has to be a problem from hive to spark.

It seems the RPC(??) connection is not set up. Can somebody guide me on what 
to look for?
spark.master=spark://172.16.173.31:7077<http://172.16.173.31:7077>
hive.execution.engine=spark
spark.executor.extraClassPath
/mnt/spark/lib/spark-1.6.2-yarn-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar

Hive2.0.1--> Spark 1.6.2 –> Hadoop – 2.6.5 --> Scala 2.10

2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher 
(RpcDispatcher.java:handleError(142)) - Received error 
message:io.netty.handler.codec.DecoderException: 
java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:358)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:230)
at 
io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at jav

Re: Hive on Spark not working

2016-11-29 Thread Mich Talebzadeh
Hive on Spark engine only works with Spark 1.3.1.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 November 2016 at 07:56, Furcy Pin  wrote:

> ClassNotFoundException generally means that jars are missing from your
> class path.
>
> You probably need to link the spark jar to $HIVE_HOME/lib
> https://cwiki.apache.org/confluence/display/Hive/Hive+
> on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive
>
> On Tue, Nov 29, 2016 at 2:03 AM, Joaquin Alzola  > wrote:
>
>> Hi Guys
>>
>>
>>
>> No matter what I do that when I execute “select count(*) from employee” I
>> get the following output on the logs:
>>
>> It is quiet funny because if I put hive.execution.engine=mr the output is
>> correct. If I put hive.execution.engine=spark then I get the bellow errors.
>>
>> If I do the search directly through spark-shell it work great.
>>
>> +---+
>>
>> |_c0|
>>
>> +---+
>>
>> |1005635|
>>
>> +---+
>>
>> So there has to be a problem from hive to spark.
>>
>>
>>
>> Seems as the RPC(??) connection is not setup …. Can somebody guide me on
>> what to look for.
>>
>> spark.master=spark://172.16.173.31:7077
>>
>> hive.execution.engine=spark
>>
>> spark.executor.extraClassPath/mnt/spark/lib/spark-1.6.2-yar
>> n-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar
>>
>>
>>
>> Hive 2.0.1 --> Spark 1.6.2 --> Hadoop 2.6.5 --> Scala 2.10
>>
>>
>>
>> 2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher
>> (RpcDispatcher.java:handleError(142)) - Received error
>> message:io.netty.handler.codec.DecoderException:
>> java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
>>
>> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>> oMessageDecoder.java:358)
>>
>> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>> ToMessageDecoder.java:230)
>>
>> at io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteTo
>> MessageCodec.java:103)
>>
>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:308)
>>
>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:294)
>>
>> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>> annelInboundHandlerAdapter.java:86)
>>
>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:308)
>>
>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:294)
>>
>> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>> ultChannelPipeline.java:846)
>>
>> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.
>> read(AbstractNioByteChannel.java:131)
>>
>> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEven
>> tLoop.java:511)
>>
>> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimiz
>> ed(NioEventLoop.java:468)
>>
>> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEve
>> ntLoop.java:382)
>>
>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>
>> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(
>> SingleThreadEventExecutor.java:111)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/J
>> ob
>>
>> at java.lang.ClassLoader.defineClass1(Native Method)
>>
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>>
>> at java.security.SecureClassLoader.defineClass(SecureClassLoade
>> r.java:142)
>>
>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>
>> at java.net.URLClassLoader.access$100(URLCl

Re: Hive on Spark not working

2016-11-28 Thread Furcy Pin
ClassNotFoundException generally means that jars are missing from your
class path.

You probably need to link the spark jar to $HIVE_HOME/lib
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive
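
A rough sketch of that wiring, with placeholder paths. For Hive releases before
2.2, the getting-started page describes making the Spark assembly jar (from a
Spark build without Hive jars) visible to Hive and telling Hive where Spark
lives; the jar name varies with the Spark build:

# placeholder paths; use the actual Spark install directory and assembly jar name
ln -s /mnt/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar $HIVE_HOME/lib/
# in hive-env.sh (or set spark.home in hive-site.xml / the Hive session)
export SPARK_HOME=/mnt/spark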

On Tue, Nov 29, 2016 at 2:03 AM, Joaquin Alzola 
wrote:

> Hi Guys
>
>
>
> No matter what I do that when I execute “select count(*) from employee” I
> get the following output on the logs:
>
> It is quiet funny because if I put hive.execution.engine=mr the output is
> correct. If I put hive.execution.engine=spark then I get the bellow errors.
>
> If I do the search directly through spark-shell it work great.
>
> +---+
>
> |_c0|
>
> +---+
>
> |1005635|
>
> +---+
>
> So there has to be a problem from hive to spark.
>
>
>
> Seems as the RPC(??) connection is not setup …. Can somebody guide me on
> what to look for.
>
> spark.master=spark://172.16.173.31:7077
>
> hive.execution.engine=spark
>
> spark.executor.extraClassPath/mnt/spark/lib/spark-1.6.2-
> yarn-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar
>
>
>
> Hive 2.0.1 --> Spark 1.6.2 --> Hadoop 2.6.5 --> Scala 2.10
>
>
>
> 2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher
> (RpcDispatcher.java:handleError(142)) - Received error
> message:io.netty.handler.codec.DecoderException: 
> java.lang.NoClassDefFoundError:
> org/apache/hive/spark/client/Job
>
> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(
> ByteToMessageDecoder.java:358)
>
> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(
> ByteToMessageDecoder.java:230)
>
> at io.netty.handler.codec.ByteToMessageCodec.channelRead(
> ByteToMessageCodec.java:103)
>
> at io.netty.channel.AbstractChannelHandlerContext.
> invokeChannelRead(AbstractChannelHandlerContext.java:308)
>
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:294)
>
> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(
> ChannelInboundHandlerAdapter.java:86)
>
> at io.netty.channel.AbstractChannelHandlerContext.
> invokeChannelRead(AbstractChannelHandlerContext.java:308)
>
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:294)
>
> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(
> DefaultChannelPipeline.java:846)
>
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(
> AbstractNioByteChannel.java:131)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(
> NioEventLoop.java:511)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(
> NioEventLoop.java:468)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(
> NioEventLoop.java:382)
>
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.
> run(SingleThreadEventExecutor.java:111)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/
> Job
>
> at java.lang.ClassLoader.defineClass1(Native Method)
>
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>
> at java.security.SecureClassLoader.defineClass(
> SecureClassLoader.java:142)
>
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at org.apache.hive.com.esotericsoftware.kryo.util.
> DefaultClassResolver.readName(DefaultClassResolver.java:154)
>
> at org.apache.hive.com.esotericsoftware.kryo.util.
> DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>
> at org.apache.hive.com.esotericsoftware.kryo.Kryo.
> readClass(Kryo.java:670)
>
>

Hive on Spark not working

2016-11-28 Thread Joaquin Alzola
Hi Guys

No matter what I do, when I execute "select count(*) from employee" I get 
the following output in the logs:
It is quite funny because if I put hive.execution.engine=mr the output is 
correct. If I put hive.execution.engine=spark then I get the errors below.
If I run the query directly through spark-shell it works great.
+---+
|_c0|
+---+
|1005635|
+---+
So there has to be a problem from hive to spark.

It seems the RPC(??) connection is not set up. Can somebody guide me on what 
to look for?
spark.master=spark://172.16.173.31:7077
hive.execution.engine=spark
spark.executor.extraClassPath
/mnt/spark/lib/spark-1.6.2-yarn-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar

Hive 2.0.1 --> Spark 1.6.2 --> Hadoop 2.6.5 --> Scala 2.10

2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher 
(RpcDispatcher.java:handleError(142)) - Received error 
message:io.netty.handler.codec.DecoderException: 
java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:358)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:230)
at 
io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:154)
at 
org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
at 
org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at 
org.apache.hive.spark.client.rpc.KryoMessageCodec.decode(KryoMessageCodec.java:97)
at 
io.netty.handler.codec.ByteToMessageCodec$1.decode(ByteToMessageCodec.java:42)
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:327)
... 15 more
Caused by: java.lang.ClassNotFoundException: org.apache.hive.spark.client.Job
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 39 more

Re: Hive on Spark - Mesos

2016-09-15 Thread Mich Talebzadeh
Sorry, on YARN only, but I gather it should work with Mesos. I don't think
that comes into it.

The issue is the compatibility of Spark assembly library with Hive.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 September 2016 at 22:41, John Omernik  wrote:

> Did you run it on Mesos? Your presentation doesn't mention Mesos at all...
>
> John
>
>
> On Thu, Sep 15, 2016 at 4:20 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Yes you can. Hive on Spark, meaning Hive using Spark as its execution
>> engine, works fine. The version that I managed to make work is any Hive
>> version > 1.2 with Spark 1.3.1.
>>
>> You  need to build Spark from the source code excluding Hive libraries.
>>
>> Check my attached presentation.
>>
>>  HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 15 September 2016 at 22:10, John Omernik  wrote:
>>
>>> Hey all, I was experimenting with some bleeding edge Hive.  (2.1) and
>>> trying to get it to run on bleeding edge Spark (2.0).
>>>
>>> Spark is working fine, I can query the data all is setup, however, I
>>> can't get Hive on Spark to work. I understand it's not really a thing (Hive
>>> on Spark on Mesos) but I am thinking... why not?  Thus I am posting here.
>>> (I.e. is there some reason why this shouldn't work other than it just
>>> hasn't been attempted?)
>>>
>>> The error I am getting is odd.. (see below) not sure why that would pop
>>> up, everything seems right other wise... any help would be appreciated.
>>>
>>> John
>>>
>>>
>>>
>>>
>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>>
>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>>>
>>> at java.security.SecureClassLoader.defineClass(SecureClassLoade
>>> r.java:142)
>>>
>>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>>
>>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>>
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>>
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>>
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>> at java.lang.Class.forName0(Native Method)
>>>
>>> at java.lang.Class.forName(Class.java:348)
>>>
>>> at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>>> $SparkSubmit$$runMain(SparkSubmit.scala:686)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>>> .scala:185)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>>>
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.JavaSparkListener
>>>
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>> ... 20 more
>>>
>>>
>>> at org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcS
>>> erver.java:179)
>>>
>>> at org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClie
>>> ntImpl.java:465)
>>>
>>
>>
>


Re: Hive on Spark - Mesos

2016-09-15 Thread John Omernik
Did you run it on Mesos? Your presentation doesn't mention Mesos at all...

John


On Thu, Sep 15, 2016 at 4:20 PM, Mich Talebzadeh 
wrote:

> Yes you can. Hive on Spark, meaning Hive using Spark as its execution
> engine, works fine. The version that I managed to make work is any Hive
> version > 1.2 with Spark 1.3.1.
>
> You  need to build Spark from the source code excluding Hive libraries.
>
> Check my attached presentation.
>
>  HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 15 September 2016 at 22:10, John Omernik  wrote:
>
>> Hey all, I was experimenting with some bleeding edge Hive.  (2.1) and
>> trying to get it to run on bleeding edge Spark (2.0).
>>
>> Spark is working fine, I can query the data all is setup, however, I
>> can't get Hive on Spark to work. I understand it's not really a thing (Hive
>> on Spark on Mesos) but I am thinking... why not?  Thus I am posting here.
>> (I.e. is there some reason why this shouldn't work other than it just
>> hasn't been attempted?)
>>
>> The error I am getting is odd.. (see below) not sure why that would pop
>> up, everything seems right other wise... any help would be appreciated.
>>
>> John
>>
>>
>>
>>
>> at java.lang.ClassLoader.defineClass1(Native Method)
>>
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>>
>> at java.security.SecureClassLoader.defineClass(SecureClassLoade
>> r.java:142)
>>
>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>
>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> at java.lang.Class.forName0(Native Method)
>>
>> at java.lang.Class.forName(Class.java:348)
>>
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
>>
>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>> $SparkSubmit$$runMain(SparkSubmit.scala:686)
>>
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>> .scala:185)
>>
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>>
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>>
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.JavaSparkListener
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> ... 20 more
>>
>>
>> at org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcS
>> erver.java:179)
>>
>> at org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClie
>> ntImpl.java:465)
>>
>
>


Hive on Spark - Mesos

2016-09-15 Thread John Omernik
Hey all, I was experimenting with some bleeding edge Hive (2.1) and
trying to get it to run on bleeding edge Spark (2.0).

Spark is working fine, I can query the data and everything is set up; however,
I can't get Hive on Spark to work. I understand it's not really a thing (Hive on
Spark on Mesos) but I am thinking... why not?  Thus I am posting here.
(I.e. is there some reason why this shouldn't work other than it just
hasn't been attempted?)

The error I am getting is odd (see below). I'm not sure why that would pop up;
everything seems right otherwise... any help would be appreciated.

John




at java.lang.ClassLoader.defineClass1(Native Method)

at java.lang.ClassLoader.defineClass(ClassLoader.java:760)

at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)

at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)

at java.net.URLClassLoader.access$100(URLClassLoader.java:73)

at java.net.URLClassLoader$1.run(URLClassLoader.java:368)

at java.net.URLClassLoader$1.run(URLClassLoader.java:362)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:361)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

at java.lang.Class.forName0(Native Method)

at java.lang.Class.forName(Class.java:348)

at org.apache.spark.util.Utils$.classForName(Utils.scala:225)

at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:686)

at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)

at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.ClassNotFoundException:
org.apache.spark.JavaSparkListener

at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 20 more


at
org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:179)

at
org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:465)


Re: Hive On Spark - ORC Table - Hive Streaming Mutation API

2016-09-14 Thread Benjamin Schaff
Hi,

Thanks for the answer.

I am running on a custom build of Spark 1.6.2, i.e. the one described in the
Hive documentation, built without the Hive jars.
I set it up in hive-env.sh.

I created the istari table as in the documentation and ran an INSERT on it,
then a GROUP BY.
Everything ran correctly on the Spark standalone cluster, with no exception anywhere.

Do you have any other suggestions?

Thanks.
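
One hedged guess worth ruling out, since hc__member is transactional: the
vectorized ORC reader and uncompacted ACID delta files have not always mixed
well, so it may be useful to check whether the error survives with vectorization
off and after a major compaction. Illustrative settings only, not a confirmed fix:

set hive.vectorized.execution.enabled=false;
-- the reading session also needs the ACID transaction manager enabled
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- rewrite the deltas into a base file so the query no longer reads raw deltas
alter table hc__member compact 'major';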

On Wed, 14 Sept 2016 at 13:55, Mich Talebzadeh 
wrote:

> Hi,
>
> You are using Hive 2. What is the Spark version that runs as Hive
> execution engine?
>
> I cannot see spark.home in your hive-site.xml so I cannot figure it out.
>
> BTW you are using Spark standalone as the mode. I tend to use yarn-client.
>
> Now back to the above issue. Do other queries work OK with Hive on Spark?
>
> Some of those perf parameters can be set up in Hive session itself or
> through init file
>
>  set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
> set spark.master=yarn;
> set spark.deploy.mode=client;
> set spark.executor.memory=8g;
> set spark.driver.memory=8g;
> set spark.executor.instances=6;
> set spark.ui.port=;
>
>
> HTH
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 September 2016 at 18:28, Benjamin Schaff 
> wrote:
>
>> Hi,
>>
>> After several days trying to figure out the problem I'm stuck with a
>> class cast exception when running a query with hive on spark on orc tables
>> that I updated with the streaming mutation api of hive 2.0.
>>
>> The context is the following:
>>
>> For hive:
>>
>> The version is the latest available from the website 2.1
>> I created some scala code to insert data into an orc table with the
>> streaming mutation api followed the example provided somewhere in the hive
>> repository.
>>
>> The table looks like that:
>>
>> ++--+
>> |   createtab_stmt   |
>> ++--+
>> | CREATE TABLE `hc__member`( |
>> |   `rdv_core__key` bigint,  |
>> |   `rdv_core__domainkey` string,|
>> |   `rdftypes` array,|
>> |   `rdv_org__firstname` string, |
>> |   `rdv_org__middlename` string,|
>> |   `rdv_org__lastname` string,  |
>> |   `rdv_org__gender` string,|
>> |   `rdv_org__city` string,  |
>> |   `rdv_org__state` string, |
>> |   `rdv_org__countrycode` string,   |
>> |   `rdv_org__addresslabel` string,  |
>> |   `rdv_org__zip` string)   |
>> | CLUSTERED BY ( |
>> |   rdv_core__key)   |
>> | INTO 24 BUCKETS|
>> | ROW FORMAT SERDE   |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'  |
>> | STORED AS INPUTFORMAT  |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'|
>> | OUTPUTFORMAT   |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'   |
>> | LOCATION   |
>> |   'hdfs://hmaster:8020/user/hive/warehouse/hc__member' |
>> | TBLPROPERTIES (|
>> |   'COLUMN_STATS_ACCURATE'

Re: Hive On Spark - ORC Table - Hive Streaming Mutation API

2016-09-14 Thread Mich Talebzadeh
Hi,

You are using Hive 2. What is the Spark version that runs as Hive execution
engine?

I cannot see spark.home in your hive-site.xml so I cannot figure it out.

BTW you are using Spark standalone as the mode. I tend to use yarn-client.

Now back to the above issue. Do other queries work OK with Hive on Spark?

Some of those perf parameters can be set up in Hive session itself or
through init file

 set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
set spark.master=yarn;
set spark.deploy.mode=client;
set spark.executor.memory=8g;
set spark.driver.memory=8g;
set spark.executor.instances=6;
set spark.ui.port=;


HTH








Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 18:28, Benjamin Schaff 
wrote:

> Hi,
>
> After several days trying to figure out the problem I'm stuck with a class
> cast exception when running a query with hive on spark on orc tables that I
> updated with the streaming mutation api of hive 2.0.
>
> The context is the following:
>
> For hive:
>
> The version is the latest available from the website 2.1
> I created some scala code to insert data into an orc table with the
> streaming mutation api followed the example provided somewhere in the hive
> repository.
>
> The table looks like that:
>
> ++--+
> |   createtab_stmt   |
> ++--+
> | CREATE TABLE `hc__member`( |
> |   `rdv_core__key` bigint,  |
> |   `rdv_core__domainkey` string,|
> |   `rdftypes` array,|
> |   `rdv_org__firstname` string, |
> |   `rdv_org__middlename` string,|
> |   `rdv_org__lastname` string,  |
> |   `rdv_org__gender` string,|
> |   `rdv_org__city` string,  |
> |   `rdv_org__state` string, |
> |   `rdv_org__countrycode` string,   |
> |   `rdv_org__addresslabel` string,  |
> |   `rdv_org__zip` string)   |
> | CLUSTERED BY ( |
> |   rdv_core__key)   |
> | INTO 24 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'|
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'   |
> | LOCATION   |
> |   'hdfs://hmaster:8020/user/hive/warehouse/hc__member' |
> | TBLPROPERTIES (|
> |   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',|
> |   'compactor.mapreduce.map.memory.mb'='2048',  |
> |   'compactorthreshold.hive.compactor.delta.num.threshold'='4', |
> |   'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',   |
> |   'numFiles'='0',  |
> |   'numRows'='0',   |
> |   'rawDataSize'='0',   |
> |   'totalSize'='0', |
> |   'transactional'='true',  |
> |   'transient_lastDdlTime'='14

Hive On Spark - ORC Table - Hive Streaming Mutation API

2016-09-14 Thread Benjamin Schaff
Hi,

After several days of trying to figure out the problem, I'm stuck with a class
cast exception when running a query with Hive on Spark against ORC tables that I
updated with the streaming mutation API of Hive 2.0.

The context is the following:

For Hive:

The version is the latest available from the website, 2.1.
I created some Scala code to insert data into an ORC table with the
streaming mutation API, following the example provided somewhere in the Hive
repository.

The table looks like that:

++--+
|   createtab_stmt   |
++--+
| CREATE TABLE `hc__member`( |
|   `rdv_core__key` bigint,  |
|   `rdv_core__domainkey` string,|
|   `rdftypes` array,|
|   `rdv_org__firstname` string, |
|   `rdv_org__middlename` string,|
|   `rdv_org__lastname` string,  |
|   `rdv_org__gender` string,|
|   `rdv_org__city` string,  |
|   `rdv_org__state` string, |
|   `rdv_org__countrycode` string,   |
|   `rdv_org__addresslabel` string,  |
|   `rdv_org__zip` string)   |
| CLUSTERED BY ( |
|   rdv_core__key)   |
| INTO 24 BUCKETS|
| ROW FORMAT SERDE   |
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'  |
| STORED AS INPUTFORMAT  |
|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'|
| OUTPUTFORMAT   |
|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'   |
| LOCATION   |
|   'hdfs://hmaster:8020/user/hive/warehouse/hc__member' |
| TBLPROPERTIES (|
|   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',|
|   'compactor.mapreduce.map.memory.mb'='2048',  |
|   'compactorthreshold.hive.compactor.delta.num.threshold'='4', |
|   'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',   |
|   'numFiles'='0',  |
|   'numRows'='0',   |
|   'rawDataSize'='0',   |
|   'totalSize'='0', |
|   'transactional'='true',  |
|   'transient_lastDdlTime'='1473792939')|
++--+

The hive site looks like that:


 
hive.execution.engine
spark
  
  
spark.master
spark://hmaster:7077
  
  
spark.eventLog.enabled
false
  
  
spark.executor.memory
12g
  
  
spark.serializer
org.apache.spark.serializer.KryoSerializer
  
  
mapreduce.input.fileinputformat.split.maxsize
75000
  
  
hive.vectorized.execution.enabled
true
  
  
hive.cbo.enable
true
  
  
hive.optimize.reducededuplication.min.reducer
4
  
  
hive.optimize.reducededuplication
true
  
  
hive.orc.splits.include.file.footer
false
  
  
hive.merge.mapfiles
true
  
  
hive.merge.sparkfiles
true
  
  
hive.merge.smallfiles.avgsize
1600
  
  
hive.merge.size.per.task
25600
  
  
hive.merge.orcfile.stripe.level
true
  
  
hive.auto.convert.join
true
  
  
hive.auto.convert.join.noconditionaltask
true
  
  
hive.auto.convert.join.noconditionaltask.size
894435328
  
  
hive.optimize.bucketmapjoin.sortedmerge
false
  
  
hive.map.aggr.hash.percentmemory
0.5
  
  
hive.map.aggr
true
  
  
hive.optimize.sort.dynamic.partition
false
  
  
hive.stats.autogather
true
  
  
hive.stats.fetch.column.stats
true
  
  
hive.vectorized.execution.reduce.enabled
false
  
  
hive.vectorized.groupby.checkinterval
4096
  
  
hive.vectorized.groupby.flush.p

Re: hive on spark job not start enough executors

2016-09-09 Thread 明浩 冯
All the parameters except spark.executor.instances are specified in 
spark-defaults.conf located in Hive's conf folder, so I think it's a yes.

I also checked Spark's web page while a Hive on Spark job was running; the 
parameters shown on the web page are exactly what I specified in the config 
file, including spark.shuffle.service.enabled and 
spark.dynamicAllocation.enabled.


Should I specify a fixed spark.executor.instances in the file? That is not a good 
option for me.


By the way, the data source of my query is Parquet files. On the Hive side I just 
created an external table over the Parquet files.



Thanks,

Minghao Feng


From: Mich Talebzadeh 
Sent: Friday, September 9, 2016 4:49:55 PM
To: user
Subject: Re: hive on spark job not start enough executors

when you start hive on spark do you set any parameters for the submitted job 
(or read them from init file)?

set spark.master=yarn;
set spark.deploy.mode=client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.instances=2;
set spark.ui.port=;


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 9 September 2016 at 09:30, 明浩 冯 
mailto:qiuff...@hotmail.com>> wrote:

Hi there,


I encountered a problem that makes hive on spark with a very low performance.

I'm using spark 1.6.2 and hive 2.1.0, I specified


spark.shuffle.service.enabled    true
spark.dynamicAllocation.enabled  true

in my spark-default.conf file (the file is in both spark and hive conf folder) 
to make spark job to get executors dynamically.
The configuration works correctly when I run spark jobs, but when I use hive on 
spark, it only started a few executors although there are more enough cores and 
memories to start more executors.
For example, for the same SQL query, if I run on sparkSQL, it can start more 
than 20 executors, but with hive on spark, only 3.

How can I improve the performance on hive on spark? Any suggestions please.

Thanks,
Minghao Feng




Re: hive on spark job not start enough executors

2016-09-09 Thread Mich Talebzadeh
when you start hive on spark do you set any parameters for the submitted
job (or read them from init file)?

set spark.master=yarn;
set spark.deploy.mode=client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.instances=2;
set spark.ui.port=;

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 September 2016 at 09:30, 明浩 冯  wrote:

> Hi there,
>
>
> I encountered a problem that makes hive on spark with a very low
> performance.
>
> I'm using spark 1.6.2 and hive 2.1.0, I specified
>
>
> spark.shuffle.service.enabled    true
> spark.dynamicAllocation.enabled  true
>
> in my spark-default.conf file (the file is in both spark and hive conf
> folder) to make spark job to get executors dynamically.
> The configuration works correctly when I run spark jobs, but when I use
> hive on spark, it only started a few executors although there are more
> enough cores and memories to start more executors.
> For example, for the same SQL query, if I run on sparkSQL, it can start
> more than 20 executors, but with hive on spark, only 3.
>
> How can I improve the performance on hive on spark? Any suggestions please.
>
> Thanks,
> Minghao Feng
>
>


hive on spark job not start enough executors

2016-09-09 Thread 明浩 冯
Hi there,


I encountered a problem that makes Hive on Spark perform very poorly.

I'm using Spark 1.6.2 and Hive 2.1.0, and I specified


spark.shuffle.service.enabled    true
spark.dynamicAllocation.enabled  true

in my spark-defaults.conf file (the file is in both the Spark and Hive conf folders) 
to make Spark jobs acquire executors dynamically.
The configuration works correctly when I run Spark jobs, but when I use Hive on 
Spark, it only starts a few executors although there are more than enough cores and 
memory to start more executors.
For example, for the same SQL query, if I run it on Spark SQL, it can start more 
than 20 executors, but with Hive on Spark, only 3.

How can I improve the performance of Hive on Spark? Any suggestions, please.

Thanks,
Minghao Feng
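
A hedged sketch of the knobs usually looked at when the Hive-launched Spark
application stays at a handful of executors even with dynamic allocation on.
The application only ramps up when it has a backlog of pending tasks, so the
initial/minimum executor counts and the stage parallelism both matter; the
values below are illustrative and would go in the spark-defaults.conf that Hive
reads, or be set in the Hive session:

set spark.dynamicAllocation.initialExecutors=10;
set spark.dynamicAllocation.minExecutors=4;
set spark.dynamicAllocation.maxExecutors=40;
-- Hive derives the number of reduce tasks from this; a smaller value means
-- more tasks per stage and therefore more demand for executors
set hive.exec.reducers.bytes.per.reducer=67108864;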



Re: Hive on spark

2016-08-01 Thread Mich Talebzadeh
Hi,

You can download the pdf from here
<https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf>

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 03:05, Chandrakanth Akkinepalli <
chandrakanth.akkinepa...@gmail.com> wrote:

> Hi Dr.Mich,
> Can you please share your London meetup presentation. Curious to see the
> comparison according to you of various query engines.
>
> Thanks,
> Chandra
>
> On Jul 28, 2016, at 12:13 AM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I made a presentation in London on 20th July on this subject:. In that I
> explained how to make Spark work as an execution engine for Hive.
>
> Query Engines for Hive, MR, Spark, Tez and LLAP – Considerations
> <http://www.meetup.com/futureofdata-london/events/232423292/>!
>
> See if I can send the presentation
>
> Cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 04:24, Mudit Kumar  wrote:
>
>> Yes Mich,exactly.
>>
>> Thanks,
>> Mudit
>>
>> From: Mich Talebzadeh 
>> Reply-To: 
>> Date: Thursday, July 28, 2016 at 1:08 AM
>> To: user 
>> Subject: Re: Hive on spark
>>
>> You mean you want to run Hive using Spark as the execution engine which
>> uses Yarn by default?
>>
>>
>> Something like below
>>
>> hive> select max(id) from oraclehadoop.dummy_parquet;
>> Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
>> Query Hive on Spark job[1] stages:
>> 2
>> 3
>> Status: Running (Hive on Spark job[1])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId:
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>> [StageCost]
>> 2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
>> 2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
>> 2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
>> Finished
>> Status: Finished successfully in 13.14 seconds
>> OK
>> 1
>> Time taken: 13.426 seconds, Fetched: 1 row(s)
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 July 2016 at 20:31, Mudit Kumar  wrote:
>>
>>> Hi All,
>>>
>>> I need to configure hive cluster based on spark engine (yarn).
>>> I already have a running hadoop cluster.
>>>
>>> Can someone point me to relevant documentation?
>>>
>>> TIA.
>>>
>>> Thanks,
>>> Mudit
>>>
>>
>>
>
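
For the original question in this thread (pointing an existing Hadoop/YARN
cluster at Hive with Spark as the execution engine), a minimal sketch of the
settings the Hive on Spark getting-started page walks through; the paths and
sizes below are placeholders, not recommendations:

set hive.execution.engine=spark;
set spark.master=yarn-client;
-- a Spark build without Hive jars, visible to the Hive CLI / HiveServer2
set spark.home=/usr/lib/spark;
set spark.executor.memory=4g;
set spark.executor.instances=4;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=hdfs:///spark-logs;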


Re: Hive on spark

2016-07-31 Thread Chandrakanth Akkinepalli
Hi Dr. Mich,
Can you please share your London meetup presentation? I am curious to see your 
comparison of the various query engines.

Thanks,
Chandra

> On Jul 28, 2016, at 12:13 AM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> I made a presentation in London on 20th July on this subject:. In that I 
> explained how to make Spark work as an execution engine for Hive.
> 
> Query Engines for Hive, MR, Spark, Tez and LLAP – Considerations!
> 
> See if I can send the presentation
> 
> Cheers
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 28 July 2016 at 04:24, Mudit Kumar  wrote:
>> Yes Mich,exactly.
>> 
>> Thanks,
>> Mudit
>> 
>> From: Mich Talebzadeh 
>> Reply-To: 
>> Date: Thursday, July 28, 2016 at 1:08 AM
>> To: user 
>> Subject: Re: Hive on spark
>> 
>> You mean you want to run Hive using Spark as the execution engine which uses 
>> Yarn by default?
>> 
>> 
>> Something like below
>> 
>> hive> select max(id) from oraclehadoop.dummy_parquet;
>> Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
>> Query Hive on Spark job[1] stages:
>> 2
>> 3
>> Status: Running (Hive on Spark job[1])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId: 
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>> [StageCost]
>> 2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
>> 2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
>> 2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
>> 2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1 
>> Finished
>> Status: Finished successfully in 13.14 seconds
>> OK
>> 1
>> Time taken: 13.426 seconds, Fetched: 1 row(s)
>> 
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 27 July 2016 at 20:31, Mudit Kumar  wrote:
>>> Hi All,
>>> 
>>> I need to configure hive cluster based on spark engine (yarn).
>>> I already have a running hadoop cluster.
>>> 
>>> Can someone point me to relevant documentation?
>>> 
>>> TIA.
>>> 
>>> Thanks,
>>> Mudit
> 


Re: Hive on spark

2016-07-28 Thread Mudit Kumar
Thanks Guys for the help!

Thanks,
Mudit

From:  Mich Talebzadeh 
Reply-To:  
Date:  Thursday, July 28, 2016 at 9:43 AM
To:  user 
Subject:  Re: Hive on spark

Hi,

I made a presentation in London on 20th July on this subject. In that I 
explained how to make Spark work as an execution engine for Hive.

Query Engines for Hive, MR, Spark, Tez and LLAP – Considerations! 

See if I can send the presentation 

Cheers


Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 

 

On 28 July 2016 at 04:24, Mudit Kumar  wrote:
Yes Mich,exactly.

Thanks,
Mudit

From:  Mich Talebzadeh 
Reply-To:  
Date:  Thursday, July 28, 2016 at 1:08 AM
To:  user 
Subject:  Re: Hive on spark

You mean you want to run Hive using Spark as the execution engine which uses 
Yarn by default?


Something like below

hive> select max(id) from oraclehadoop.dummy_parquet;
Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]
2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1 Finished
Status: Finished successfully in 13.14 seconds
OK
1
Time taken: 13.426 seconds, Fetched: 1 row(s)


HTH

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 

 

On 27 July 2016 at 20:31, Mudit Kumar  wrote:
Hi All,

I need to configure hive cluster based on spark engine (yarn).
I already have a running hadoop cluster.

Can someone point me to relevant documentation?

TIA.

Thanks,
Mudit





Re: Hive on spark

2016-07-27 Thread Mich Talebzadeh
Hi,

I made a presentation in London on 20th July on this subject. In that I
explained how to make Spark work as an execution engine for Hive.

Query Engines for Hive, MR, Spark, Tez and LLAP – Considerations
<http://www.meetup.com/futureofdata-london/events/232423292/>!

See if I can send the presentation

Cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 04:24, Mudit Kumar  wrote:

> Yes Mich,exactly.
>
> Thanks,
> Mudit
>
> From: Mich Talebzadeh 
> Reply-To: 
> Date: Thursday, July 28, 2016 at 1:08 AM
> To: user 
> Subject: Re: Hive on spark
>
> You mean you want to run Hive using Spark as the execution engine which
> uses Yarn by default?
>
>
> Something like below
>
> hive> select max(id) from oraclehadoop.dummy_parquet;
> Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
> 2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
> 2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
> 2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
> 2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
> 2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
> 2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
> 2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 13.14 seconds
> OK
> 1
> Time taken: 13.426 seconds, Fetched: 1 row(s)
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 July 2016 at 20:31, Mudit Kumar  wrote:
>
>> Hi All,
>>
>> I need to configure hive cluster based on spark engine (yarn).
>> I already have a running hadoop cluster.
>>
>> Can someone point me to relevant documentation?
>>
>> TIA.
>>
>> Thanks,
>> Mudit
>>
>
>


Re: Hive on spark

2016-07-27 Thread karthi keyan
mudit,

this link can guide you -
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

Thanks,
Karthik

On Thu, Jul 28, 2016 at 8:54 AM, Mudit Kumar  wrote:

> Yes Mich,exactly.
>
> Thanks,
> Mudit
>
> From: Mich Talebzadeh 
> Reply-To: 
> Date: Thursday, July 28, 2016 at 1:08 AM
> To: user 
> Subject: Re: Hive on spark
>
> You mean you want to run Hive using Spark as the execution engine which
> uses Yarn by default?
>
>
> Something like below
>
> hive> select max(id) from oraclehadoop.dummy_parquet;
> Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
> 2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
> 2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
> 2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
> 2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
> 2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
> 2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
> 2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 13.14 seconds
> OK
> 1
> Time taken: 13.426 seconds, Fetched: 1 row(s)
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 July 2016 at 20:31, Mudit Kumar  wrote:
>
>> Hi All,
>>
>> I need to configure hive cluster based on spark engine (yarn).
>> I already have a running hadoop cluster.
>>
>> Can someone point me to relevant documentation?
>>
>> TIA.
>>
>> Thanks,
>> Mudit
>>
>
>


Re: Hive on spark

2016-07-27 Thread Mudit Kumar
Yes Mich, exactly.

Thanks,
Mudit

From:  Mich Talebzadeh 
Reply-To:  
Date:  Thursday, July 28, 2016 at 1:08 AM
To:  user 
Subject:  Re: Hive on spark

You mean you want to run Hive using Spark as the execution engine which uses 
Yarn by default?


Something like below

hive> select max(id) from oraclehadoop.dummy_parquet;
Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]
2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1 Finished
Status: Finished successfully in 13.14 seconds
OK
1
Time taken: 13.426 seconds, Fetched: 1 row(s)


HTH

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 

 

On 27 July 2016 at 20:31, Mudit Kumar  wrote:
Hi All,

I need to configure hive cluster based on spark engine (yarn).
I already have a running hadoop cluster.

Can someone point me to relevant documentation?

TIA.

Thanks,
Mudit




Re: Hive on spark

2016-07-27 Thread Mich Talebzadeh
You mean you want to run Hive using Spark as the execution engine which
uses Yarn by default?


Something like below

hive> select max(id) from oraclehadoop.dummy_parquet;
Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24Stage-3_0: 0/1
2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24Stage-3_0: 0/1
2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
Finished
Status: Finished successfully in 13.14 seconds
OK
1
Time taken: 13.426 seconds, Fetched: 1 row(s)
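
For completeness, the session-level settings behind a run like the one above are
roughly as follows. This is a minimal sketch along the lines of the Hive on Spark
"Getting Started" wiki; the exact values (master, memory, executor counts) depend
on your cluster:

set hive.execution.engine=spark;
set spark.master=yarn-client;
set spark.executor.memory=2g;
set spark.executor.instances=4;
set spark.eventLog.enabled=true;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;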


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 July 2016 at 20:31, Mudit Kumar  wrote:

> Hi All,
>
> I need to configure hive cluster based on spark engine (yarn).
> I already have a running hadoop cluster.
>
> Can someone point me to relevant documentation?
>
> TIA.
>
> Thanks,
> Mudit
>


Hive on spark

2016-07-27 Thread Mudit Kumar
Hi All,

I need to configure hive cluster based on spark engine (yarn).
I already have a running hadoop cluster.

Can someone point me to relevant documentation?

TIA.

Thanks,
Mudit



Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-19 Thread Ashok Kumar
Thanks Mich looking forward to it :) 

On Tuesday, 19 July 2016, 19:13, Mich Talebzadeh 
 wrote:
 

 Hi all,
This will be in London tomorrow Wednesday 20th July starting at 18:00 hour for 
refreshments and kick off at 18:30, 5 minutes walk from Canary Wharf Station, 
Jubilee Line 
If you wish you can register and get more info here
It will be in La Tasca West India Docks Road E14 
and especially if you like Spanish food :)
Regards,



Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction.
On 15 July 2016 at 11:06, Joaquin Alzola  wrote:

It is on the 20th (Wednesday) next week.

From: Marco Mistroni [mailto:mmistr...@gmail.com]
Sent: 15 July 2016 11:04
To: Mich Talebzadeh 
Cc: user @spark ; user 
Subject: Re: Presentation in London: Running Spark on Hive or Hive on Spark

Dr Mich

do you have any slides or videos available for the presentation you did 
@Canary Wharf?
kindest regards
marco

On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh  wrote:

Dear forum members

I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, 
your mileage varies" in Future of Data: London

Details
Organized by: Hortonworks
Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM
Place: London
Location: One Canada Square, Canary Wharf, London E14 5AB.
Nearest Underground: Canary Wharf (map)

If you are interested please register here.

Looking forward to seeing those who can make it to have an interesting 
discussion and leverage your experience.

Regards,

Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction.
 This email is confidential and may be subject to privilege. If you are not the 
intended recipient, please do not copy or disclose its content but contact the 
sender immediately upon receipt.



  

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-19 Thread Mich Talebzadeh
Hi all,

This will be in London tomorrow Wednesday 20th July starting at 18:00 hour
for refreshments and kick off at 18:30, 5 minutes walk from Canary Wharf
Station, Jubilee Line

If you wish you can register and get more info here
<http://www.meetup.com/futureofdata-london/>

It will be in La Tasca West India Docks Road E14
<http://www.meetup.com/futureofdata-london/events/232423292/>

and especially if you like Spanish food :)

Regards,




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 July 2016 at 11:06, Joaquin Alzola  wrote:

> It is on the 20th (Wednesday) next week.
>
>
>
> *From:* Marco Mistroni [mailto:mmistr...@gmail.com]
> *Sent:* 15 July 2016 11:04
> *To:* Mich Talebzadeh 
> *Cc:* user @spark ; user 
> *Subject:* Re: Presentation in London: Running Spark on Hive or Hive on
> Spark
>
>
>
> Dr Mich
>
>   do you have any slides or videos available for the presentation you did
> @Canary Wharf?
>
> kindest regards
>
>  marco
>
>
>
> On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Dear forum members
>
>
>
> I will be presenting on the topic of "Running Spark on Hive or Hive on
> Spark, your mileage varies" in Future of Data: London
> <http://www.meetup.com/futureofdata-london/events/232423292/>
>
> *Details*
>
> *Organized by: Hortonworks <http://hortonworks.com/>*
>
> *Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM *
>
> *Place: London*
>
> *Location: One Canada Square, Canary Wharf,  London E14 5AB.*
>
> *Nearest Underground:  Canary Wharf (map
> <https://maps.google.com/maps?f=q&hl=en&q=One+Canada+Square%2C+Canary+Wharf%2C+E14+5AB%2C+London%2C+gb>)
> *
>
> If you are interested please register here
> <http://www.meetup.com/futureofdata-london/events/232423292/>
>
> Looking forward to seeing those who can make it to have an interesting
> discussion and leverage your experience.
>
> Regards,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>


Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-08 Thread mylisttech
Hi Mich,

Would it be on YouTube, post session?

- Harmeet



On Jul 7, 2016, at 3:07, Mich Talebzadeh  wrote:

> Dear forum members
> 
> I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, 
> your mileage varies" in Future of Data: London 
> 
> Details
> 
> Organized by: Hortonworks
> 
> Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM 
> 
> Place: London
> 
> Location: One Canada Square, Canary Wharf,  London E14 5AB.
> 
> Nearest Underground:  Canary Wharf (map)
> 
> If you are interested please register here
> 
> Looking forward to seeing those who can make it to have an interesting 
> discussion and leverage your experience.
> 
> Regards,
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-07 Thread Ashok Kumar
Thanks.
Will this presentation be recorded as well?
Regards 

On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh 
 wrote:
 

Dear forum members

I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, 
your mileage varies" in Future of Data: London

Details
Organized by: Hortonworks
Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM
Place: London
Location: One Canada Square, Canary Wharf, London E14 5AB.
Nearest Underground: Canary Wharf (map)

If you are interested please register here.

Looking forward to seeing those who can make it to have an interesting 
discussion and leverage your experience.

Regards,

Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction.

  

Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-06 Thread Mich Talebzadeh
Dear forum members

I will be presenting on the topic of "Running Spark on Hive or Hive on
Spark, your mileage varies" in Future of Data: London
<http://www.meetup.com/futureofdata-london/events/232423292/>

*Details*

*Organized by: Hortonworks <http://hortonworks.com/>*

*Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM *

*Place: London*

*Location: One Canada Square, Canary Wharf,  London E14 5AB.*

*Nearest Underground:  Canary Wharf (map
<https://maps.google.com/maps?f=q&hl=en&q=One+Canada+Square%2C+Canary+Wharf%2C+E14+5AB%2C+London%2C+gb>)
*

If you are interested please register here
<http://www.meetup.com/futureofdata-london/events/232423292/>

Looking forward to seeing those who can make it to have an interesting
discussion and leverage your experience.
Regards,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Hive on Spark issues with Hive-XML-Serde

2016-06-23 Thread yeshwanth kumar
Hi,

We are using Cloudera 5.7.0.

There is a use case to process XML data, and we are using
https://github.com/dvasilen/Hive-XML-SerDe.

The XML SerDe works when the Hive execution engine is MapReduce.

We enabled Hive on Spark to test the performance, and we are facing the
following issue:

16/06/23 12:47:45 INFO executor.CoarseGrainedExecutorBackend: Got
assigned task 3
16/06/23 12:47:45 INFO executor.Executor: Running task 0.3 in stage 0.0 (TID 3)
16/06/23 12:47:45 INFO rdd.HadoopRDD: Input split:
Paths:/tmp/STYN/data/1040_274316329.xml:0+7406,/tmp/STYN/data/1040__274316331.xml:0+7496InputFormatClass:
com.ibm.spss.hive.serde2.xml.XmlInputFormat

16/06/23 12:47:45 INFO exec.Utilities: PLAN PATH =
hdfs://devcdh/tmp/hive/yesh/c9554491-f58c-4472-b3c5-f47eb5722dd4/hive_2016-06-23_12-47-29_259_4396208623700590328-9/-mr-10003/c79302c5-6f16-4887-85b4-67e781a9ed97/map.xml
16/06/23 12:47:45 ERROR executor.Executor: Exception in task 0.3 in
stage 0.0 (TID 3)
java.io.IOException: java.lang.reflect.InvocationTargetException
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:265)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:212)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:332)
at 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:721)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:251)
... 18 more
Caused by: java.io.IOException: CombineHiveRecordReader: class not
found com.ibm.spss.hive.serde2.xml.XmlInputFormat
at 
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:55)
... 23 more


I did the following steps to ensure that the XML SerDe is on the Hive classpath:

   - configured the Hive aux jars path in Cloudera Manager
   - manually copied the jar to all the nodes

I am unable to figure out the issue here.

Any pointers would be a great help.


Thanks,
-Yeshwanth
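
A session-level work-around sometimes tried for this kind of class-not-found
error on the Spark engine is to add the SerDe jar explicitly in the Hive session
before switching engines; the jar path and table name below are hypothetical:

add jar /opt/serde/hivexmlserde-1.0.5.3.jar;   -- hypothetical local path to the Hive-XML-SerDe jar
set hive.execution.engine=spark;
select count(*) from xml_events;               -- hypothetical table using com.ibm.spss.hive.serde2.xml.XmlInputFormat

Whether this helps depends on the Hive/CDH version; the underlying requirement is
that the SerDe jar ends up on the classpath of the Spark executors, not only on
the node running the Hive CLI or HiveServer2.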


Re: Hive on Spark engine

2016-03-26 Thread Mich Talebzadeh
Thanks Ted,

I am more interested in the general availability of Hive 2 on the Spark 1.6
engine, as opposed to vendor-specific custom builds.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 26 March 2016 at 23:55, Ted Yu  wrote:

> According to:
>
> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_HDP_RelNotes/bk_HDP_RelNotes-20151221.pdf
>
> Spark 1.5.2 comes out of box.
>
> Suggest moving questions on HDP to Hortonworks forum.
>
> Cheers
>
> On Sat, Mar 26, 2016 at 3:32 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Jorn.
>>
>> Just to be clear they get Hive working with Spark 1.6 out of the box
>> (binary download)? The usual work-around is to build your own package and
>> get the Hadoop-assembly jar file copied over to $HIVE_HOME/lib.
>>
>>
>> Cheers
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 26 March 2016 at 22:08, Jörn Franke  wrote:
>>
>>> If you check the newest Hortonworks distribution then you see that it
>>> generally works. Maybe you can borrow some of their packages. Alternatively
>>> it should be also available in other distributions.
>>>
>>> On 26 Mar 2016, at 22:47, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> I am running Hive 2 and now Spark 1.6.1 but I still do not see any sign
>>> that Hive can utilise a Spark engine higher than 1.3.1
>>>
>>> My understanding was that there were miss-match on Hadoop assembly Jar
>>> files that cause Hive not being able to run on Spark using the binary
>>> downloads. I just tried Hive 2 on Spark 1.6 as the execution engine and it
>>> crashed.
>>>
>>> I do not know the development state of this cross-breed but will be very
>>> desirable if we could manage to sort out
>>> this spark-assembly-1.x.1-hadoop2.4.0.jar for once.
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
>>
>


Re: Hive on Spark engine

2016-03-26 Thread Ted Yu
According to:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_HDP_RelNotes/bk_HDP_RelNotes-20151221.pdf

Spark 1.5.2 comes out of box.

Suggest moving questions on HDP to Hortonworks forum.

Cheers

On Sat, Mar 26, 2016 at 3:32 PM, Mich Talebzadeh 
wrote:

> Thanks Jorn.
>
> Just to be clear they get Hive working with Spark 1.6 out of the box
> (binary download)? The usual work-around is to build your own package and
> get the Hadoop-assembly jar file copied over to $HIVE_HOME/lib.
>
>
> Cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 26 March 2016 at 22:08, Jörn Franke  wrote:
>
>> If you check the newest Hortonworks distribution then you see that it
>> generally works. Maybe you can borrow some of their packages. Alternatively
>> it should be also available in other distributions.
>>
>> On 26 Mar 2016, at 22:47, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I am running Hive 2 and now Spark 1.6.1 but I still do not see any sign
>> that Hive can utilise a Spark engine higher than 1.3.1
>>
>> My understanding was that there were miss-match on Hadoop assembly Jar
>> files that cause Hive not being able to run on Spark using the binary
>> downloads. I just tried Hive 2 on Spark 1.6 as the execution engine and it
>> crashed.
>>
>> I do not know the development state of this cross-breed but will be very
>> desirable if we could manage to sort out
>> this spark-assembly-1.x.1-hadoop2.4.0.jar for once.
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>


Re: Hive on Spark engine

2016-03-26 Thread Mich Talebzadeh
Thanks Jorn.

Just to be clear, do they get Hive working with Spark 1.6 out of the box
(binary download)? The usual work-around is to build your own Spark package and
copy the spark-assembly jar (built against your Hadoop version) over to $HIVE_HOME/lib.
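
As a rough sketch, that work-around usually looks something like the following
(an illustrative example only; the Spark version, Hadoop profile and assembly
path are assumptions, and the authoritative build flags are in the Hive on Spark
"Getting Started" wiki):

# build a Spark assembly that does not include the Hive jars, matching the cluster's Hadoop version
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
# make the resulting assembly visible to Hive (copy or symlink into $HIVE_HOME/lib)
cp assembly/target/scala-2.10/spark-assembly-1.6.1-hadoop2.4.0.jar $HIVE_HOME/lib/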


Cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 26 March 2016 at 22:08, Jörn Franke  wrote:

> If you check the newest Hortonworks distribution then you see that it
> generally works. Maybe you can borrow some of their packages. Alternatively
> it should be also available in other distributions.
>
> On 26 Mar 2016, at 22:47, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I am running Hive 2 and now Spark 1.6.1 but I still do not see any sign
> that Hive can utilise a Spark engine higher than 1.3.1
>
> My understanding was that there were miss-match on Hadoop assembly Jar
> files that cause Hive not being able to run on Spark using the binary
> downloads. I just tried Hive 2 on Spark 1.6 as the execution engine and it
> crashed.
>
> I do not know the development state of this cross-breed but will be very
> desirable if we could manage to sort out
> this spark-assembly-1.x.1-hadoop2.4.0.jar for once.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: Hive on Spark engine

2016-03-26 Thread Jörn Franke
If you check the newest Hortonworks distribution you will see that it generally 
works. Maybe you can borrow some of their packages. Alternatively, it should also 
be available in other distributions.

> On 26 Mar 2016, at 22:47, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I am running Hive 2 and now Spark 1.6.1 but I still do not see any sign that 
> Hive can utilise a Spark engine higher than 1.3.1
> 
> My understanding was that there were miss-match on Hadoop assembly Jar files 
> that cause Hive not being able to run on Spark using the binary downloads. I 
> just tried Hive 2 on Spark 1.6 as the execution engine and it crashed.
> 
> I do not know the development state of this cross-breed but will be very 
> desirable if we could manage to sort out this 
> spark-assembly-1.x.1-hadoop2.4.0.jar for once.
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  


Hive on Spark engine

2016-03-26 Thread Mich Talebzadeh
Hi,

I am running Hive 2 and now Spark 1.6.1, but I still do not see any sign
that Hive can utilise a Spark engine higher than 1.3.1.

My understanding was that there is a mismatch in the Hadoop assembly jar
files that prevents Hive from running on Spark using the binary
downloads. I just tried Hive 2 on Spark 1.6 as the execution engine and it
crashed.

I do not know the development state of this cross-breed, but it would be very
desirable if we could manage to sort out
this spark-assembly-1.x.1-hadoop2.4.0.jar for once.

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Re: Error in Hive on Spark

2016-03-22 Thread Stana
Hi Xuefu,

You are right.
Maybe I should launch spark-submit via HS2 or the Hive CLI?

Thanks a lot,
Stana
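
If the HS2 route is taken, the client side could reduce to something like the
following (host, port and database are placeholders; the table name is reused
from the earlier test query):

beeline -u jdbc:hive2://hs2-host:10000/default \
  -e "set hive.execution.engine=spark; select * from hadoop0263_0 a join hadoop0263_0 b on (a.key = b.key);"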


2016-03-22 1:16 GMT+08:00 Xuefu Zhang :

> Stana,
>
> I'm not sure if I fully understand the problem. spark-submit is launched in
> the same host as your application, which should be able to access
> hive-exec.jar. Yarn cluster needs the jar also, but HS2 or Hive CLI will
> take care of that. Since you are not using either of which, then, it's your
> application's responsibility to make that happen.
>
> Did I missed anything else?
>
> Thanks,
> Xuefu
>
> On Sun, Mar 20, 2016 at 11:18 PM, Stana  wrote:
>
> > Does anyone have suggestions in setting property of hive-exec-2.0.0.jar
> > path in application?
> > Something like
> >
> >
> 'hiveConf.set("hive.remote.driver.jar","hdfs://storm0:9000/tmp/spark-assembly-1.4.1-hadoop2.6.0.jar")'.
> >
> >
> >
> > 2016-03-11 10:53 GMT+08:00 Stana :
> >
> > > Thanks for reply
> > >
> > > I have set the property spark.home in my application. Otherwise the
> > > application threw 'SPARK_HOME not found exception'.
> > >
> > > I found hive source code in SparkClientImpl.java:
> > >
> > > private Thread startDriver(final RpcServer rpcServer, final String
> > > clientId, final String secret)
> > >   throws IOException {
> > > ...
> > >
> > > List<String> argv = Lists.newArrayList();
> > >
> > > ...
> > >
> > > argv.add("--class");
> > > argv.add(RemoteDriver.class.getName());
> > >
> > > String jar = "spark-internal";
> > > if (SparkContext.jarOfClass(this.getClass()).isDefined()) {
> > > jar = SparkContext.jarOfClass(this.getClass()).get();
> > > }
> > > argv.add(jar);
> > >
> > > ...
> > >
> > > }
> > >
> > > When hive executed spark-submit , it generate the shell command with
> > > --class org.apache.hive.spark.client.RemoteDriver ,and set jar path
> with
> > > SparkContext.jarOfClass(this.getClass()).get(). It will get the local
> > path
> > > of hive-exec-2.0.0.jar.
> > >
> > > In my situation, the application and yarn cluster are in different
> > cluster.
> > > When application executed spark-submit with local path of
> > > hive-exec-2.0.0.jar to yarn cluster, there 's no hive-exec-2.0.0.jar in
> > > yarn cluster. Then application threw the exception:
> "hive-exec-2.0.0.jar
> > >   does not exist ...".
> > >
> > > Can it be set property of hive-exec-2.0.0.jar path in application ?
> > > Something like 'hiveConf.set("hive.remote.driver.jar",
> > > "hdfs://storm0:9000/tmp/spark-assembly-1.4.1-hadoop2.6.0.jar")'.
> > > If not, is it possible to achieve in the future version?
> > >
> > >
> > >
> > >
> > > 2016-03-10 23:51 GMT+08:00 Xuefu Zhang :
> > >
> > >> You can probably avoid the problem by set environment variable
> > SPARK_HOME
> > >> or JVM property spark.home that points to your spark installation.
> > >>
> > >> --Xuefu
> > >>
> > >> On Thu, Mar 10, 2016 at 3:11 AM, Stana  wrote:
> > >>
> > >> >  I am trying out Hive on Spark with hive 2.0.0 and spark 1.4.1, and
> > >> > executing org.apache.hadoop.hive.ql.Driver with java application.
> > >> >
> > >> > Following are my situations:
> > >> > 1.Building spark 1.4.1 assembly jar without Hive .
> > >> > 2.Uploading the spark assembly jar to the hadoop cluster.
> > >> > 3.Executing the java application with eclipse IDE in my client
> > computer.
> > >> >
> > >> > The application went well and it submitted mr job to the yarn
> cluster
> > >> > successfully when using " hiveConf.set("hive.execution.engine",
> "mr")
> > >> > ",but it threw exceptions in spark-engine.
> > >> >
> > >> > Finally, i traced Hive source code and came to the conclusion:
> > >> >
> > >> > In my situation, SparkClientImpl class will generate the
> spark-submit
> > >> > shell and executed it.
> > >> > The shell command allocated  --class with
> RemoteDriver.class.getName()
> > >> > and jar with SparkContext.jarOfClass(this.getClass()).get(), so that
> > >> &

Re: Error in Hive on Spark

2016-03-20 Thread Stana
Does anyone have suggestions for setting the hive-exec-2.0.0.jar path as a
property in the application?
Something like
'hiveConf.set("hive.remote.driver.jar","hdfs://storm0:9000/tmp/spark-assembly-1.4.1-hadoop2.6.0.jar")'.



2016-03-11 10:53 GMT+08:00 Stana :

> Thanks for reply
>
> I have set the property spark.home in my application. Otherwise the
> application threw 'SPARK_HOME not found exception'.
>
> I found hive source code in SparkClientImpl.java:
>
> private Thread startDriver(final RpcServer rpcServer, final String
> clientId, final String secret)
>   throws IOException {
> ...
>
> List<String> argv = Lists.newArrayList();
>
> ...
>
> argv.add("--class");
> argv.add(RemoteDriver.class.getName());
>
> String jar = "spark-internal";
> if (SparkContext.jarOfClass(this.getClass()).isDefined()) {
> jar = SparkContext.jarOfClass(this.getClass()).get();
> }
> argv.add(jar);
>
> ...
>
> }
>
> When hive executed spark-submit , it generate the shell command with
> --class org.apache.hive.spark.client.RemoteDriver ,and set jar path with
> SparkContext.jarOfClass(this.getClass()).get(). It will get the local path
> of hive-exec-2.0.0.jar.
>
> In my situation, the application and yarn cluster are in different cluster.
> When application executed spark-submit with local path of
> hive-exec-2.0.0.jar to yarn cluster, there 's no hive-exec-2.0.0.jar in
> yarn cluster. Then application threw the exception: "hive-exec-2.0.0.jar
>   does not exist ...".
>
> Can it be set property of hive-exec-2.0.0.jar path in application ?
> Something like 'hiveConf.set("hive.remote.driver.jar",
> "hdfs://storm0:9000/tmp/spark-assembly-1.4.1-hadoop2.6.0.jar")'.
> If not, is it possible to achieve in the future version?
>
>
>
>
> 2016-03-10 23:51 GMT+08:00 Xuefu Zhang :
>
>> You can probably avoid the problem by set environment variable SPARK_HOME
>> or JVM property spark.home that points to your spark installation.
>>
>> --Xuefu
>>
>> On Thu, Mar 10, 2016 at 3:11 AM, Stana  wrote:
>>
>> >  I am trying out Hive on Spark with hive 2.0.0 and spark 1.4.1, and
>> > executing org.apache.hadoop.hive.ql.Driver with java application.
>> >
>> > Following are my situations:
>> > 1.Building spark 1.4.1 assembly jar without Hive .
>> > 2.Uploading the spark assembly jar to the hadoop cluster.
>> > 3.Executing the java application with eclipse IDE in my client computer.
>> >
>> > The application went well and it submitted mr job to the yarn cluster
>> > successfully when using " hiveConf.set("hive.execution.engine", "mr")
>> > ",but it threw exceptions in spark-engine.
>> >
>> > Finally, i traced Hive source code and came to the conclusion:
>> >
>> > In my situation, SparkClientImpl class will generate the spark-submit
>> > shell and executed it.
>> > The shell command allocated  --class with RemoteDriver.class.getName()
>> > and jar with SparkContext.jarOfClass(this.getClass()).get(), so that
>> > my application threw the exception.
>> >
>> > Is it right? And how can I do to execute the application with
>> > spark-engine successfully in my client computer ? Thanks a lot!
>> >
>> >
>> > Java application code:
>> >
>> > public class TestHiveDriver {
>> >
>> > private static HiveConf hiveConf;
>> > private static Driver driver;
>> > private static CliSessionState ss;
>> > public static void main(String[] args){
>> >
>> > String sql = "select * from hadoop0263_0 as a join
>> > hadoop0263_0 as b
>> > on (a.key = b.key)";
>> > ss = new CliSessionState(new
>> HiveConf(SessionState.class));
>> > hiveConf = new HiveConf(Driver.class);
>> > hiveConf.set("fs.default.name", "hdfs://storm0:9000");
>> > hiveConf.set("yarn.resourcemanager.address",
>> > "storm0:8032");
>> > hiveConf.set("yarn.resourcemanager.scheduler.address",
>> > "storm0:8030");
>> >
>> >
>> hiveConf.set("yarn.resourcemanager.resource-tracker.address","storm0:8031");
>> > hiveConf.set("yarn.resourcemanager.admin.address",
>> > "storm0:8033");
>> > hiveConf.set(

Re: Hive on Spark performance

2016-03-14 Thread sjayatheertha
Thanks for your response. We were evaluating Spark and were curious to know how 
it is used today and the lowest latency it can provide. 

> On Mar 14, 2016, at 8:37 AM, Mich Talebzadeh  
> wrote:
> 
> Hi Wlodeck,
> 
> Let us look at this.
> 
> In Oracle I have two tables channels and sales. This code works in Oracle
> 
>   1  select c.channel_id, sum(c.channel_id * (select count(1) from sales s 
> WHERE c.channel_id = s.channel_id)) As R
>   2  from channels c
>   3* group by c.channel_id
> s...@mydb.mich.LOCAL> /
> CHANNEL_ID  R
> -- --
>  2 516050
>  31620984
>  4 473664
>  5  0
>  9  18666
> 
> I have the same tables In Hive but the same query crashes!
> 
> hive> select c.channel_id, sum(c.channel_id * (select count(1) from sales s 
> WHERE c.channel_id = s.channel_id)) As R
> > from channels c
> > group by c.channel_id
> > ;
> NoViableAltException(232@[435:1: precedenceEqualExpression : ( ( LPAREN 
> precedenceBitwiseOrExpression COMMA )=> precedenceEqualExpressionMutiple | 
> precedenceEqualExpressionSingle );])
> 
> The solution is to use a temporary table to keep the sum/group by from sales 
> table as an intermediate stage  (temporary tables are session specific and 
> they are created and dropped after you finish the session)
> 
> hive> create temporary table tmp as select channel_id, count(channel_id) as 
> total from sales group by channel_id;
> 
> 
> Ok the rest is pretty easy
> 
> hive> select c.channel_id, c.channel_id * t.total as results
> > from channels c, tmp t
> > where c.channel_id = t.channel_id;
> 
> 2.0 2800432.0
> 3.0 8802300.0
> 4.0 2583552.0
> 9.0 104013.0
> 
> HTH
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 14 March 2016 at 14:22, ws  wrote:
>> Hive 1.2.1.2.3.4.0-3485
>> Spark 1.5.2
>> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
>> 
>> ### 
>> SELECT 
>>  f.description,
>>  f.item_number,
>>  sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = 
>> r.h_id)) as df_a
>> FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
>> where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND 
>> f.item_number NOT LIKE 'DEV%'
>> group by 
>>  f.description,
>>  f.item_number
>> ###
>> 
>> This query works fine in oracle but not Hive or Spark.
>> So the problem is: "sum(f.df_a * (select count(1) from e.mv_A_h_a where 
>> hb_h_name = r.h_id)) as df_a" field.
>> 
>> 
>> Thanks,
>> Wlodek
>> --
>> 
>> 
>> On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh 
>>  wrote:
>> 
>> 
>> Depending on the version of Hive on Spark engine.
>> 
>> As far as I am aware the latest version of Hive that I am using (Hive 2) has 
>> improvements compared to the previous versions of Hive (0.14,1.2.1) on Spark 
>> engine.
>> 
>> As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it is 
>> not the latest Spark but it is pretty good.
>> 
>> What specific concerns do you have in mind?
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>> On 13 March 2016 at 23:27, sjayatheertha  wrote:
>> Just curious if you could share your experience on the performance of spark 
>> in your company? How much data do you process? And what's the latency you 
>> are getting with spark engine?
>> 
>> Vidya
> 


Re: Hive on Spark performance

2016-03-14 Thread Mich Talebzadeh
Hi Wlodeck,

Let us look at this.

In Oracle I have two tables channels and sales. This code works in Oracle

  1  select c.channel_id, sum(c.channel_id * (select count(1) from sales s
WHERE c.channel_id = s.channel_id)) As R
  2  from channels c
  3* group by c.channel_id
s...@mydb.mich.LOCAL> /
CHANNEL_ID  R
-- --
 2 516050
 31620984
 4 473664
 5  0
 9  18666

I have the same tables In Hive but the same query crashes!

hive> select c.channel_id, sum(c.channel_id * (select count(1) from sales s
WHERE c.channel_id = s.channel_id)) As R
> from channels c
> group by c.channel_id
> ;
NoViableAltException(232@[435:1: precedenceEqualExpression : ( ( LPAREN
precedenceBitwiseOrExpression COMMA )=> precedenceEqualExpressionMutiple |
precedenceEqualExpressionSingle );])

The solution is to use a temporary table to keep the sum/group by from
sales table as an intermediate stage  (temporary tables are session
specific and they are created and dropped after you finish the session)

hive> create temporary table tmp as select channel_id, count(channel_id) as
total from sales group by channel_id;


Ok the rest is pretty easy

hive> select c.channel_id, c.channel_id * t.total as results
> from channels c, tmp t
> where c.channel_id = t.channel_id;

2.0 2800432.0
3.0 8802300.0
4.0 2583552.0
9.0 104013.0
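
Applied to Wlodek's original query (quoted further down), the same rewrite could
look roughly like this: an untested sketch that keeps the implicit cross join
between f and r from the comma syntax and treats missing counts as zero; table
and column names are as posted in the thread:

create temporary table tmp_h_counts as
select hb_h_name, count(1) as total
from e.mv_A_h_a
group by hb_h_name;

select f.description,
       f.item_number,
       sum(f.df_a * coalesce(t.total, 0)) as df_a
from e.eng_fac_atl_sc_bf_qty f
cross join wv_ATL_2_qty_df_rates r
left join tmp_h_counts t on t.hb_h_name = r.h_id
where f.item_number not like 'HR%'
  and f.item_number not like 'UG%'
  and f.item_number not like 'DEV%'
group by f.description, f.item_number;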

HTH







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 March 2016 at 14:22, ws  wrote:

> Hive 1.2.1.2.3.4.0-3485
> Spark 1.5.2
> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit
> Production
>
> ###
> SELECT
> f.description,
> f.item_number,
> sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id))
> as df_a
> FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
> where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND
> f.item_number NOT LIKE 'DEV%'
> group by
> f.description,
> f.item_number
> ###
>
> This query works fine in oracle but not Hive or Spark.
> So the problem is: "sum(f.df_a * (select count(1) from e.mv_A_h_a where
> hb_h_name = r.h_id)) as df_a" field.
>
>
> Thanks,
> Wlodek
> --
>
>
> On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Depending on the version of Hive on Spark engine.
>
> As far as I am aware the latest version of Hive that I am using (Hive 2)
> has improvements compared to the previous versions of Hive (0.14,1.2.1) on
> Spark engine.
>
> As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it
> is not the latest Spark but it is pretty good.
>
> What specific concerns do you have in mind?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
> http://talebzadehmich.wordpress.com
>
>
> On 13 March 2016 at 23:27, sjayatheertha  wrote:
>
> Just curious if you could share your experience on the performance of
> spark in your company? How much data do you process? And what's the latency
> you are getting with spark engine?
>
> Vidya
>
>
>
>
>


Re: Hive on Spark performance

2016-03-14 Thread ws
Hive 1.2.1.2.3.4.0-3485
Spark 1.5.2
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

###
SELECT
 f.description,
 f.item_number,
 sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id)) as df_a
FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND f.item_number NOT LIKE 'DEV%'
group by
 f.description,
 f.item_number
###

This query works fine in oracle but not Hive or Spark.
So the problem is the "sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id)) as df_a" field.

Thanks,
Wlodek
--

On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh 
 wrote:
 

 Depending on the version of Hive on Spark engine. 
As far as I am aware the latest version of Hive that I am using (Hive 2) has 
improvements compared to the previous versions of Hive (0.14,1.2.1) on Spark 
engine.
As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it is not 
the latest Spark but it is pretty good.
What specific concerns do you have in mind?
HTH

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 13 March 2016 at 23:27, sjayatheertha  wrote:

Just curious if you could share your experience on the performance of spark in 
your company? How much data do you process? And what's the latency you are 
getting with spark engine?

Vidya



  

Re: Hive on Spark performance

2016-03-13 Thread Mich Talebzadeh
It depends on the version of Hive on the Spark engine.

As far as I am aware the latest version of Hive that I am using (Hive 2)
has improvements compared to the previous versions of Hive (0.14,1.2.1) on
Spark engine.

As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it is
not the latest Spark but it is pretty good.

What specific concerns do you have in mind?

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 13 March 2016 at 23:27, sjayatheertha  wrote:

> Just curious if you could share your experience on the performance of
> spark in your company? How much data do you process? And what's the latency
> you are getting with spark engine?
>
> Vidya


Hive on Spark performance

2016-03-13 Thread sjayatheertha
Just curious: could you share your experience with the performance of Spark at 
your company? How much data do you process? And what latency are you getting 
with the Spark engine?

Vidya

Re: Error in Hive on Spark

2016-03-10 Thread Stana
Thanks for the reply.

I have set the property spark.home in my application. Otherwise the
application threw a 'SPARK_HOME not found' exception.

I found this Hive source code in SparkClientImpl.java:

private Thread startDriver(final RpcServer rpcServer, final String
clientId, final String secret)
  throws IOException {
...

List<String> argv = Lists.newArrayList();

...

argv.add("--class");
argv.add(RemoteDriver.class.getName());

String jar = "spark-internal";
if (SparkContext.jarOfClass(this.getClass()).isDefined()) {
jar = SparkContext.jarOfClass(this.getClass()).get();
}
argv.add(jar);

...

}

When Hive executes spark-submit, it generates the shell command with
--class org.apache.hive.spark.client.RemoteDriver and sets the jar path with
SparkContext.jarOfClass(this.getClass()).get(), which resolves to the local path
of hive-exec-2.0.0.jar.

In my situation, the application and the yarn cluster are in different clusters.
When the application executed spark-submit with the local path of
hive-exec-2.0.0.jar against the yarn cluster, there was no hive-exec-2.0.0.jar in
the yarn cluster. The application then threw the exception: "hive-exec-2.0.0.jar
  does not exist ...".

Can the hive-exec-2.0.0.jar path be set as a property in the application?
Something like 'hiveConf.set("hive.remote.driver.jar",
"hdfs://storm0:9000/tmp/spark-assembly-1.4.1-hadoop2.6.0.jar")'.
If not, is it possible to achieve this in a future version?



2016-03-10 23:51 GMT+08:00 Xuefu Zhang :

> You can probably avoid the problem by set environment variable SPARK_HOME
> or JVM property spark.home that points to your spark installation.
>
> --Xuefu
>
> On Thu, Mar 10, 2016 at 3:11 AM, Stana  wrote:
>
> >  I am trying out Hive on Spark with hive 2.0.0 and spark 1.4.1, and
> > executing org.apache.hadoop.hive.ql.Driver with java application.
> >
> > Following are my situations:
> > 1.Building spark 1.4.1 assembly jar without Hive .
> > 2.Uploading the spark assembly jar to the hadoop cluster.
> > 3.Executing the java application with eclipse IDE in my client computer.
> >
> > The application went well and it submitted mr job to the yarn cluster
> > successfully when using " hiveConf.set("hive.execution.engine", "mr")
> > ",but it threw exceptions in spark-engine.
> >
> > Finally, i traced Hive source code and came to the conclusion:
> >
> > In my situation, SparkClientImpl class will generate the spark-submit
> > shell and executed it.
> > The shell command allocated  --class with RemoteDriver.class.getName()
> > and jar with SparkContext.jarOfClass(this.getClass()).get(), so that
> > my application threw the exception.
> >
> > Is it right? And how can I do to execute the application with
> > spark-engine successfully in my client computer ? Thanks a lot!
> >
> >
> > Java application code:
> >
> > public class TestHiveDriver {
> >
> > private static HiveConf hiveConf;
> > private static Driver driver;
> > private static CliSessionState ss;
> > public static void main(String[] args){
> >
> > String sql = "select * from hadoop0263_0 as a join
> > hadoop0263_0 as b
> > on (a.key = b.key)";
> > ss = new CliSessionState(new
> HiveConf(SessionState.class));
> > hiveConf = new HiveConf(Driver.class);
> > hiveConf.set("fs.default.name", "hdfs://storm0:9000");
> > hiveConf.set("yarn.resourcemanager.address",
> > "storm0:8032");
> > hiveConf.set("yarn.resourcemanager.scheduler.address",
> > "storm0:8030");
> >
> >
> hiveConf.set("yarn.resourcemanager.resource-tracker.address","storm0:8031");
> > hiveConf.set("yarn.resourcemanager.admin.address",
> > "storm0:8033");
> > hiveConf.set("mapreduce.framework.name", "yarn");
> > hiveConf.set("mapreduce.johistory.address",
> > "storm0:10020");
> >
> >
> hiveConf.set("javax.jdo.option.ConnectionURL","jdbc:mysql://storm0:3306/stana_metastore");
> >
> >
> hiveConf.set("javax.jdo.option.ConnectionDriverName","com.mysql.jdbc.Driver");
> > hiveConf.set("javax.jdo.option.ConnectionUserName",
> > "root");
> > hiveConf.set("javax.jdo.option.ConnectionPassword",
> > "123456");
> > hiveConf.setBoolean("hive.auto.convert.join",f

Error in Hive on Spark

2016-03-10 Thread Stana
I am trying out Hive on Spark with Hive 2.0.0 and Spark 1.4.1, executing
org.apache.hadoop.hive.ql.Driver from a Java application.

My setup is as follows:
1. Build the Spark 1.4.1 assembly jar without Hive.
2. Upload the Spark assembly jar to the Hadoop cluster.
3. Run the Java application from the Eclipse IDE on my client computer.

The application worked and submitted the MR job to the YARN cluster
successfully when using " hiveConf.set("hive.execution.engine", "mr") ",
but it threw exceptions with the Spark engine.

Finally, I traced the Hive source code and came to this conclusion:

In my situation, the SparkClientImpl class generates the spark-submit
shell command and executes it.
The command sets --class to RemoteDriver.class.getName() and the jar to
SparkContext.jarOfClass(this.getClass()).get(), which is why my
application threw the exception.

Is that right? And what can I do to run the application with the
Spark engine successfully from my client computer? Thanks a lot!


Java application code:

public class TestHiveDriver {

private static HiveConf hiveConf;
private static Driver driver;
private static CliSessionState ss;
public static void main(String[] args){

String sql = "select * from hadoop0263_0 as a join hadoop0263_0 
as b
on (a.key = b.key)";
ss = new CliSessionState(new HiveConf(SessionState.class));
hiveConf = new HiveConf(Driver.class);
hiveConf.set("fs.default.name", "hdfs://storm0:9000");
hiveConf.set("yarn.resourcemanager.address", "storm0:8032");
hiveConf.set("yarn.resourcemanager.scheduler.address", 
"storm0:8030");

hiveConf.set("yarn.resourcemanager.resource-tracker.address","storm0:8031");
hiveConf.set("yarn.resourcemanager.admin.address", 
"storm0:8033");
hiveConf.set("mapreduce.framework.name", "yarn");
hiveConf.set("mapreduce.johistory.address", "storm0:10020");

hiveConf.set("javax.jdo.option.ConnectionURL","jdbc:mysql://storm0:3306/stana_metastore");

hiveConf.set("javax.jdo.option.ConnectionDriverName","com.mysql.jdbc.Driver");
hiveConf.set("javax.jdo.option.ConnectionUserName", "root");
hiveConf.set("javax.jdo.option.ConnectionPassword", "123456");
hiveConf.setBoolean("hive.auto.convert.join",false);
hiveConf.set("spark.yarn.jar",
"hdfs://storm0:9000/tmp/spark-assembly-1.4.1-hadoop2.6.0.jar");
hiveConf.set("spark.home","target/spark");
hiveConf.set("hive.execution.engine", "spark");
hiveConf.set("hive.dbname", "default");


driver = new Driver(hiveConf);
SessionState.start(hiveConf);

CommandProcessorResponse res = null;
try {
res = driver.run(sql);
} catch (CommandNeedRetryException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

System.out.println("Response Code:" + res.getResponseCode());
System.out.println("Error Message:" + res.getErrorMessage());
System.out.println("SQL State:" + res.getSQLState());

}
}




Exception from the Spark engine:

16/03/10 18:32:58 INFO SparkClientImpl: Running client driver with
argv: 
/Volumes/Sdhd/Documents/project/island/java/apache/hive-200-test/hive-release-2.0.0/itests/hive-unit/target/spark/bin/spark-submit
--properties-file
/var/folders/vt/cjcdhms903x7brn1kbh558s4gn/T/spark-submit.7697089826296920539.properties
--class org.apache.hive.spark.client.RemoteDriver
/Users/stana/.m2/repository/org/apache/hive/hive-exec/2.0.0/hive-exec-2.0.0.jar
--remote-host MacBook-Pro.local --remote-port 51331 --conf
hive.spark.client.connect.timeout=1000 --conf
hive.spark.client.server.connect.timeout=9 --conf
hive.spark.client.channel.log.level=null --conf
hive.spark.client.rpc.max.size=52428800 --conf
hive.spark.client.rpc.threads=8 --conf
hive.spark.client.secret.bits=256
16/03/10 18:33:09 INFO SparkClientImpl: 16/03/10 18:33:09 INFO Client:
16/03/10 18:33:09 INFO SparkClientImpl:  client token: N/A
16/03/10 18:33:09 INFO SparkClientImpl:  diagnostics: N/A
16/03/10 18:33:09 INFO SparkClientImpl:  ApplicationMaster host: N/A
16/03/10 18:33:09 INFO SparkClientImpl:  ApplicationMaster RPC port: -1
16/03/10 18:33:09 INFO SparkClientImpl:  queue: default
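
For anyone hitting the same startup failure, here is a minimal sketch of the
workaround Xuefu suggests in the reply above: point the client JVM at a local
Spark installation before the Driver is created. The installation path is only
an example, and the remaining configuration is elided:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class SparkHomeExample {
    public static void main(String[] args) {
        // Example path only; alternatively export SPARK_HOME=/opt/spark-1.4.1-bin-hadoop2.6
        // in the environment before launching the application.
        System.setProperty("spark.home", "/opt/spark-1.4.1-bin-hadoop2.6");

        HiveConf hiveConf = new HiveConf(Driver.class);
        hiveConf.set("hive.execution.engine", "spark");
        // ... cluster, metastore and spark.yarn.jar settings as in the code above ...

        SessionState.start(hiveConf);
        Driver driver = new Driver(hiveConf);
        // driver.run("select ...");
    }
}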

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Elliot West
Related to this and for the benefit of anyone who is using Hive: The issues
around testing and some possible approaches are summarised here:

https://cwiki.apache.org/confluence/display/Hive/Unit+testing+HQL


Ultimately there are no elegant solutions to the limitations correctly
described by Koert. However, if you do choose to use Hive, please be aware
that there are some good options out there for providing reasonable test
coverage of your production code. They aren't perfect by any means and are
certainly not at the level we've come to expect in other development
domains, but they are usable, and therefore there is no excuse for not
writing tests! :-)

Elliot.


On 3 February 2016 at 04:49, Koert Kuipers  wrote:

> yeah but have you ever seen someone write a real analytical program in
> hive? how? where are the basic abstractions to wrap up a large amount of
> operations (joins, groupby's) into a single function call? where are the
> tools to write nice unit tests for that?
>
> for example in spark i can write a DataFrame => DataFrame function that
> internally does many joins, groupBys and complex operations, all unit tested
> and perfectly re-usable. and in hive? copy-pasting sql queries around? that's
> just dangerous.
>
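
As a concrete aside, here is a minimal sketch of the kind of reusable
DataFrame-to-DataFrame function described above, written against the Spark Java
API; the table and column names are invented for illustration, and in the
Spark 1.x of this thread the type would be DataFrame rather than Dataset<Row>:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.sum;

public final class RevenueTransforms {
    private RevenueTransforms() {}

    // One reusable call that hides a join and an aggregation; it can be unit
    // tested against tiny in-memory Datasets without touching a cluster.
    public static Dataset<Row> revenuePerCustomer(Dataset<Row> orders, Dataset<Row> customers) {
        return orders
                .join(customers, orders.col("customer_id").equalTo(customers.col("id")))
                .groupBy(customers.col("id"), customers.col("name"))
                .agg(sum(orders.col("amount")).alias("revenue"));
    }
}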
> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
> wrote:
>
>> Hive has numerous extension points, you are not boxed in by a long shot.
>>
>>
>> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>>
>>> uuuhm with spark using Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just being
>>> boxed into some version of sql and limited udfs?
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>>>
>>>> When comparing performance, you need to compare apples to apples. In
>>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>>> SQL. However, you configured Hive such that only two tasks can run in
>>>> parallel, and you didn't provide information on how many resources Spark
>>>> SQL was using. Thus, it's hard to tell whether it's just a configuration
>>>> problem in your Hive setup or Spark SQL is indeed faster. You should be
>>>> able to see the resource usage in the YARN resource manager URL.
>>>>
>>>> --Xuefu
>>>>
>>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
>>>> wrote:
>>>>
>>>>> Thanks Jeff.
>>>>>
>>>>>
>>>>>
>>>>> Obviously Hive is much more feature-rich than Spark. Having said that, in
>>>>> certain areas, for example where the same SQL feature is available in
>>>>> Spark, Spark seems to deliver results faster.
>>>>>
>>>>>
>>>>>
>>>>> This may be:
>>>>>
>>>>>
>>>>>
>>>>> 1. Spark does both the optimisation and the execution seamlessly
>>>>>
>>>>> 2. Hive on Spark has to invoke YARN, which adds another layer to the
>>>>> process
>>>>>
>>>>>
>>>>>
>>>>> Now I did some simple tests on a 100Million rows ORC table available
>>>>> through Hive to both.
>>>>>
>>>>>
>>>>>
>>>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>>>
>>>>> 1   0   0   63
>>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>>>> xx
>>>>>
>>>>> 5   0   4   31
>>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>>>> xx
>>>>>
>>>>> 10  99  999 188
>>>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>>>> xx
>>>>>
>>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>>>
>>>>> 1   0   0   63
>>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>>>> xx
>>>>>
>>>>> 5   0   4   31
>>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>>>> xx

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Mich Talebzadeh
Hi Edward,

 

There is another angle to it as well. Fit for purpose.

 

We are currently migrating from a proprietary DW on SAN to Hive on JBOD. It is 
going smoothly, and it will save us $$ in licensing fees at a time when 
technology and storage dollars are at a premium.

 

Our DBAs who look after Oracle, SAP ASE and others are comfortable with Hive. 
They can look after the metastore (on Oracle) and work with me on HA for the 
metastore and HiveServer2, in line with the standard for other databases.

 

I am sure that if we had started with Spark, that would have worked too, but what 
the heck. We have MongoDB as well, independent of HDFS.

 

These arguments about what is better or worse are the same ones we have had for 
years about Oracle, Sybase, MSSQL, etc. I believe Hive is better for us because I 
think in Hive. If I were more familiar with Spark, I am sure the opposite would 
have been true.

 

We can go in circles. Religious arguments really.

 

 

HTH,

 

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

 

http://talebzadehmich.wordpress.com

 


 

From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: 04 February 2016 17:41
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

Hive is not the correct tool for every problem. Use the tool that makes the 
most sense for your problem and your experience. 

 

Many people like hive because it is generally applicable. In my case study for 
the hive book I highlighted many smart, capable organizations that use hive. 

Your argument is totally valid. You like X better because X works for you. You 
don't need to 'preach' here; we all know hive has its limits. 

 

On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers <ko...@tresata.com> wrote:

Is the sky the limit? I know udfs can be used inside hive, basically like 
lambdas I assume, and I will assume you have something similar for aggregations. 
But those are just abstractions inside a single map or reduce phase, pretty 
low-level stuff. What you really need is abstractions around many map and reduce 
phases, because that is the level an algo is expressed at.

For example, when doing logistic regression you want to be able to do something 
like:
read("somefile").train(settings).write("model")
Here train is an externally defined method that is well tested and could do many 
map and reduce steps internally (or even be defined at a higher level and 
compile into those steps). What is the equivalent in hive? Copy-pasting crucial 
parts of the algo around while using udfs is just not the same thing in terms 
of reusability and abstraction. It's the opposite of keeping it DRY.
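
A rough sketch of the kind of wrapper being described, using the (later) Spark
ML Java API; the paths, the parquet input format and the presence of
"features"/"label" columns are assumptions made for illustration:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class TrainJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("train-example").getOrCreate();

        // Assumes a parquet file with a "features" vector column and a "label" column.
        Dataset<Row> training = spark.read().parquet("hdfs:///tmp/training.parquet");

        // The "train(settings)" step: one well-tested, reusable call that expands
        // into many shuffle stages internally.
        LogisticRegressionModel model = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.01)
                .fit(training);

        model.write().overwrite().save("hdfs:///tmp/model");
        spark.stop();
    }
}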

On Feb 3, 2016 1:06 AM, "Ryan Harris" mailto:ryan.har...@zionsbancorp.com> > wrote:

https://github.com/myui/hivemall

 

as long as you are comfortable with java UDFs, the sky is really the 
limit...it's not for everyone and spark does have many advantages, but they are 
two tools that can complement each other in numerous ways.

 

I don't know that there is necessarily a universal "better" for how to use 
spark as an execution engine (or if spark is necessarily the *best* execution 
engine for any given hive job).

 

The reality is that once you start factoring in the numerous tuning parameters 
of the systems and jobs there probably isn't a clear answer.  For some queries, 
the Catalyst optimizer may do a better job...is it going to do a better job 
with ORC based data? less likely IMO. 
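
To make the "java UDFs" point above concrete, here is a minimal sketch of a
classic Hive UDF; the function name and behaviour are invented for illustration:

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(name = "normalize_ws",
    value = "_FUNC_(str) - trims str and collapses runs of whitespace into single spaces")
public final class NormalizeWhitespaceUDF extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    return new Text(input.toString().trim().replaceAll("\\s+", " "));
  }
}

Once packaged into a jar it would be registered with something like
ADD JAR /path/to/udfs.jar; CREATE TEMPORARY FUNCTION normalize_ws AS
'NormalizeWhitespaceUDF'; and then used like any built-in function.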

 

From: Koert Kuipers [mailto:ko...@tresata.com] 
Sent: Tuesday, February 02, 2016 9:50 PM
To: user@hive.apache.org
