RE: Mappers spawning Hive queries

2016-04-18 Thread Ryan Harris
I'm not aware of any particular reason that this shouldn't "inherently" work, 
but for debugging purposes I'd be wondering about the nested environment 
variables related to the Hadoop job. The bash shell where you are trying to 
launch the subsequent hive queries already has Hadoop job environment 
variables declared, inherited from the parent streaming job. I can't say for 
sure that there wouldn't be conflicts there. So while I don't know of any 
reason that it definitely won't work, I know that you are venturing into 
uncharted territory and you may uncover unexpected edge cases.
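
One way to probe the environment-variable concern above is to start the nested
hive process with a reduced environment and see whether the behavior changes.
The following is only a sketch under assumptions: Hadoop streaming exposes job
configuration to the mapper as environment variables with dots replaced by
underscores, but the prefix list here is a guess at what might conflict, and
`child_env` and `run_hive` are hypothetical helper names.

```python
import os
import subprocess

# Assumed prefixes of variables inherited from the parent streaming task.
# Which of these (if any) actually cause trouble is not established.
SUSPECT_PREFIXES = ("mapreduce_", "mapred_")


def child_env(parent_env, suspect_prefixes=SUSPECT_PREFIXES):
    """Return a copy of parent_env without the suspect job variables."""
    return {
        name: value
        for name, value in parent_env.items()
        if not name.startswith(suspect_prefixes)
    }


def run_hive(query, env=None):
    """Run `hive -e <query>` with the scrubbed environment (needs hive on PATH)."""
    clean = child_env(os.environ if env is None else env)
    return subprocess.run(["hive", "-e", query], env=clean, check=False)
```

Comparing a run with and without the scrubbed environment would at least show
whether the inherited variables matter.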



Re: Mappers spawning Hive queries

2016-04-18 Thread Shirish Tatikonda
I am using Hive 1.2.1 with MR backend.

Ryan, I hear you. I totally agree. This is not the best approach, and I am
in fact restructuring the approach.

However, I would like to understand what is going on. In my test run, each
hive query is computing *distinct* on a toy table of 10 records -- so, we
are definitely not running into problems like resource contention. Also, I
increased the (streaming) mappers' task timeout value (to 1hr) to give the
shell script (i.e., the hive query) ample time to finish. So,
architecturally, is there something that prevents us from spawning such
nested MR jobs -- a streaming MR job spawning multiple hive queries that in
turn spawn MR jobs?

Shirish
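
For reference, a task timeout like the one mentioned above is usually passed to
the streaming job as a generic `-D` option. The sketch below assumes the
Hadoop 2 property name `mapreduce.task.timeout`, which takes milliseconds;
`task_timeout_option` is a hypothetical helper name, and the thread does not
say how the timeout was actually set.

```python
def task_timeout_option(hours):
    """Build the generic -D option for a task timeout given in hours."""
    millis = int(hours * 60 * 60 * 1000)  # the property is in milliseconds
    return ["-D", "mapreduce.task.timeout={}".format(millis)]
```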




RE: Mappers spawning Hive queries

2016-04-18 Thread Ryan Harris
My $0.02

If you are running multiple concurrent queries on the data, you are probably 
doing it wrong (or at least inefficiently), although this somewhat depends on 
what type of files are backing your hive warehouse...

Let's assume that your data is NOT backed by ORC/parquet files, and that you 
are NOT using Tez/Spark as your execution engine.

Generally with HDFS, data I/O is going to be the slowest piece, so, with your 
workflow, each hive query is going to need to read ALL of the source data to 
resolve the query.  It would be much more efficient if you could write a more 
complex query that could read the source data 1 time (instead of however many 
parallel operations you are running). Additionally, from an efficiency 
perspective, running queries in parallel might only help improve performance if 
each of your queries requires fewer map tasks than the total capacity of your 
cluster; otherwise it would generally be more efficient to run your queries 
in series.
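
To make the "read the source data 1 time" point concrete, here is a rough
sketch that contrasts one query per entity with a single grouped query. The
table name, the `entity_id` column and both helper functions are hypothetical;
the thread does not describe the real schema, and a single query only applies
if the entities live in one table.

```python
def per_entity_queries(table, column, entities):
    """One query per entity: every query scans the source table again."""
    template = "SELECT COUNT(DISTINCT {c}) FROM {t} WHERE entity_id = '{e}'"
    return [template.format(c=column, t=table, e=e) for e in entities]


def single_pass_query(table, column):
    """One grouped query: the source table is scanned once for all entities."""
    return (
        "SELECT entity_id, COUNT(DISTINCT {c}) AS n_distinct "
        "FROM {t} GROUP BY entity_id"
    ).format(c=column, t=table)
```

The string formatting is for illustration only and does not escape its inputs.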

If you schedule the work in series, and things get backed up, the job will 
still eventually complete.  If you attempt to do TOO much work in parallel, all 
of the jobs will start timing out and then everything will fail.
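
The series-versus-parallel trade-off can be sketched with a small driver that
caps how many queries are in flight. This is an illustration under
assumptions, not a recommendation from the thread: it runs from a single
client process rather than from inside mappers, and `run_hive_query` and
`run_all` are hypothetical names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_hive_query(query):
    """Run one query with `hive -e` and return its exit code (needs hive on PATH)."""
    return subprocess.run(["hive", "-e", query], check=False).returncode


def run_all(queries, max_parallel=1, runner=run_hive_query):
    """Run queries with at most max_parallel in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(runner, queries))
```

With `max_parallel=1` the queries run in series; raising it only makes sense
while the combined map tasks still fit in the cluster.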

There may be a valid reason for approaching the problem the way that you are, 
but I'd encourage you to look at restructuring your approach to the problem to 
save you more headaches down the road.





Re: Mappers spawning Hive queries

2016-04-18 Thread Mich Talebzadeh
What is the version of Hive and the execution engine (MR, Tez, Spark)?

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com





Re: Mappers spawning Hive queries

2016-04-18 Thread Shirish Tatikonda
Hi Jörn,

2) The shell script is invoked in the mappers of a Hadoop streaming job.

1) The use case is that I have to process multiple entities in parallel.
Each entity is associated with its own data set. The processing involves a
few hive queries to do joins and aggregations, which is followed by some
code in Python. My thought process is to put the hive queries and python
invocation in a shell script, and invoke the shell script on multiple
entities in parallel through a streaming mapreduce job.

Shirish




Re: Mappers spawning Hive queries

2016-04-16 Thread Jörn Franke
Just out of curiosity, what is the use case behind this?

How do you call the shell script?



Mappers spawning Hive queries

2016-04-15 Thread Shirish Tatikonda
Hello,

I am trying to run multiple hive queries in parallel by submitting them
through a map-reduce job.
More specifically, I have a map-only hadoop streaming job where each mapper
runs a shell script that does two things -- 1) parses input lines obtained
via streaming; and 2) submits a very simple hive query (via hive -e ...)
with parameters computed from step-1.
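
For readers who want to reproduce the setup, the mapper described above might
look roughly like the sketch below, written in Python rather than shell. The
input format (tab-separated key and table name), the query and all function
names are assumptions; the original script was not posted.

```python
import subprocess
import sys

# Hypothetical query; the column name is a placeholder and the table name
# is filled in through Hive variable substitution.
QUERY = "SELECT COUNT(DISTINCT val) FROM ${hivevar:tbl}"


def parse_line(line):
    """Step 1: split a tab-separated input line into (key, table name)."""
    key, table = line.rstrip("\n").split("\t", 1)
    return key, table


def hive_command(table):
    """Step 2: build the `hive -e` command line for one table."""
    return ["hive", "--hivevar", "tbl=" + table, "-e", QUERY]


def run_mapper(lines, out=sys.stdout):
    """Run one hive query per non-empty input line (needs hive on PATH)."""
    for line in lines:
        if not line.strip():
            continue
        key, table = parse_line(line)
        done = subprocess.run(
            hive_command(table),
            stdin=subprocess.DEVNULL,  # the child does not read the mapper's input
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        out.write("{}\t{}\n".format(key, done.stdout.strip()))
```

A mapper script would end with `run_mapper(sys.stdin)`.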

Now, when I run the streaming job, the mappers seem to be stuck and I don't
know what is going on. When I looked at the resource manager web UI, I didn't
see any new MR jobs (triggered from the hive query). I am trying to
understand this behavior.

This may be a bad idea to begin with, and there may be better ways to
accomplish the same task. However, I would like to understand the behavior
of such an MR job.
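
When a nested command appears stuck like this, bounding it with a timeout and
keeping its stderr tends to make the failure visible in the task logs. The
sketch below is a generic debugging aid, not a diagnosis of this case;
`run_with_timeout` and `report` are hypothetical names. The
`reporter:status:` line format is the one Hadoop streaming reads from a
task's stderr as a status update.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return (exit code or None on timeout, stdout, stderr)."""
    try:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, "", "timed out after {} s".format(timeout_s)
    return done.returncode, done.stdout, done.stderr


def report(message, stream=sys.stderr):
    """Write a status line in the form Hadoop streaming picks up from stderr."""
    stream.write("reporter:status:{}\n".format(message))
    stream.flush()
```

Calling `run_with_timeout(["hive", "-e", query], 300)` from the mapper and
writing the returned stderr to the task's own stderr would show whether the
hive client ever gets as far as submitting a job.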

Any thoughts?

Thank you,
Shirish