Re: delay before query starts processing

Ariel Marcus Wed, 30 Jan 2013 15:24:07 -0800

From the archives:
http://mail-archives.apache.org/mod_mbox/hive-user/201110.mbox/%3CCAC9SPjuQtxOK1KtEmReD6OanNTgNM_uLkGQD+=n7krcjcal...@mail.gmail.com%3E


TL;DR set hive.optimize.s3.query=true;
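
A minimal sketch of how that might look in a Hive CLI session, assuming the
setting is applied per session before the query is submitted. The table and
partition names are hypothetical, for illustration only.

```sql
-- Setting suggested in the archived thread linked above (Hive on Amazon EMR);
-- apply it in the session before running the query.
set hive.optimize.s3.query=true;

-- Hypothetical S3-backed table and partition column, for illustration only.
SELECT count(*) FROM my_s3_table WHERE dt = '2013-01-30';
```

On the logging question, starting the CLI with
hive --hiveconf hive.root.logger=INFO,console sends the client log to the
console, which may show what is happening during the pause.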

---------------------------------
Ariel Marcus, Consultant
www.openbi.com | ariel.mar...@openbi.com
150 N Michigan Avenue, Suite 2800, Chicago, IL 60601
Cell: 314-827-4356


On Wed, Jan 30, 2013 at 6:16 PM, Marc Limotte <mslimo...@gmail.com> wrote:

> Hi,
>
> I'm running in Amazon on an EMR cluster with hive 0.8.1.  We have a lot of
> other Hadoop jobs, but only started experimenting with Hive recently.
>
> I've been seeing a long pause between submitting a hive query and the
> actual start of the hadoop job... 10 minutes or more in some cases.  I'm
> wondering what's happening during this time.  Either a high level answer,
> or maybe there is some logging I can turn on?
>
> Here's some more detail.  I submit the query on the master using the hive
> cli, and start to see some output right away...
>
> Total MapReduce jobs = 2
> Launching Job 1 out of 2
> Number of reduce tasks not specified. Estimated from input data size: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
>
>
> *[then a long delay here: 10 minutes or more... no activity in the hadoop
> job tracker ui] *
>
>
> … and then it continues normally ...
> Starting Job = job_201301160029_0082, Tracking URL =
> http://ip-xxxxxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
> Kill Command = /home/hadoop/bin/hadoop job
>  -Dmapred.job.tracker=xxxxxx:9001 -kill job_201301160029_0082
> Hadoop job information for Stage-1: number of mappers: 2; number of
> reducers: 1
> 2013-01-30 20:45:30,526 Stage-1 map = 0%,  reduce = 0%
> …
>
> This query is processing in the neighborhood of 500GB of data from S3.  A
> couple of possibilities I thought of… perhaps someone can confirm or deny:
> a) Is the data copied from S3 to HDFS during this time?
> b) I have a fairly large set of libs in HIVE_AUX_JAR_PATH (around 175
> MB). Does it have to copy these out to the tasks at this time?
>
> Any insights appreciated.
>
> Marc
>
>
>
>
> ------------------------------
>
> This transmission is confidential and intended solely for the use of the
> recipient named above. It may contain confidential, proprietary, or legally
> privileged information. If you are not the intended recipient, you are
> hereby notified that any unauthorized review, use, disclosure or
> distribution is strictly prohibited. If you have received this transmission
> in error, please contact the sender by reply e-mail and delete the original
> transmission and all copies from your system.
>
