Re: Hive - Tez error with big join - Container expired.

2015-06-18 Thread Jianfeng (Jeff) Zhang

Tez holds idle containers for a while, but it also expires a container if it
reaches some threshold.
Have you set the property tez.am.container.idle.release-timeout-max.millis in
tez-site.xml? And can you attach the YARN app log?
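
To check whether that property is set, something like the following could be used. It is only a sketch: it assumes the usual Hadoop-style <configuration>/<property> layout, the helper name read_property is made up for illustration, and the sample value 20000 is hypothetical, not a recommendation.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the value of a Hadoop-style <property> by name, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Hypothetical tez-site.xml content; the value 20000 is illustrative only.
SAMPLE_TEZ_SITE = """<configuration>
  <property>
    <name>tez.am.container.idle.release-timeout-max.millis</name>
    <value>20000</value>
  </property>
</configuration>"""

timeout = read_property(
    SAMPLE_TEZ_SITE, "tez.am.container.idle.release-timeout-max.millis"
)
print(timeout)
```

A result of None would mean the property is not set in that file, so whatever default the installed Tez version ships with applies.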



Best regards,
Jeff Zhang


From: Daniel Klinger <d...@web-computing.de>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Thursday, June 18, 2015 at 5:35 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Hive - Tez error with big join - Container expired.

Hi all,

I have a pretty big Hive query. I'm joining over 3 Hive tables which have
thousands of lines each. I'm grouping this join by several columns. In the
Hive shell this query only reaches about 80%. After about 1400 seconds it is
canceled with the following error:

Status: Failed
Vertex failed, vertexName=Map 2, vertexId=vertex_1434357133795_0008_1_01, 
diagnostics=[Task failed, taskId=task_1434357133795_0008_1_01_33, 
diagnostics=[TaskAttempt 0 failed, 
info=[Containercontainer_1434357133795_0008_01_39 finished while trying to 
launch. Diagnostics: [Container failed. Container expired since it was 
unused]], TaskAttempt 1 failed, 
info=[Containercontainer_1434357133795_0008_01_55 finished while trying to 
launch. Diagnostics: [Container failed. Container expired since it was 
unused]], TaskAttempt 2 failed, 
info=[Containercontainer_1434357133795_0008_01_72 finished while trying to 
launch. Diagnostics: [Container failed. Container expired since it was 
unused]], TaskAttempt 3 failed, 
info=[Containercontainer_1434357133795_0008_01_000101 finished while trying to 
launch. Diagnostics: [Container failed. Container expired since it was 
unused]]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex 
vertex_1434357133795_0008_1_01 [Map 2] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.tez.TezTask
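
For longer diagnostics it can help to pull out the failed attempts and their container IDs so they can be looked up in the NodeManager logs. The following is a minimal sketch, not a Tez tool: the regular expression is an assumption based only on the message format shown above, and the helper name failed_attempts is made up for illustration.

```python
import re

# Pattern assumed from the diagnostics text above: "TaskAttempt N failed,
# info=[Container<container id> finished while trying to launch".
ATTEMPT_RE = re.compile(
    r"TaskAttempt (\d+) failed, info=\[Container(container_[\d_]+)"
)


def failed_attempts(diagnostics):
    """Return (attempt number, container id) pairs found in a diagnostics string."""
    # Undo mail line wrapping by collapsing all whitespace runs to single spaces.
    flat = " ".join(diagnostics.split())
    return [(int(number), cid) for number, cid in ATTEMPT_RE.findall(flat)]


# Excerpt copied from the error above (first two attempts only).
SAMPLE = """diagnostics=[TaskAttempt 0 failed,
info=[Containercontainer_1434357133795_0008_01_39 finished while trying to
launch. Diagnostics: [Container failed. Container expired since it was
unused]], TaskAttempt 1 failed,
info=[Containercontainer_1434357133795_0008_01_55 finished while trying to
launch. Diagnostics: [Container failed. Container expired since it was
unused]]"""

attempts = failed_attempts(SAMPLE)
expired = " ".join(SAMPLE.split()).count("Container expired since it was unused")

for number, cid in attempts:
    print(number, cid)
print("expired containers:", expired)
```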

My YARN resource manager is at 100% during the whole execution (using all of
the 300 GB memory). I tried to extend the lifetime of my containers with the
following setting in yarn-site.xml, but without success:

yarn.resourcemanager.rm.container-allocation.expiry-interval-ms = 120
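
One thing worth checking: the property name ends in -ms, so the value is presumably read as milliseconds. A small sketch of the arithmetic follows. It assumes the stock YARN default of 600000 ms for this property, which should be confirmed against yarn-default.xml for the installed version; the constant and helper names are made up for illustration.

```python
# Assumption: stock YARN default for the container allocation expiry interval.
DEFAULT_EXPIRY_MS = 600_000

# Value from the setting above.
CONFIGURED_EXPIRY_MS = 120


def ms_to_seconds(milliseconds):
    """Convert a millisecond interval to seconds."""
    return milliseconds / 1000.0


configured_s = ms_to_seconds(CONFIGURED_EXPIRY_MS)
default_s = ms_to_seconds(DEFAULT_EXPIRY_MS)
shorter_than_default = CONFIGURED_EXPIRY_MS < DEFAULT_EXPIRY_MS

print(f"configured: {configured_s} s, assumed default: {default_s} s")
print("configured value shortens the interval:", shorter_than_default)
```

If those assumptions hold, a value of 120 shortens the expiry interval instead of extending it.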

After this change my query stays at 0% for thousands of seconds. The query
itself works (tested with less data). How can I solve this problem?

Thanks for your help.

Greetz
DK


Re: Hive - Tez error with big join - Container expired.

2015-06-18 Thread Gopal Vijayaraghavan
> I have a pretty big Hive Query. I'm joining over 3 Hive-Tables which
>have thousands of lines each. I'm grouping this join by several columns.

Hive-on-Tez shouldn't have any issue even with billions of lines on a JOIN.

> 0 failed, info=[Containercontainer_1434357133795_0008_01_39 finished
>while trying to launch. Diagnostics: [Container failed. Container expired
>since it was unused]], TaskAttempt 1 failed,

Looks like your node manager is actually not spinning up a container that
was allocated (i.e., allocation succeeded, but the task spin-up failed).

Which YARN scheduler are you running (fair/capacity?) and do you have any
idea what the NodeManager logs say about trying to spin up this container?

If I'm not wrong, you also need to check whether the YARN user has a ulimit
set for the total number of processes on the NM nodes.
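
One way to look at that limit is sketched below. This is only a sketch: it reports the limit for whichever user runs the script, so it would have to be run as the YARN user on each NodeManager host, and it assumes a platform where Python's resource module exposes RLIMIT_NPROC (Linux does). The helper name nproc_limit is made up for illustration.

```python
import resource


def nproc_limit():
    """Return the (soft, hard) process-count limits for the current user."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

    def readable(value):
        # RLIM_INFINITY means no limit is enforced.
        return "unlimited" if value == resource.RLIM_INFINITY else value

    return readable(soft), readable(hard)


soft, hard = nproc_limit()
print(f"max user processes: soft={soft}, hard={hard}")
```

This should report the same soft limit as `ulimit -u` in a shell started by that user.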


 
Cheers,
Gopal