[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963048#comment-13963048
 ] 

Eric Payne commented on MAPREDUCE-4937:
---------------------------------------

When the size of the split meta info file exceeds the value of the 
mapreduce.jobtracker.split.metainfo.maxsize property set for a job, 
SplitMetaInfoReader.readSplitMetaInfo() throws an IOException. 
JobImpl$InitTransition.transition() has a catch block for IOException. However, 
JobImpl$InitTransition.createSplits(), which sits between 
JobImpl$InitTransition.transition() and 
SplitMetaInfoReader.readSplitMetaInfo() in the call chain, catches the 
IOException first and throws a YarnRuntimeException instead.

Because JobImpl neither expects nor catches this runtime exception anywhere, 
the proper diagnostic is never set. As a result, the Job UI for the failed job 
displays only a cryptic message:
AM Container for appattempt_1396371248625_0009_000003 exited with exitCode: 1 
due to: Exception from container-launch: 
org.apache.hadoop.util.Shell$ExitCodeException:
...
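
To make the call chain above concrete, here is a minimal, self-contained Java 
sketch of the pattern. It is not the Hadoop source: the class 
WrappedExceptionSketch, the SketchRuntimeException type (standing in for 
YarnRuntimeException), the setup() helper, the size arguments and the message 
text are all hypothetical stand-ins. The transitionWithDiagnostic() method shows 
one possible way to surface the diagnostic and should not be read as the patch 
for this issue.

```java
import java.io.IOException;

// Simplified stand-ins for illustration only; this is not the Hadoop source.
class WrappedExceptionSketch {

    // Hypothetical stand-in for YarnRuntimeException (an unchecked exception).
    static class SketchRuntimeException extends RuntimeException {
        SketchRuntimeException(Throwable cause) {
            super(cause);
        }
    }

    // Plays the role of SplitMetaInfoReader.readSplitMetaInfo():
    // rejects a meta info file that is larger than the configured maximum.
    static void readSplitMetaInfo(long fileSize, long maxSize) throws IOException {
        if (fileSize > maxSize) {
            throw new IOException("split meta info too large: " + fileSize + " > " + maxSize);
        }
    }

    // Plays the role of createSplits(): catches the checked IOException
    // and rethrows it wrapped in an unchecked exception.
    static void createSplits(long fileSize, long maxSize) {
        try {
            readSplitMetaInfo(fileSize, maxSize);
        } catch (IOException e) {
            throw new SketchRuntimeException(e);
        }
    }

    // Hypothetical other init work that may throw IOException, so that the
    // catch block in transition() is legal Java.
    static void setup() throws IOException {
        // no-op in this sketch
    }

    // Plays the role of transition(): it only handles IOException, so the
    // wrapped exception thrown by createSplits() passes straight through.
    static String transition(long fileSize, long maxSize) {
        try {
            setup();
            createSplits(fileSize, maxSize);
            return "INITED";
        } catch (IOException e) {
            return "FAILED: " + e.getMessage();
        }
    }

    // One possible handling (an assumption, not the actual fix): also catch
    // the unchecked wrapper and turn its cause into a diagnostic.
    static String transitionWithDiagnostic(long fileSize, long maxSize) {
        try {
            setup();
            createSplits(fileSize, maxSize);
            return "INITED";
        } catch (IOException e) {
            return "FAILED: " + e.getMessage();
        } catch (SketchRuntimeException e) {
            return "FAILED: " + e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(transition(100, 1000));
        try {
            transition(5000, 1000);
        } catch (RuntimeException e) {
            System.out.println("escaped transition(): " + e);
        }
        System.out.println(transitionWithDiagnostic(5000, 1000));
    }
}
```

In this sketch the oversized case never reaches the catch (IOException e) block 
in transition(), which mirrors why no diagnostic is recorded for the job.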


> MR AM handles an oversized split metainfo file poorly
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-4937
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4937
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Eric Payne
>
> When a job runs with a split metainfo file that's larger than it has been 
> configured to handle, it just crashes.  This leaves the user with a 
> less-than-ideal debug session since there are no useful diagnostic messages 
> sent to the client for this failure.  In addition, it crashes before 
> registering/unregistering with the RM and crashes without generating history, 
> so the proxy URL is not very useful and there's no archived configuration to 
> check to see what setting the AM was using when it encountered the error.
> The AM should handle this error case more gracefully and treat the failure as 
> it does any other failed job, with a proper unregistration from the RM and 
> with history.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
