[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783420#comment-16783420
 ] 

BELUGA BEHR commented on MAPREDUCE-7180:
----------------------------------------

Hey [~wilfreds],

Thanks for the input.

I'm not exactly sure what the 'application re-run' is referring to.  My 
intention in this JIRA was to address only the Mappers/Reducers.  I have opened 
[YARN-9260] to discuss the ApplicationMasters as a separate issue.  As I 
understand it, though, the AMs are retried one time if they fail, so if the 
retry is going to happen anyway, we might as well throw some more memory at it.
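
For reference, the knobs I am assuming govern those AM re-runs are 
mapreduce.am.max-attempts (per job) and yarn.resourcemanager.am.max-attempts 
(the cluster-wide ceiling).  A trivial sketch that just prints the effective 
values on a Hadoop client classpath, only to make clear which settings I mean:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.MRJobConfig;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmRetrySettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Per-job cap on AM attempts (stock default is 2: the original run plus one retry).
    int jobAmAttempts = conf.getInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 2);
    // Cluster-wide ceiling enforced by the ResourceManager.
    int rmAmAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
        YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS);
    System.out.println(MRJobConfig.MR_AM_MAX_ATTEMPTS + " = " + jobAmAttempts);
    System.out.println(YarnConfiguration.RM_AM_MAX_ATTEMPTS + " = " + rmAmAttempts);
  }
}
{code}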

I think this feature is helpful to have as yet another tool in the toolbox.  
One thing I don't think we're on the same page about yet: the retry scenarios 
you point out are all valid, and they all already exist in the YARN framework.  
There is already a retry capability implemented that allows for retries at both 
the application and the worker level.  For me, that is what started this 
conversation.  I noticed that the retry logic was too naive and would re-launch 
failed containers that had no hope of succeeding (because they would OOM every 
time, or be killed by the NM for exceeding their memory limits, regardless of 
where they ran).  This request is simply an extension of the retry capability, 
giving the containers at least some chance of completing, and it is in line 
with standard retry and backoff strategies.
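
To make the ask concrete, here is a rough sketch of the kind of escalation 
policy I have in mind.  The class and method names below are made up for 
illustration and are not an existing MR/YARN API:

{code:java}
// Rough sketch only -- MemoryEscalationPolicy is a hypothetical name, not an
// existing MR/YARN class.  It just illustrates bounded, backoff-style
// escalation of the container memory request across task attempts.
public final class MemoryEscalationPolicy {

  private final long baseMemoryMb;        // memory originally requested for the task
  private final double escalationFactor;  // e.g. 1.5 == grow the request by 50% per retry
  private final long schedulerMaxMb;      // never ask for more than the scheduler allows

  public MemoryEscalationPolicy(long baseMemoryMb, double escalationFactor,
      long schedulerMaxMb) {
    this.baseMemoryMb = baseMemoryMb;
    this.escalationFactor = escalationFactor;
    this.schedulerMaxMb = schedulerMaxMb;
  }

  /**
   * Memory to request for the given attempt (attempt 0 is the first run).
   * Only escalate when the previous attempt was killed for exceeding its
   * memory limit; any other failure keeps the original request.
   */
  public long memoryForAttempt(int attempt, boolean previousAttemptMemoryKilled) {
    if (attempt == 0 || !previousAttemptMemoryKilled) {
      return baseMemoryMb;
    }
    double escalated = baseMemoryMb * Math.pow(escalationFactor, attempt);
    return Math.min(Math.round(escalated), schedulerMaxMb);
  }

  public static void main(String[] args) {
    // Example: 1024 MB base request, 50% growth per retry, 8 GB scheduler max.
    MemoryEscalationPolicy policy = new MemoryEscalationPolicy(1024, 1.5, 8192);
    for (int attempt = 0; attempt <= 3; attempt++) {
      System.out.printf("attempt %d -> %d MB%n",
          attempt, policy.memoryForAttempt(attempt, attempt > 0));
    }
    // attempt 0 -> 1024 MB, 1 -> 1536 MB, 2 -> 2304 MB, 3 -> 3456 MB
  }
}
{code}

With a 1 GB base request and a 50% growth factor, successive attempts would ask 
for roughly 1024, 1536, 2304, and 3456 MB, capped at whatever the scheduler 
maximum allows.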

> Relaunching Failed Containers
> -----------------------------
>
>                 Key: MAPREDUCE-7180
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7180
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv1, mrv2
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> In my experience, it is very common that a MR job completely fails because a 
> single Mapper/Reducer container is using more memory than has been reserved 
> in YARN.  The following message is logged by the MapReduce 
> ApplicationMaster:
> {code}
> Container [pid=46028,containerID=container_e54_1435155934213_16721_01_003666] 
> is running beyond physical memory limits. 
> Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual 
> memory used. Killing container.
> {code}
> In this case, the container is re-launched on another node, and of course, it 
> is killed again for the same reason.  This process happens three (maybe 
> four?) times before the entire MapReduce job fails.  It's often said that the 
> definition of insanity is doing the same thing over and over and expecting 
> different results.
> For all intents and purposes, the amount of resources requested by Mappers 
> and Reducers is a fixed amount, based on the default configuration values.  
> Users can set the memory on a per-job basis, but it's a pain, not exact, and 
> requires intimate knowledge of the MapReduce framework and its memory usage 
> patterns.
> I propose that if the MR ApplicationMaster detects that a container was 
> killed because of this specific memory resource constraint, it should request 
> a larger container for the subsequent task attempt.
> For example, increase the requested memory size by 50% each time the 
> container fails and the task is retried.  This would prevent many job 
> failures and still allow additional per-job memory tuning after the fact to 
> improve performance (rather than just to get the job to succeed).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
