[ https://issues.apache.org/jira/browse/MAPREDUCE-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760398#comment-16760398 ]
BELUGA BEHR commented on MAPREDUCE-7180:
----------------------------------------

{quote}
Ideally, the user responsible for this job would notice the large number of map failures and follow up, but this does not always happen in a timely fashion. If the job fails the first time it goes over its memory limit, the problem will more likely be addressed sooner, avoiding wasted cluster resources.
{quote}

Yes. Your point is well taken. There should be a clear warning presented in a log (in a UI?) when this condition occurs, to alleviate this exact scenario.

Note to self: I mentioned a growth factor of 1.0f. I can't think of a scenario where we would want to shrink the container size in response to a failure, so the configuration should be indexed on 0. A growth factor of 0.0f would disable the feature, and 0.1f would mean increasing the size by 10% on each attempt. (Rough sketches of the detection and escalation logic follow the quoted issue below.)


> Relaunching Failed Containers
> -----------------------------
>
>                 Key: MAPREDUCE-7180
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7180
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv1, mrv2
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> In my experience, it is very common for an MR job to fail completely because a
> single Mapper/Reducer container uses more memory than has been reserved in
> YARN. The following message is logged by the MapReduce ApplicationMaster:
> {code}
> Container [pid=46028,containerID=container_e54_1435155934213_16721_01_003666]
> is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB
> physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.
> {code}
> In this case, the container is re-launched on another node and, of course, is
> killed again for the same reason. This process repeats three (maybe four?)
> times before the entire MapReduce job fails. It's often said that the
> definition of insanity is doing the same thing over and over and expecting
> different results.
> For all intents and purposes, the amount of resources requested by Mappers and
> Reducers is fixed, based on the default configuration values. Users can set
> the memory on a per-job basis, but it's a pain, not exact, and requires
> intimate knowledge of the MapReduce framework and its memory usage patterns.
> I propose that if the MR ApplicationMaster detects that a container was killed
> because of this specific memory resource constraint, it should request a
> larger container for the subsequent task attempt.
> For example, increase the requested memory size by 50% each time the container
> fails and the task is retried. This would prevent many job failures and still
> allow per-job memory tuning after the fact to improve performance (vs. merely
> succeed/fail).
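To ground the detection side of the proposal: YARN already reports distinct exit statuses when it kills a container for exceeding its physical or virtual memory limit, so the ApplicationMaster could key off those. A minimal sketch, where the helper class is hypothetical but the {{ContainerExitStatus}} constants are YARN's own:

{code:java}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Sketch only: classify a completed container, keying off the exit
// statuses YARN assigns to memory-limit kills. The class itself is a
// hypothetical illustration, not existing MapReduce code.
public final class MemoryKillDetector {

  private MemoryKillDetector() {}

  /** True if the container was killed for exceeding a memory limit. */
  public static boolean killedForMemory(ContainerStatus status) {
    int exit = status.getExitStatus();
    return exit == ContainerExitStatus.KILLED_EXCEEDED_PMEM
        || exit == ContainerExitStatus.KILLED_EXCEEDED_VMEM;
  }
}
{code}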
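And a minimal sketch of the escalation side, assuming the 0-indexed growth factor described in the comment above. The configuration key, class, and method names are hypothetical illustrations, not existing Hadoop configuration or API:

{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch of the growth-factor idea: 0.0f disables
// escalation; 0.1f grows the request by 10% per memory-kill retry.
public final class ContainerMemoryEscalator {

  // Hypothetical property name, used only for this illustration.
  public static final String GROWTH_FACTOR_KEY =
      "mapreduce.container.memory.growth-factor";
  public static final float DEFAULT_GROWTH_FACTOR = 0.0f;

  private ContainerMemoryEscalator() {}

  /**
   * Memory (MB) to request for a task attempt, given how many prior
   * attempts were killed for exceeding their memory limit.
   */
  public static long memoryForAttempt(Configuration conf,
                                      long baseMemoryMb,
                                      int memoryKillCount) {
    float growth = conf.getFloat(GROWTH_FACTOR_KEY, DEFAULT_GROWTH_FACTOR);
    // Compound growth: with 0.5f, a 1024 MB request becomes 1536 MB on
    // the first retry and 2304 MB on the second. A factor of 0.0f
    // leaves the request unchanged, i.e. the feature is off.
    double scale = Math.pow(1.0 + growth, memoryKillCount);
    return (long) Math.ceil(baseMemoryMb * scale);
  }
}
{code}

With a factor of 0.5f this matches the 50% example in the issue. A real implementation would presumably also need to cap the result at the scheduler's maximum allocation so an escalated attempt remains schedulable.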