[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-04-09 Thread Ravi Prakash (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627082#comment-13627082
]

Ravi Prakash commented on MAPREDUCE-3949:
-

A reliable way to trigger the race:
1. Attach a debugger to the NM and the RM.
2. Set a breakpoint on each of these lines:
- ContainerLaunch.java:347 : new DelayedProcessKiller(user, processId,
sleepDelayBeforeSigKill, Signal.KILL, exec).start();
- RMContainerImpl.java:289 : RMContainerFinishedEvent finishedEvent =
(RMContainerFinishedEvent) event;
3. Run a job that is sure to exceed its container limits, then continue past
both breakpoints (they are there only to give the AM enough time to unregister).

If AM fails due to overrunning resource limits, error not visible through UI
sometimes
--

Key: MAPREDUCE-3949
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3949
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 0.24.0, 0.23.2
Reporter: Todd Lipcon
Assignee: Ravi Prakash
Priority: Minor
Attachments: MAPREDUCE-3949.patch

I had a case where an MR AM eclipsed the configured memory limit. This caused
the AM's container to get killed, but nowhere accessible through the web UI
showed these diagnostics. I had to go view the NM's logs via ssh before I
could figure out what had happened to my application.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-04-04 Thread Ravi Prakash (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622823#comment-13622823
]

Ravi Prakash commented on MAPREDUCE-3949:
-

I discussed this with Tom and Daryn.

* If the NM were able to tell the AM why it is being shot, the AM could
technically include that diagnostic message in the
FinishApplicationMasterRequest. However, such a mechanism (to tell the AM the
reason it is being shot) doesn't exist yet and would be very brittle.
* If the NM told the RM before shooting the AM, there would still be a race.
What if the AM had completed everything it needed to do, and then got shot by
the NM? Then the job would have been successful but be marked as FAILED by the
RM.
* If we changed the State of a FINISHED / KILLED application to FAILED / FAILED
on receiving RMAppAttemptEventType.CONTAINER_FINISHED, the client might still
only get the FINISHED / KILLED message, and the user would have to go to the RM
page to see what really happened. Our current opinion is that this is probably
the best way to go for now.

Opinions anybody?
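
The third option above can be sketched as a toy state holder. This is only an illustration under stated assumptions: the class, enum, and method names below are hypothetical and are not the real RMAppImpl / RMAppAttemptImpl API, and the non-zero exit status stands in for whatever the NM reports when it kills a container. The sketch lets a late CONTAINER_FINISHED override FINISHED / KILLED, but leaves a successful unregistration alone, which is the concern raised in the second bullet.

```java
// Toy model only: these are NOT the real YARN classes or their APIs.
final class AttemptStateSketch {
    enum State { RUNNING, FINISHED, FAILED }
    enum FinalStatus { UNDEFINED, SUCCEEDED, KILLED, FAILED }

    private State state = State.RUNNING;
    private FinalStatus finalStatus = FinalStatus.UNDEFINED;
    private String diagnostics = "";

    /** Models the AM's FinishApplicationMasterRequest reaching the RM. */
    void onUnregister(FinalStatus reported) {
        if (state == State.RUNNING) {
            state = State.FINISHED;
            finalStatus = reported;
        }
    }

    /** Models CONTAINER_FINISHED for the AM container reaching the RM. */
    void onAmContainerFinished(int exitStatus, String containerDiagnostics) {
        boolean neverUnregistered = state == State.RUNNING;
        boolean killedAfterUnregister = state == State.FINISHED
                && finalStatus == FinalStatus.KILLED
                && exitStatus != 0;
        if (neverUnregistered || killedAfterUnregister) {
            state = State.FAILED;
            finalStatus = FinalStatus.FAILED;
            diagnostics = containerDiagnostics;
        }
        // FINISHED / SUCCEEDED is left untouched, so a job that really
        // completed is not re-marked as FAILED.
    }

    State state() { return state; }
    FinalStatus finalStatus() { return finalStatus; }
    String diagnostics() { return diagnostics; }

    public static void main(String[] args) {
        AttemptStateSketch attempt = new AttemptStateSketch();
        attempt.onUnregister(FinalStatus.KILLED);            // AM unregisters first
        attempt.onAmContainerFinished(1, "over memory limit"); // NM report arrives late
        System.out.println(attempt.state() + " / " + attempt.finalStatus()
                + ": " + attempt.diagnostics());
    }
}
```

As the bullet notes, a client that polled between the two events would still have seen FINISHED / KILLED; the sketch only covers what the RM would record afterwards.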

[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-04-03 Thread Ravi Prakash (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621199#comment-13621199
]

Ravi Prakash commented on MAPREDUCE-3949:
-

The race seems to be between receiving an RMAppAttemptContainerFinishedEvent
in RMContainerImpl.java's FinishedTransition and a
FinishApplicationMasterRequest in ApplicationMasterService. Any preferences on
how to fix it? A few options come to mind:
1. Make the AM not send the FinishApplicationMasterRequest when it detects (if
it can) that the NM is killing it.
2. Have the NM contact the RM before killing an AM container so that when the
AM does send the FinishApplicationMasterRequest, the RM knows to ignore it.
3. Make the RMAppAttemptEventType.CONTAINER_FINISHED change the state of the
AppAttempt even after FinishApplicationMasterRequest has changed the state to
FINISHING / KILLED.

What do you think?
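
The ordering dependence described above can be sketched with a simplified "first terminal event wins" model. This is an assumption made for illustration, not the RM's actual state machine; the class and event names are hypothetical. It shows only that the recorded outcome depends on which of the two messages the RM processes first.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the race; not the real RM code.
final class AmExitRaceSketch {
    enum Event { UNREGISTER_KILLED, CONTAINER_FINISHED_OVER_LIMIT }

    /** Returns "State/FinalState" under a first-terminal-event-wins rule. */
    static String outcome(List<Event> arrivalOrder) {
        for (Event e : arrivalOrder) {
            switch (e) {
                case UNREGISTER_KILLED:
                    // The AM unregistered first: the NM diagnostics are lost.
                    return "FINISHED/KILLED";
                case CONTAINER_FINISHED_OVER_LIMIT:
                    // The container report came first: diagnostics are kept.
                    return "FAILED/FAILED";
                default:
                    break;
            }
        }
        return "RUNNING/UNDEFINED";
    }

    public static void main(String[] args) {
        System.out.println(outcome(Arrays.asList(
                Event.CONTAINER_FINISHED_OVER_LIMIT, Event.UNREGISTER_KILLED)));
        System.out.println(outcome(Arrays.asList(
                Event.UNREGISTER_KILLED, Event.CONTAINER_FINISHED_OVER_LIMIT)));
    }
}
```

Options 1 and 2 try to keep the second ordering from occurring; option 3 changes what happens when it does.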

[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-04-01 Thread Ravi Prakash (JIRA)


[
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619207#comment-13619207
]

Ravi Prakash commented on MAPREDUCE-3949:
-

I'll work on this between other stuff I'm working on. I've been trying to find 
where the race is. The events from the NM looked good. Seems like the race is 
in the depths of RMAppImpl or RMAppAttemptImpl. Will keep digging.

I just checked, we can see this with the fair scheduler as well as the capacity 
scheduler.

[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-03-30 Thread Vinod Kumar Vavilapalli (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13618236#comment-13618236
]

Vinod Kumar Vavilapalli commented on MAPREDUCE-3949:

[~raviprak] says [on
MAPREDUCE-3688|https://issues.apache.org/jira/browse/MAPREDUCE-3688?focusedCommentId=13606901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606901]:
bq. From my testing on trunk, I notice that even for the case where the AM goes
over container limits (which I trigger with
-Dyarn.app.mapreduce.am.resource.mb=512
-Dyarn.app.mapreduce.am.command-opts=-Xmx3500m on a sleep job), sometimes the
error is propagated back and sometimes it's not. Can you please corroborate
this? When State == FinalState == FAILED, the error is propagated back. However,
about half the time, State == FINISHED and FinalState == KILLED, in which case
there is no message anywhere to help me. Not in the diagnostics, and there are
no logs.
