[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-04-09 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627082#comment-13627082
 ] 

Ravi Prakash commented on MAPREDUCE-3949:
-----------------------------------------

To trigger the race reliably:
1. Run the NM and the RM under a debugger.
2. Set breakpoints on these lines:
  - ContainerLaunch.java:347 : new DelayedProcessKiller(user, processId, 
sleepDelayBeforeSigKill, Signal.KILL, exec).start();
  - RMContainerImpl.java:289 : RMContainerFinishedEvent finishedEvent = 
(RMContainerFinishedEvent) event;
3. Run a job that is sure to exceed its container limits.
4. Continue past both breakpoints. (The pauses are only there to give the AM 
enough time to unregister.)


If AM fails due to overrunning resource limits, error not visible through UI sometimes
--------------------------------------------------------------------------------------

             Key: MAPREDUCE-3949
             URL: https://issues.apache.org/jira/browse/MAPREDUCE-3949
         Project: Hadoop Map/Reduce
      Issue Type: Bug
Affects Versions: 0.24.0, 0.23.2
        Reporter: Todd Lipcon
        Assignee: Ravi Prakash
        Priority: Minor
     Attachments: MAPREDUCE-3949.patch


I had a case where an MR AM exceeded the configured memory limit. This caused 
the AM's container to get killed, but the diagnostics were not shown anywhere 
accessible through the web UI. I had to view the NM's logs via ssh before I 
could figure out what had happened to my application.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-04-04 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622823#comment-13622823
 ] 

Ravi Prakash commented on MAPREDUCE-3949:
-----------------------------------------

I discussed this with Tom and Daryn.

* If the NM were able to tell the AM why it's being shot, the AM could 
technically include that diagnostic message in the 
FinishApplicationMasterRequest. However, such a mechanism (to tell the AM the 
reason it's being shot) doesn't exist yet and would be very brittle.
* If the NM told the RM before shooting the AM, there would still be a race. 
What if the AM had completed everything it needed to do, and then got shot by 
the NM? Then the job would have been successful but would be marked as FAILED 
by the RM.
* If we changed the State of a FINISHED / KILLED application to FAILED / FAILED 
on receiving RMAppAttemptEventType.CONTAINER_FINISHED, the client might still 
only get the FINISHED / KILLED message, and the user would have to go to the RM 
page to see what really happened. Our current opinion is that this is probably 
the best way to go for now.

Opinions anybody?
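
To make the third bullet concrete, here is a rough sketch of the behaviour it 
describes. This is a toy model, not RM code: LateFailureSketch, App, Report and 
onAmContainerFinished are hypothetical names invented for illustration, and 
only the FINISHED / KILLED and FAILED / FAILED pairs come from the discussion 
above.

```java
// Toy model only; none of these names are part of the YARN API.
class LateFailureSketch {
    // Immutable snapshot of what a client or the RM page would display.
    static final class Report {
        final String state;
        final String finalStatus;
        final String diagnostics;

        Report(String state, String finalStatus, String diagnostics) {
            this.state = state;
            this.finalStatus = finalStatus;
            this.diagnostics = diagnostics;
        }

        @Override
        public String toString() {
            return state + " / " + finalStatus + " [" + diagnostics + "]";
        }
    }

    static final class App {
        // The AM has already unregistered, so the app looks FINISHED / KILLED.
        private Report current = new Report("FINISHED", "KILLED", "");

        Report report() {
            return current;
        }

        // A late CONTAINER_FINISHED for the AM container overrides the outcome,
        // but only for the FINISHED / KILLED case named in the bullet.
        void onAmContainerFinished(String nmDiagnostics) {
            if (current.state.equals("FINISHED") && current.finalStatus.equals("KILLED")) {
                current = new Report("FAILED", "FAILED", nmDiagnostics);
            }
        }
    }

    public static void main(String[] args) {
        App app = new App();
        Report seenByClient = app.report();   // client polls before the event
        app.onAmContainerFinished("container killed: over memory limit");
        Report seenOnRmPage = app.report();   // RM page is read afterwards
        System.out.println("client:  " + seenByClient);
        System.out.println("RM page: " + seenOnRmPage);
    }
}
```

The client that polled early keeps its FINISHED / KILLED snapshot with empty 
diagnostics, while a later look at the RM page shows FAILED / FAILED with the 
NM's message, which is the trade-off noted in that bullet.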




[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-04-03 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621199#comment-13621199
 ] 

Ravi Prakash commented on MAPREDUCE-3949:
-----------------------------------------

The race seems to be between receiving an RMAppAttemptContainerFinishedEvent 
in RMContainerImpl.java's FinishedTransition and receiving a 
FinishApplicationMasterRequest in ApplicationMasterService. Any preferences on 
how to fix it? A few options come to mind:
1. Make the AM not send the FinishApplicationMasterRequest when it detects (if 
it can) that the NM is killing it.
2. Have the NM contact the RM before killing an AM container so that when the 
AM does send the FinishApplicationMasterRequest, the RM knows to ignore it.
3. Make the RMAppAttemptEventType.CONTAINER_FINISHED change the state of the 
AppAttempt even after FinishApplicationMasterRequest has changed the state to 
FINISHING / KILLED.

What do you think?
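
For anyone trying to picture the two orderings, here is a toy model of the 
race. It is not the RM state machine: AttemptRaceSketch, its enums and the 
replay helper are hypothetical names made up for this sketch, and the 
transitions are simplified assumptions that only mirror the two events 
discussed above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the race; every name here is hypothetical, not a YARN class.
class AttemptRaceSketch {
    enum Event { UNREGISTER, CONTAINER_FINISHED }
    enum State { RUNNING, FINISHING, FINISHED, FAILED }

    static final class Attempt {
        State state = State.RUNNING;
        final List<String> diagnostics = new ArrayList<>();

        // nmMessage is only meaningful for CONTAINER_FINISHED events.
        void handle(Event event, String nmMessage) {
            switch (state) {
                case RUNNING:
                    if (event == Event.UNREGISTER) {
                        state = State.FINISHING;
                    } else {
                        // The NM's report arrives first: failure is recorded.
                        state = State.FAILED;
                        diagnostics.add(nmMessage);
                    }
                    break;
                case FINISHING:
                    if (event == Event.CONTAINER_FINISHED) {
                        // The AM unregistered first: the NM's message is dropped.
                        state = State.FINISHED;
                    }
                    break;
                default:
                    // Terminal states ignore further events in this sketch.
                    break;
            }
        }
    }

    // Feeds the events to a fresh attempt in the given order.
    static Attempt replay(Event... order) {
        Attempt attempt = new Attempt();
        for (Event event : order) {
            attempt.handle(event, "container killed: over memory limit");
        }
        return attempt;
    }

    public static void main(String[] args) {
        Attempt nmFirst = replay(Event.CONTAINER_FINISHED, Event.UNREGISTER);
        Attempt amFirst = replay(Event.UNREGISTER, Event.CONTAINER_FINISHED);
        System.out.println(nmFirst.state + " " + nmFirst.diagnostics);
        System.out.println(amFirst.state + " " + amFirst.diagnostics);
    }
}
```

The same two events end in FAILED with diagnostics when the container event 
wins, and in FINISHED with an empty diagnostics list when the unregister wins. 
In this model, option 3 would amount to changing the FINISHING branch so that 
it keeps the NM's message.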



[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-04-01 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619207#comment-13619207
 ] 

Ravi Prakash commented on MAPREDUCE-3949:
-----------------------------------------

I'll work on this alongside my other tasks. I've been trying to find where the 
race is. The events from the NM looked good, so the race seems to be in the 
depths of RMAppImpl or RMAppAttemptImpl. Will keep digging.

I just checked: this reproduces with the fair scheduler as well as the 
capacity scheduler.



[jira] [Commented] (MAPREDUCE-3949) If AM fails due to overrunning resource limits, error not visible through UI sometimes

2013-03-30 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13618236#comment-13618236
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3949:
----------------------------------------------------

[~raviprak] says [on 
MAPREDUCE-3688|https://issues.apache.org/jira/browse/MAPREDUCE-3688?focusedCommentId=13606901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606901]:
bq. From my testing on trunk, I notice that even for the case where the AM goes 
over container limits (which I trigger with 
-Dyarn.app.mapreduce.am.resource.mb=512 
-Dyarn.app.mapreduce.am.command-opts=-Xmx3500m on a sleep job), sometimes the 
error is propagated back and sometimes it's not. Can you please corroborate 
this? When State == FinalState == FAILED, the error is propagated back. 
However, about half the time, State == FINISHED and FinalState == KILLED, in 
which case there is no message anywhere to help me: nothing in the 
diagnostics, and there are no logs.
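
As a quick sanity check on the trigger quoted above, the sketch below compares 
the two settings. It is only an illustration: AmLimitSketch and xmxToMb are 
hypothetical names, the parser handles just the m and g suffixes, and whether 
the container is actually killed depends on what the NM measures at runtime, 
which this does not model.

```java
// Hypothetical helper for illustration; not part of Hadoop.
class AmLimitSketch {
    // Parses a -Xmx value such as "-Xmx3500m" or "-Xmx2g" into megabytes.
    static long xmxToMb(String opt) {
        String value = opt.substring("-Xmx".length()).toLowerCase();
        char unit = value.charAt(value.length() - 1);
        long amount = Long.parseLong(value.substring(0, value.length() - 1));
        switch (unit) {
            case 'g':
                return amount * 1024;
            case 'm':
                return amount;
            default:
                throw new IllegalArgumentException("unsupported unit in " + opt);
        }
    }

    public static void main(String[] args) {
        long containerMb = 512;                 // yarn.app.mapreduce.am.resource.mb
        long heapMb = xmxToMb("-Xmx3500m");     // yarn.app.mapreduce.am.command-opts
        System.out.println("heap ceiling " + heapMb + " MB, container " + containerMb
                + " MB, heap may outgrow container: " + (heapMb > containerMb));
    }
}
```

With these values the JVM is allowed a heap of 3500 MB inside a 512 MB 
container, so the AM has plenty of room to cross the limit.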
