[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-12-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786044#action_12786044
 ] 

Hudson commented on MAPREDUCE-1119:
---

Integrated in Hadoop-Mapreduce-trunk #164 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/164/])
MAPREDUCE-1119. When tasks fail to report status, show tasks's stack dump before killing. 
Contributed by Aaron Kimball.


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785804#action_12785804
 ] 

Hudson commented on MAPREDUCE-1119:
---

Integrated in Hadoop-Mapreduce-trunk-Commit #143 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/143/])
MAPREDUCE-1119. When tasks fail to report status, show tasks's stack dump before killing. 
Contributed by Aaron Kimball.


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-12-02 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785176#action_12785176
 ] 

Vinod K V commented on MAPREDUCE-1119:
--

The contrib test failure is related to MAPREDUCE-1124. This patch is good for 
check-in.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784847#action_12784847
 ] 

Hadoop QA commented on MAPREDUCE-1119:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12426315/MAPREDUCE-1119.6.patch
  against trunk revision 886059.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 17 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/160/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/160/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/160/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/160/console

This message is automatically generated.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-12-02 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784707#action_12784707
 ] 

Vinod K V commented on MAPREDUCE-1119:
--

bq. I am not very sure we want to use the same configuration property for 
sleeping after dump-stack
bq. So I think the delay is necessary. As for adding a new config item.. maybe 
wait til someone makes an issue of this? Is there any performance implication 
here?
Given that the delay is necessary, it is only a question of the config item - new or 
the same one. Thinking a bit about it, I am fine with keeping the same 
configuration, because it effectively gives the thread-dump the same amount of time as the 
process gets to clean itself up. I know it isn't a very concrete reason, but it works 
for me for now :)

Gone through your latest patch. It looks good now. +1 from my side.

I'll run it through Hudson once again to be sure about the test cases. Maybe the 
{{TestGridmixSubmission}} and {{TestJobHistory.testDoneFolderOnHDFS()}} 
failures are timing-related.

After Hudson blesses, can you ask someone to commit this? Thanks for being 
patient throughout!

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-30 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783830#action_12783830
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

* GridMix has been sporadically failing for a while now in Hudson, but not in a 
deterministic fashion. Cannot reproduce locally.
* The fair scheduler test times out on both my trunk and MAPREDUCE-1119 
branches locally. MAPREDUCE-1245?
* TestJobHistory passes locally both on my trunk and MAPREDUCE-1119 branches.


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




Re: [jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-29 Thread 易剑
We shouldn't instantiate all objects when a job is submitted, since that will increase
memory usage.

2009/11/30 易剑 

> I want to initialize tasks into a file on the jobtracker, which will reduce
> memory usage.


Re: [jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-29 Thread 易剑
I want to initialize tasks into a file on the jobtracker, which will reduce
memory usage.


[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-28 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783267#action_12783267
 ] 

Arun C Murthy commented on MAPREDUCE-1119:
--

Which of these tests fail without our patch, i.e. on a clean trunk?

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-28 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783265#action_12783265
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

GridMix and FairScheduler test failures are definitely unrelated. I ran 
TestJobHistory locally and it passes -- also unrelated, I think?

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783225#action_12783225
 ] 

Hadoop QA commented on MAPREDUCE-1119:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12426315/MAPREDUCE-1119.6.patch
  against trunk revision 884832.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 17 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/277/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/277/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/277/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/277/console

This message is automatically generated.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.6.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-27 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783203#action_12783203
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

To address your various points:

bq. src/c++/task-controller/main.c
The program attempts to deduce its configuration directory based on 
{{argv[0]}}. Given that task-controller should be in {{$HADOOP_HOME/bin/}}, it 
will do some surgery on the program name to remove "bin/task-controller" and 
add the "conf" suffix, giving {{$HADOOP_HOME/conf}}. Unfortunately, this 
presumes that {{argv[0]}} is the absolute path to {{task-controller}}. If it's 
not, then this will result in a negative offset being calculated -- and it will 
attempt to malloc a negative amount of space, which causes a SIGSEGV. The change 
checks that task-controller was invoked with {{argv[0]}} set to the absolute 
path to task-controller, and not just something like './task-controller'.

Hadoop may invoke task-controller via its absolute path, but when I was unit 
testing it, I just ran './task-controller', which caused it to segfault. 
With the change, it exits politely instead.

bq. Now that the semantics have changed, I am not very sure we want to use the 
same configuration property for sleeping after dump-stack. (Thinking aloud..) 
Do we even need a sleep here? The signalling order is 
SIGQUIT->SIGTERM->SIGKILL. Will signals be processed in the order of their 
arrival? If so, then we will not need another sleep. If not, we may need a sleep 
here, but may or may not be driven by the same config item. What do you think?

I'm not honestly sure. I believe that the operating system will raise the 
signals to the process in order. However, what it means to "handle" a signal is 
ambiguous. It may be the case that the signal handler for SIGQUIT merely sets a 
flag which tells another thread to print the stack dumps to stdout at some 
other convenient interval (e.g., during a gc when the world is stopped). So 
while SIGQUIT may be received by the JVM before SIGTERM, I don't take that as a 
guarantee that the data will reach stdout when we want it to. So I think the 
delay is necessary. As for adding a new config item.. maybe wait til someone 
makes an issue of this? Is there any performance implication here?
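
For illustration only, here is a rough sketch (not the patch's actual code) of the 
signalling sequence being discussed, reusing the same delay between each signal:

{code}
// Illustrative sketch only -- not the patch's implementation. It shows the
// SIGQUIT -> sleep -> SIGTERM -> sleep -> SIGKILL ordering discussed above,
// invoking kill(1) on the child's pid. "delayMillis" stands in for the
// existing sleep-before-SIGKILL configuration value.
public class SignalSequenceSketch {
  static void signalSequence(String pid, long delayMillis) throws Exception {
    Runtime rt = Runtime.getRuntime();
    rt.exec(new String[] {"kill", "-QUIT", pid}).waitFor(); // JVM writes a thread dump to stdout
    Thread.sleep(delayMillis);                              // give the dump time to reach stdout
    rt.exec(new String[] {"kill", "-TERM", pid}).waitFor(); // polite termination
    Thread.sleep(delayMillis);
    rt.exec(new String[] {"kill", "-KILL", pid}).waitFor(); // force kill if still alive
  }
}
{code}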

bq. Are we sure "PSPermGen" will always be there in the dump? Instead how about 
passing our own TaskController that does custom actions in 
TaskController.dumpStacks(), simplifying our verification that dump-stack is 
indeed called?

In the Sun JVM, the stack dump messages always end with a description of the 
heap, which includes this key string. I doubt this is standardized; JRockit or 
another JVM will probably use some other string. Testing with a 
MockTaskController is probably a good idea here.
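
As a rough sketch, the test could plug in something like the following (the class name 
is made up, and it assumes the dump-stack hook this patch adds on {{TaskController}}):

{code}
// Hypothetical test-only controller -- the class name is made up and the
// dumpTaskStack(TaskControllerContext) hook is the one added by this patch.
// The test could assert that dumpedStacks is true instead of grepping task
// stdout for a JVM-specific string like "PSPermGen".
class RecordingTaskController extends DefaultTaskController {
  static volatile boolean dumpedStacks = false;

  @Override
  void dumpTaskStack(TaskControllerContext context) {
    dumpedStacks = true;
    super.dumpTaskStack(context);
  }
}
{code}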

bq. The test-time can be more than halved if we set max-map-attempts to one in 
both the tests via conf.setMaxMapAttempts(1);

Good observation. will do.

bq. We need a similar test for LinuxTaskController to test stack-dump when 
multiple users are involved.

I'll look into this.

I'll also handle the other more minor suggestions you made as well.
Thanks for the review!


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-16 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778288#action_12778288
 ] 

Vinod K V commented on MAPREDUCE-1119:
--

The patch looks very clean now! Thanks! It is very close; I have only a few 
comments on the latest patch, most of them minor:
 - Care to explain the changes to {{src/c++/task-controller/main.c}} w.r.t. 
conf_dir_len? Both for my own confirmation and for the record's sake.
 - Change the C comments for {{kill_user_task()}} in 
{{src/c++/task-controller/task-controller.c}} to mention that it can 
terminate/kill or dump-stack?
 - Now that the semantics have changed, I am not very sure we want to use the 
same configuration property for sleeping after dump-stack. (Thinking aloud..) 
Do we even need a sleep here? The signalling order is 
SIGQUIT->SIGTERM->SIGKILL. Will signals be processed in the order of their 
arrival? If so, then we will not need another sleep. If not, we may need a sleep 
here, but may or may not be driven by the same config item. What do you think?
 - All the three newly added methods in {{JvmManager}} can be package-private 
or private.
 - ProcessTree.java:
   -- Lot of refactoring. Nice!
   -- The variables SIG* and SIG*_STR can all be private, so can 
{{maybeSignalProcess()}} and {{maybeSignalProcessGroup()}} be.
 - TestJobKillAndFail
   -- Are we sure "PSPermGen" will always be there in the dump? Instead how 
about passing our own {{TaskController}} that does custom actions in 
{{TaskController.dumpStacks()}}, simplifying our verification that dump-stack 
is indeed called?
   -- The test now takes very long time. The test-time can be more than halved 
if we set max-map-attempts to one in both the tests via 
{{conf.setMaxMapAttempts(1);}}
 - We need a similar test for {{LinuxTaskController}} to test stack-dump when 
multiple users are involved. You can look at 
{{TestLocalizationWithLinuxTaskController}} and/or 
{{TestJobExecutionAsDifferentUser}} for inspiration.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777433#action_12777433
 ] 

Hadoop QA commented on MAPREDUCE-1119:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12424806/MAPREDUCE-1119.5.patch
  against trunk revision 835237.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/241/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/241/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/241/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/241/console

This message is automatically generated.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-12 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777358#action_12777358
 ] 

Vinod K V commented on MAPREDUCE-1119:
--

bq. (ie only do so in the case of an unresponsive task)
That is exactly what I intended. Let's dump the stack only when the TT calls 
{{purgeTask()}} in {{markUnresponsiveTasks()}} to kill an unresponsive task. That 
simply avoids a lot of the other changes the last patch includes.

With this, we will have the feature to dump stacks in place, and we can use it in 
the future if we wish to do the same in other cases.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-12 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777353#action_12777353
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

Todd, I think it's me who misunderstood :) I interpreted Vinod's comment 
entirely the opposite way, but it actually makes way more sense to handle it 
this way. Vinod, care to clarify your intent?

Assuming you mean that we should *never* dump stack on exception, that seems 
like a reasonable approach. I'll see what I can throw together.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-12 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777347#action_12777347
 ] 

Todd Lipcon commented on MAPREDUCE-1119:


Maybe I misunderstood, but I thought Vinod's suggestion was to "remove the 
uncertainty" by making sure it *never* dumps stack in the case of a child error 
(ie *only* do so in the case of an unresponsive task). That is what makes most 
sense to me, since the child error itself is already setting stack traces in 
diagnostic info, etc.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-12 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777341#action_12777341
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

Vinod,

I'm not sure what markUnresponsiveTasks() has to do with anything; that's not 
called from either fsError() or fatalError(), which are how the child signals 
an exception back to the TT. 

Even still, I do not believe that calling dumpTaskStack() from within 
fsError/fatalError solves the problem either. Signals are delivered 
asynchronously. Even if the signal were sent inside of fsError() before its 
call to purgeTask(), there is an arbitrary amount of time before that is 
handled by the child program. (Not to mention that purgeTask()'s logic in this 
patch is 'send SIGQUIT, wait, then SIGKILL' -- so doing the SIGQUIT before 
purgeTask() does not change the effective serialization.) Furthermore, the act 
of printing the stack trace to stderr itself takes an arbitrary amount of time. 
The child program after calling TaskUmbilicalProtocol.fsError() or fatalError() 
will still immediately exit all threads; there is no guarantee that it will 
service the signal before terminating naturally.

Based on the logic in org.apache.hadoop.mapred.Child, it is unlikely that the 
purgeTask() associated with TUP.fsError()/fatalError() is actually killing the 
task; it's probably exiting of its own accord first (otherwise, we'd be seeing 
the stack dumps since the SIGQUIT is serialized before the SIGKILL as-is). So I 
don't think this removes any nondeterminism from the equation. If we want child 
exceptions to *definitely* trigger a stack dump in the event of an exception, 
then the child should enumerate its own threads, call Thread.getStackTrace() on 
them, and print them to stderr itself to ensure that this process completes 
before exiting the JVM, without relying on asynchronous signals.
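
For example, a minimal self-contained sketch of that approach (not part of this patch):

{code}
// Minimal sketch (not part of this patch): the child enumerates its own
// threads and prints their stack traces to stderr before exiting, so the
// dump does not depend on a SIGQUIT arriving in time.
import java.util.Map;

public class SelfThreadDump {
  public static void dumpAllStacksToStderr() {
    for (Map.Entry<Thread, StackTraceElement[]> entry :
        Thread.getAllStackTraces().entrySet()) {
      System.err.println("Thread: " + entry.getKey().getName());
      for (StackTraceElement frame : entry.getValue()) {
        System.err.println("    at " + frame);
      }
    }
  }
}
{code}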



> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-12 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776896#action_12776896
 ] 

Vinod K V commented on MAPREDUCE-1119:
--

We can avoid the uncertainty of dumping the stack in the case of a Child exception by 
directly calling {{taskController.dumpTaskStack(context)}} from inside 
{{TaskTracker.markUnresponsiveTasks()}} immediately before 
{{purgeTask(tip, true)}}. This will create the dump only when it is absolutely 
needed.

Would that work? To construct the context, you may need a bridging method 
inside {{JvmManager}} which can itself call 
{{taskController.dumpTaskStack(context)}}.
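
To make the suggestion concrete, a rough sketch of the call site (the {{JvmManager}} 
bridge and the loop details here are only illustrative, not actual TaskTracker code):

{code}
// Rough sketch only, not actual TaskTracker code: the loop condition and the
// JvmManager bridge method are illustrative. The point is that the stack is
// dumped only when the TT is about to purge an unresponsive task.
synchronized void markUnresponsiveTasks() {
  for (TaskInProgress tip : runningTasks.values()) {
    if (tip.hasTimedOut()) {
      // JvmManager builds the TaskControllerContext and calls
      // taskController.dumpTaskStack(context)
      jvmManager.dumpTaskStack(tip);
      purgeTask(tip, true);  // then terminate/kill the task as before
    }
  }
}
{code}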

That will make the patch a lot simpler and will avoid many otherwise avoidable 
changes to {{JvmManager}}/{{TaskController}}. If you wish, I can upload a 
demonstration patch. Thoughts?

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-11 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776773#action_12776773
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

The error is in the gridmix contrib tests.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776771#action_12776771
 ] 

Hadoop QA commented on MAPREDUCE-1119:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12424659/MAPREDUCE-1119.4.patch
  against trunk revision 834284.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/238/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/238/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/238/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/238/console

This message is automatically generated.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-11 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776701#action_12776701
 ] 

Todd Lipcon commented on MAPREDUCE-1119:


+1 pending Hudson QA results on the latest patch (I don't see why it wouldn't be 
fine, given it's just javadoc changes since the last one).

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-11 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776687#action_12776687
 ] 

Todd Lipcon commented on MAPREDUCE-1119:


Hey Aaron,

Thanks for doing the investigation to fill in the table.

The behavior as you've described it all seems pretty reasonable - I wish the 
"maybe" were "false", but it sounds like it does require a pretty significant 
overhaul of failure tracking throughout the TT/TaskManager/etc, so I'd be 
inclined to say it's out of scope.

So, +0.5 from me on the current patch. Full +1 if some of the above explanation 
could be transformed into javadoc on the wasFailure parameter - probably just 
the full explanation where it first appears, and then a "@see 
FirstAppearanceClass.someMethodName" for that parameter elsewhere, so we don't 
duplicate the explanation.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-11 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776648#action_12776648
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

I modeled this parameter after the fact that TaskTracker was already using this 
same name (see {{setTaskFailState()}}, {{jobHasFinished()}}, {{kill()}}, 
{{cleanUpOverMemoryTask()}}) to indicate whether a kill was failure-based or 
for other purposes (cleanup/preemption/etc).

I think there is perhaps a more systemic overhaul of failure reason tracking 
that should occur as a separate issue?

As for your table... if you look at {{TaskController.destroyTaskJVM()}} (line 
151), you can see that generates-stack is true iff {{wasFailure}} is true.

I ran some tests using a sleep job that slept for 60 seconds in each call 
to {{map()}}. Results follow:

|*Test case*|*Stack dump?*|
|set {{mapreduce.task.timeout}} to 1 (task timeout)|yes|
|ran {{bin/mapred job -kill-task}} on attempts|no|
|ran {{bin/mapred job -fail-task}} on attempts|no|
|Let it complete successfully|no|
|ran {{bin/mapred job -kill}} on the job itself.|no|
|threw a RuntimeException in the mapper|no|

Thus, I believe that translates into the following for your table:

|*Reason*|*wasFailure*|*generates stack*|
|Child exception|maybe|maybe*|
|Other task failures|false|false|
|Task timeout|true|true|
|Task killed by user|false|false|
|Task failed by user|false|false|
|Job killed by user|false|false|

Looking at {{org.apache.hadoop.mapred.Child}}, there are a few different catch 
blocks in there:
* If a task throws a {{FSError}}, this triggers 
{{TaskUmbilicalProtocol.fsError()}}, which will cause a {{purgeTask(tip, 
wasFailure=true)}}.
* If a task throws any other sort of {{Exception}}, this does not trigger a 
particular response to the TUP; the exception string is passed to 
{{TaskTracker.reportDiagnosticInfo()}}, but this simply logs a string of text 
and takes no further action.
* If a map task throws any other {{Throwable}}, this triggers 
{{TUP.fatalError()}}, which also calls {{purgeTask(tip, wasFailure=true)}}.

But immediately after these catch blocks, it closes the RPC proxy, shuts down 
the logging thread, and exits the JVM. So fsError and fatalError *may* cause a 
stack dump if the TT processes the request fast enough and issues a SIGQUIT in 
the next few microseconds. But this is racing against the fact that the child 
task's next action is "exit immediately."

Note that job kill, task timeout, and the task exception cases are all covered 
in the unit test provided in this patch.


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-11 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776606#action_12776606
 ] 

Todd Lipcon commented on MAPREDUCE-1119:


New patch looks better. A couple of changes before I think it's ready for commit:
- The "wasFailure" boolean isn't terribly clear. I'd like to either see some 
more javadoc "@param" tags, or replace it with a KillReason enum (a rough sketch of 
such an enum follows below). Basically I want to be able to fill in this table:
|*Reason*|*wasFailure*|*generates stack*|
|Child threw exception|?|?|
|Job failed due to other tasks failing|?|?|
|Task timed out|?|?|
|Task killed by user (incl preemption)|?|?|
|Job killed by user|?|?|

Perhaps I'm just being dense, but the above isn't easy to understand from the 
existing code. If this was a "preexisting condition" (sorry!) then maybe it's 
out of scope for this JIRA.
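
For concreteness, a rough sketch of what such a KillReason enum might look like, based 
on the rows above (the names are illustrative only):

{code}
// Hypothetical KillReason enum, sketched from the table above; the names are
// illustrative and not part of the existing code or the patch.
public enum KillReason {
  CHILD_EXCEPTION,      // child threw an exception (fsError/fatalError)
  OTHER_TASK_FAILURES,  // job failed because other tasks failed
  TASK_TIMEOUT,         // task failed to report status in time
  KILLED_BY_USER,       // task killed by user (incl. preemption)
  JOB_KILLED_BY_USER    // whole job killed by user
}
{code}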


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-10 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776303#action_12776303
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

Seems as though the release audit warning checker is broken...

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message




[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776292#action_12776292
 ] 

Hadoop QA commented on MAPREDUCE-1119:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12424559/MAPREDUCE-1119.3.patch
  against trunk revision 834284.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 159 release audit warnings 
(more than the trunk's current 0 warnings).

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/console

This message is automatically generated.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-10 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776076#action_12776076
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

Seems so; TaskRunner and JvmManager are both package-visible classes, so that 
makes them MapReduce-internal. (These classes are not yet annotated with our 
API classifications, though.)


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-10 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776008#action_12776008
 ] 

Doug Cutting commented on MAPREDUCE-1119:
-

> Is all this worth it just to avoid stack dumps in aborted speculative task 
> attempts?

Probably so, although I've not looked at the details.  A task timing out is a 
bad thing.  A speculative task getting killed is normal.  One should be able to 
easily scan a job's log for bad things.

These are all internal APIs, right?  If so, it shouldn't create problems to 
change them, should it?


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-10 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776003#action_12776003
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

Actually, I suppose that if the kill comes from the JT, then it's definitely a 
speculative task attempt, right? Task attempt timeouts are handled between the 
attempt and the TT; the JT isn't involved at all.

In the event of a timeout, markUnresponsiveTasks() calls 
TaskTracker.purgeTask(tip, wasFailure=true), which calls 
tip.jobHasFinished(wasFailure), which in turn calls tip.kill(wasFailure).

Unfortunately, this is where the trail of failure/non-failure information about 
why the task should be killed ends. tip.kill() calls TaskRunner.kill(), which 
calls JvmManager.taskKilled(this), which calls 
JvmManagerForType.taskKilled(taskRunner), which calls 
JvmManagerForType.killJvm(jvmId), which calls JvmRunner.kill(), which calls 
TaskController.destroyTaskJvm(TaskControllerContext). (Someone please correct 
me if I'm wrong.)

But TaskRunner.kill() doesn't get a reason code like wasFailure. That could be 
changed, but then we'd also need to modify JvmManager and add a 
synchronized/volatile hand-off of this data into the TaskControllerContext 
object. Is all this worth it just to avoid stack dumps in aborted speculative 
task attempts?
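
For concreteness, here is a minimal, self-contained sketch of what threading 
that reason code through might look like. This is illustrative only: the class 
names below are made-up stand-ins for TaskRunner/JvmManager/TaskControllerContext, 
not the actual Hadoop code.

{code}
// Hypothetical sketch only: a wasFailure flag carried from the kill request
// down to the point where we'd decide whether to SIGQUIT for a stack dump.
public class KillReasonSketch {

  static class Context {
    // volatile so the thread that destroys the JVM sees the reason set by
    // the thread that requested the kill
    volatile boolean wasFailure;
  }

  static class JvmRunnerSketch {
    final Context context = new Context();

    void kill(boolean wasFailure) {
      context.wasFailure = wasFailure;   // hand-off into the context object
      destroyTaskJvm(context);
    }

    void destroyTaskJvm(Context ctx) {
      if (ctx.wasFailure) {
        System.out.println("timeout kill: send SIGQUIT first to get a stack dump");
      } else {
        System.out.println("ordinary kill (e.g. lost speculative attempt): no dump");
      }
      System.out.println("then SIGKILL");
    }
  }

  public static void main(String[] args) {
    new JvmRunnerSketch().kill(true);    // unresponsive task
    new JvmRunnerSketch().kill(false);   // redundant speculative attempt
  }
}
{code}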


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-10 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775995#action_12775995
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

Touche.

Does the jobtracker send information to the tasktracker, though, informing it 
of whether it is a timeout-based kill or simply a redundant speculative task 
attempt?


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-10 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775958#action_12775958
 ] 

Doug Cutting commented on MAPREDUCE-1119:
-

> receiving a SIGKILL implies that something has gone wrong in the process.

How are speculatively executed tasks terminated?


> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-09 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775200#action_12775200
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

To address all three reviews above:

* I'll rename the constant to SIGQUIT_TASK_JVM; understandability overrides 
consistency. "CORE_DUMP_JVM" is also a misnomer; if anything, the proper 
alternative would be "STACK_DUMP_JVM".
* I'll adapt the logic to go SIGQUIT, SIGKILL instead of SIGQUIT, SIGTERM, 
SIGKILL, reusing the existing constant (sleepTimeBeforeSigkill).
* ProcessTree and DefaultTaskController both contain a lot of duplicated code; 
if people are okay with it, I'm +1 on condensing a lot of the logic. 
The initial version of this patch didn't modify anything extraneous, but since 
the earlier versions of the patch make its intent clear, I'll write a broader 
one that cleans up the other code "in the neighborhood."
* I'll rename the current logic in {{finishTask()}} to {{sendSignal()}}; we can 
preserve a method named {{finishTask()}} that specifically sends SIGKILL using 
{{sendSignal()}}.


| This currently causes stack traces for all killed tasks, right? I don't 
personally have a problem with that, but the description of the JIRA indicates 
that only those due to failing to report status will dump their stack, and it's 
worth noting the difference.

I've tried tracing backwards through the code to figure out what triggers JVM 
kills, but it involves a lengthy chain of methods. What other operations are 
there (besides task timeout) which wind up killing the task? The 
{{TaskController.destroyTaskJVM()}} method is only called from 
{{JvmManager.kill()}}, which itself receives no information as to why it's 
killing the JVM in question. (There are no arguments to this method; there's no 
flag in there to indicate that the JVM under management has experienced a 
timeout.)

My claim is that actually receiving a SIGKILL implies that something has 
gone wrong in the process. Tasks that clean up "politely" do not get the 
SIGKILL, and thus also do not get the SIGQUIT/stack dump.  So if something's 
gone wrong, then in my mind, we should capture the stack trace for debugging 
purposes.

Otherwise, we will need to modify the JvmManager API to be more precise about 
the nature of kills. I suppose I could live with that, but someone else needs 
to point me to where the information about this lives. So this would add 
complexity, and in my mind, be less useful.

* As for sigQuitProcessGroup(): I modeled the logic in dumpTaskStack() after 
that of all the other signalling methods. It's conceivable that a user's task 
process hangs because it spawns subprocesses and then gets into a deadlock over 
IPC between the task and its subprocesses. I think the better question is: 
is there a good reason *not* to send SIGQUIT to the entire child process group 
(e.g., substantial overhead, especially overhead that blocks the TT)? I don't 
think there's any more overhead here on the TT's side than sending SIGTERM to 
the process group, which we already do. If we maintain the same logical 
structure in dumpTaskStack as we do in killTask, etc., then we can refactor this 
code into a considerably more condensed form with better code sharing. 
Otherwise, dumpTaskStack will have to remain a special case.
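
For concreteness, here is a rough sketch of the SIGQUIT-then-SIGKILL sequence 
described above, shelling out to kill(1). The pid handling and the sleep value 
are assumptions for illustration; sleepTimeBeforeSigkill below stands in for 
the existing configuration constant, and this is not the patch's actual code.

{code}
import java.io.IOException;

// Hypothetical sketch: dump the hung task's stacks with SIGQUIT, wait,
// then SIGKILL it (no SIGTERM in between).
public class DumpThenKillSketch {

  static void signal(String pid, String sig) throws IOException, InterruptedException {
    // prefixing the pid with "-" would target the whole process group instead
    new ProcessBuilder("kill", "-" + sig, pid).inheritIO().start().waitFor();
  }

  public static void main(String[] args) throws Exception {
    String pid = args[0];                  // pid of the unresponsive task JVM
    long sleepTimeBeforeSigkill = 5000L;   // assumed value for the existing constant

    signal(pid, "QUIT");                   // JVM prints its thread dump to stdout
    Thread.sleep(sleepTimeBeforeSigkill);  // give the JVM time to finish the dump
    signal(pid, "KILL");                   // then kill, skipping SIGTERM
  }
}
{code}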




> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-09 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774916#action_12774916
 ] 

Hemanth Yamijala commented on MAPREDUCE-1119:
-

bq. LinuxTaskController.finishTask is now sort of a misnomer, since you're 
using it to send SIGQUIT. Maybe rename to sendKillSignal or something?

signalTask? The task-controller was modeled along the lines of launch/kill a JVM, 
initialize/finish a job, initialize/finish a task, etc. It would be good to 
maintain those abstractions. I haven't looked at the patch, but can we make 
signalTask an addition to the existing APIs? The implementation could of course 
reuse code. Thoughts?
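
Roughly what I have in mind (illustrative only; this is not the actual 
TaskController API, and the enum and method names below are made up for the 
example):

{code}
// Hypothetical sketch: a general signalTask added alongside the existing
// lifecycle-style APIs, which then become thin wrappers over it.
public abstract class TaskControllerSketch {

  enum TaskSignal { QUIT, TERM, KILL }

  /** New, general entry point: deliver an arbitrary signal to the task JVM. */
  abstract void signalTask(String user, String pid, TaskSignal signal);

  /** Existing-style API, reusing the shared signalling code. */
  void finishTask(String user, String pid) {
    signalTask(user, pid, TaskSignal.KILL);
  }

  /** Stack-dump path goes through the same mechanism. */
  void dumpTaskStack(String user, String pid) {
    signalTask(user, pid, TaskSignal.QUIT);
  }
}
{code}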

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-11-09 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774910#action_12774910
 ] 

Vinod K V commented on MAPREDUCE-1119:
--

bq. This currently causes stack traces for all killed tasks, right? I don't 
personally have a problem with that, but the description of the JIRA indicates 
that only those due to failing to report status will dump their stack, and it's 
worth noting the difference.
We shouldn't be doing this. A dump is needed only for tasks that the TaskTracker 
forcefully kills because it thinks the task in question is hung.

bq. in destroyTaskJVM, you sleep for sleeptimeBeforeSigkill in between the 
SIGQUIT and the SIGKILL. This seems wrong - it should probably be a different 
timeout.
That timeout was added so that the process can do whatever cleanup it wishes to 
do before a SIGKILL arrives, so I think it's fine to use the same 
configuration. But I think we should just skip SIGTERM and go directly to 
SIGKILL after SIGQUIT. We know for sure the task is hung and we are forcibly 
killing it, so skipping SIGTERM in this case avoids further waiting before the 
SIGKILL.

bq. I'd prefer SIGQUIT_TASK_JVM rather than QUIT_TASK_JVM for clarity's sake. 
It's a little less consistent, but more obvious for people reading the code 
later on.
+1. Either that or CORE_DUMP_TASK_JVM. No bias though.

bq. LinuxTaskController.finishTask is now sort of a misnomer, since you're 
using it to send SIGQUIT. Maybe rename to sendKillSignal or something?
+1

I have one more point. We need a dump of the task JVM itself, and (I think) not 
of the JVM's child processes(?). So no sigQuitProcessGroup()? Thoughts?
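
To illustrate the distinction (a sketch only; the pid handling and the kill(1) 
shell-out are assumptions for the example, not the patch's code):

{code}
import java.io.IOException;

// Hypothetical sketch: SIGQUIT aimed at just the task JVM versus at its
// whole process group (the latter is what sigQuitProcessGroup() would do).
public class SigQuitTargetSketch {

  static void sigQuit(String target) throws IOException, InterruptedException {
    new ProcessBuilder("kill", "-QUIT", target).inheritIO().start().waitFor();
  }

  public static void main(String[] args) throws Exception {
    String jvmPid = args[0];
    sigQuit(jvmPid);          // dump only the task JVM's threads
    // sigQuit("-" + jvmPid); // alternative: the entire process group
  }
}
{code}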

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-10-23 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769539#action_12769539
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
--

Contrib failure is unrelated

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-10-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769538#action_12769538
 ] 

Hadoop QA commented on MAPREDUCE-1119:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12423061/MAPREDUCE-1119.2.patch
  against trunk revision 828979.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/206/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/206/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/206/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/206/console

This message is automatically generated.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1119) When tasks fail to report status, show tasks's stack dump before killing

2009-10-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769421#action_12769421
 ] 

Hadoop QA commented on MAPREDUCE-1119:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423057/MAPREDUCE-1119.patch
  against trunk revision 828979.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/205/console

This message is automatically generated.

> When tasks fail to report status, show tasks's stack dump before killing
> 
>
> Key: MAPREDUCE-1119
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.