[ 
https://issues.apache.org/jira/browse/HADOOP-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj Das updated HADOOP-4595:
--------------------------------

    Attachment: 4595.patch

This patch fixes a race condition in updating the free slot count under high 
load (which led to lost TTs). When a TT reinitializes, the TaskLauncher object 
is created again. A task that is still running and is slow to exit might then 
increment the free slots of the new TaskLauncher object. This would lead to the 
behavior Aaron described in the bug report. The patch fixes this by moving all 
the code that increments free slots into one method; the increment is now done 
inline in TaskInProgress.kill.
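
To illustrate the stale-object race described above, here is a minimal sketch. 
It is not the code in 4595.patch: Launcher and Task are hypothetical stand-ins 
for TaskLauncher and TaskInProgress, and the technique shown (each task 
remembers the launcher it took its slot from and releases that slot at most 
once, through a single method) is an assumption about one way to close such a 
race.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-ins for illustration; these are not the Hadoop classes.
final class FreeSlotSketch {

    /** Tracks the free slots of one tracker "incarnation". */
    static final class Launcher {
        private final int maxSlots;
        private int freeSlots;

        Launcher(int maxSlots) {
            this.maxSlots = maxSlots;
            this.freeSlots = maxSlots;
        }

        /** Takes a slot if one is free; returns false otherwise. */
        synchronized boolean takeSlot() {
            if (freeSlots == 0) {
                return false;
            }
            freeSlots--;
            return true;
        }

        /** The single place where the free slot count is incremented. */
        synchronized void addFreeSlot() {
            if (freeSlots >= maxSlots) {
                throw new IllegalStateException("free slots would exceed " + maxSlots);
            }
            freeSlots++;
        }

        synchronized int freeSlots() {
            return freeSlots;
        }
    }

    /** A task that remembers which launcher its slot came from. */
    static final class Task {
        private final Launcher owner;
        private final AtomicBoolean released = new AtomicBoolean(false);

        Task(Launcher owner) {
            this.owner = owner;
        }

        /** Safe to call from several exit paths; only the first call counts. */
        void releaseSlot() {
            if (released.compareAndSet(false, true)) {
                owner.addFreeSlot();
            }
        }
    }
}
```

With this shape, a task launched before a reinit gives its slot back to the old 
Launcher even if it exits after a new Launcher has been created, so the new 
object's count is never pushed past its maximum.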

In addition, the patch fixes a race condition in starting the MapEventsFetcher 
thread. The thread enters its loop only after checking the TaskTracker.running 
flag. However, when a TT reinitializes, the running field is set to true only 
after the thread has been spawned. If the thread is scheduled immediately, it 
finds running false and exits. This would lead to hung reduces.
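
The ordering problem above can be sketched as follows. This is only an 
illustration under stated assumptions, not the code in the patch: 
FetcherStartSketch and its members are hypothetical names, and setting the flag 
before calling Thread.start() is one common way to close this kind of window. 
Thread.start() establishes a happens-before edge, so the new thread is 
guaranteed to see running == true on its first check.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not the TaskTracker or MapEventsFetcher code.
final class FetcherStartSketch {
    private volatile boolean running = false;
    private final CountDownLatch firstPass = new CountDownLatch(1);
    private Thread fetcher;

    /** Sets the flag first, then spawns the thread that polls it. */
    void start() {
        running = true; // must happen before start(), or the loop may never run
        fetcher = new Thread(() -> {
            while (running) {
                firstPass.countDown(); // records that the loop body was reached
                try {
                    Thread.sleep(10); // placeholder for the real fetch work
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "fetcher-sketch");
        fetcher.setDaemon(true);
        fetcher.start();
    }

    /** Returns true if the thread entered its loop within the timeout. */
    boolean enteredLoop(long timeoutMillis) {
        try {
            return firstPass.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Clears the flag and waits briefly for the thread to finish. */
    void stop() {
        running = false;
        fetcher.interrupt();
        try {
            fetcher.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If the two statements in start() were swapped so that the flag was set after 
fetcher.start(), the thread could observe running == false on its first check 
and exit without ever fetching.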

I also cleaned up some code related to TIP.cleanup during a task launch.


> JVM Reuse triggers RuntimeException("Invalid state")
> ----------------------------------------------------
>
>                 Key: HADOOP-4595
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4595
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Aaron Kimball
>            Assignee: Devaraj Das
>             Fix For: 0.19.0
>
>         Attachments: 4595.patch
>
>
> A Reducer triggers the following exception:
> 08/11/05 08:58:50 INFO mapred.JobClient: Task Id : attempt_200811040110_0230_r_000008_1, Status : FAILED
> java.lang.RuntimeException: Inconsistent state!!! JVM Manager reached an unstable state while reaping a JVM for task: attempt_200811040110_0230_r_000008_1 Number of active JVMs:2
>  JVMId jvm_200811040110_0230_r_-735233075 #Tasks ran: 0 Currently busy? true Currently running: attempt_200811040110_0230_r_000012_0
>  JVMId jvm_200811040110_0230_r_-1716942642 #Tasks ran: 0 Currently busy? true Currently running: attempt_200811040110_0230_r_000040_0
>    at java.lang.Throwable.<init>(Throwable.java:67)
>    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.reapJvm(JvmManager.java:245)
>    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.access$000(JvmManager.java:113)
>    at org.apache.hadoop.mapred.JvmManager.launchJvm(JvmManager.java:78)
>    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:410)
>
> Other clues:
> In all three reduce task attempts where this was observed, the failing 
> attempt was attempt _1. Attempt _0 had started and eventually switched to 
> "SUCCEEDED", so I think this happens only on speculatively executed reduce 
> task attempts. The reduce output (part-XXXXX) gets lost when this attempt 
> fails, even though the other (earlier) attempt succeeded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
