[
https://issues.apache.org/jira/browse/HADOOP-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun C Murthy updated HADOOP-1332:
----------------------------------
Attachment: HADOOP-1332_1_20070529.patch
Nigel, could you try this patch and see if things improve?
Basically I've made both stdout & stderr run in separate threads (previously
stdout was run in the main thread) and seems like that fixes a weird case in
Windows where when the task dies the stdout stream doesn't get shut down
cleanly and hence the TaskTracker couldn't clean out the task... which easily
could lead to the issues you've run into.
Devaraj & I could run TestMiniMRLocalFS almost 100 times continuously with this
patch without issues; whereas previously it would hang in the same scenario.
> Sporadic unit test failures (TestMiniMRClasspath, TestMiniMRLocalFS,
> TestMiniMRDFSCaching)
> ------------------------------------------------------------------------------------------
>
> Key: HADOOP-1332
> URL: https://issues.apache.org/jira/browse/HADOOP-1332
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.13.0
> Reporter: Nigel Daley
> Assignee: Arun C Murthy
> Attachments: HADOOP-1332_1_20070529.patch
>
>
> Since April 22 I've been seeing sporadic failures of these tests on Windows:
> - TestMiniMRClasspath
> - TestMiniMRLocalFS
> - TestMiniMRDFSCaching
> The change log since then can be viewed starting at build 66:
> http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/66/
> The tests fail because they timeout.
> They timeout because one of the task trackers doesn't go idle.
> One of the task trackers doesn't go idle because a map fails and has to be
> killed.
> Reordered and annotated tests logs (my comments are in parenthesis):
> (map 0 executes and logs a 'done' and two 'completed' messages):
> [junit] 2007-05-03 21:19:40,516 INFO mapred.JobInProgress
> (JobInProgress.java:findNewTask(653)) - Choosing cached task tip_0001_m_000000
> [junit] 2007-05-03 21:19:40,516 INFO mapred.JobTracker
> (JobTracker.java:createTaskEntry(758)) - Adding task 'task_0001_m_000000_0'
> to tip tip_0001_m_000000, for tracker 'tracker_task000.com:2893'
> [junit] 2007-05-03 21:19:40,516 INFO mapred.TaskTracker
> (TaskTracker.java:startNewTask(1071)) - LaunchTaskAction: task_0001_m_000000_0
> [junit] 2007-05-03 21:19:45,655 INFO mapred.TaskTracker
> (TaskTracker.java:reportProgress(1284)) - task_0001_m_000000_0 1.0%
> hdfs://localhost:2882/testing/ext/input/part-0:0+10
> [junit] 2007-05-03 21:19:46,201 INFO mapred.TaskTracker
> (TaskTracker.java:reportProgress(1284)) - task_0001_m_000000_0 1.0%
> hdfs://localhost:2882/testing/ext/input/part-0:0+10
> [junit] 2007-05-03 21:19:46,201 INFO mapred.TaskTracker
> (TaskTracker.java:reportDone(1334)) - Task task_0001_m_000000_0 is done.
> [junit] 2007-05-03 21:19:46,357 INFO mapred.JobInProgress
> (JobInProgress.java:completedTask(734)) - Task 'task_0001_m_000000_0' has
> completed tip_0001_m_000000 successfully.
> [junit] 2007-05-03 21:19:46,357 INFO mapred.TaskInProgress
> (TaskInProgress.java:completedTask(475)) - Task 'task_0001_m_000000_0' has
> completed.
> (map 2 executes and logs a 'done' message but no 'completed' messages. It
> is eventually killed):
> [junit] 2007-05-03 21:19:40,594 INFO mapred.JobInProgress
> (JobInProgress.java:findNewTask(653)) - Choosing cached task tip_0001_m_000002
> [junit] 2007-05-03 21:19:40,594 INFO mapred.JobTracker
> (JobTracker.java:createTaskEntry(758)) - Adding task 'task_0001_m_000002_0'
> to tip tip_0001_m_000002, for tracker 'tracker_task002.com:2902'
> [junit] 2007-05-03 21:19:40,594 INFO mapred.TaskTracker
> (TaskTracker.java:startNewTask(1071)) - LaunchTaskAction: task_0001_m_000002_0
> [junit] 2007-05-03 21:19:46,295 INFO mapred.TaskTracker
> (TaskTracker.java:reportProgress(1284)) - task_0001_m_000002_0 0.0%
> hdfs://localhost:2882/testing/ext/input/part-0:20+10
> [junit] 2007-05-03 21:19:46,310 INFO mapred.TaskTracker
> (TaskTracker.java:reportDone(1334)) - Task task_0001_m_000002_0 is done.
> ...
> [junit] 2007-05-03 21:29:52,957 INFO mapred.TaskTracker
> (TaskTracker.java:markUnresponsiveTasks(909)) - task_0001_m_000002_0: Task
> failed to report status for 606 seconds. Killing.
> (long thread dump)
> (shutting down MiniMRCluster)
> [junit] Waiting for task tracker tracker_task002.com:2902 to be idle.
> [junit] Waiting for task tracker tracker_task002.com:2902 to be idle.
> [junit] Waiting for task tracker tracker_task002.com:2902 to be idle.
> [junit] Waiting for task tracker tracker_task002.com:2902 to be idle.
> [junit] Waiting for task tracker tracker_task002.com:2902 to be idle.
> [junit] Waiting for task tracker tracker_task002.com:2902 to be idle.
> ... (repeated until the test times out)
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.