[
https://issues.apache.org/jira/browse/HADOOP-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522959
]
Devaraj Das commented on HADOOP-1763:
-------------------------------------
Enis, are you saying that the Jetty issue causes lost tasktrackers? That doesn't
seem to be the case, since the tasktrackers are, in fact, sending heartbeats
and hence won't be marked as lost.
So I think Srikanth is not hitting HADOOP-1158. IMHO we might be hitting a
scalability issue - it might help in the short term to see how things behave
with an increased number of JobTracker RPC server handlers. In the code, the
number of RPC handlers is hardcoded to 10. It would be interesting to see
whether things improve with 40.
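The short-term experiment above would be easier if the handler count were read from configuration instead of being hardcoded. A minimal self-contained sketch of that lookup, using a plain java.util.Properties stand-in for the Hadoop Configuration object (the property name mapred.job.tracker.handler.count is an assumption for illustration, not a committed fix):

```java
import java.util.Properties;

public class HandlerCountSketch {
    // Hypothetical: resolve the JobTracker RPC handler count from
    // configuration, falling back to the current hardcoded default of 10.
    static int handlerCount(Properties conf) {
        return Integer.parseInt(
            conf.getProperty("mapred.job.tracker.handler.count", "10"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(handlerCount(conf));   // default

        // Override to 40 for the scalability experiment.
        conf.setProperty("mapred.job.tracker.handler.count", "40");
        System.out.println(handlerCount(conf));
    }
}
```

The resolved value would then be passed to the RPC server constructor in place of the literal 10.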
> Too many lost task trackers - Job failures
> ------------------------------------------
>
> Key: HADOOP-1763
> URL: https://issues.apache.org/jira/browse/HADOOP-1763
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Environment: Version: 0.15.0-dev, r565628
> Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa
> 1400 Node cluster
> Reporter: Srikanth Kakani
> Priority: Blocker
>
> Steps to reproduce:
> 1. Run a map/reduce application with more than 3000 mappers, each running
> longer than
> 2. Observe the lost task trackers.
> Observations:
> 1. Most of the lost-taskTracker messages correspond to maps that have already
> completed.
> 2. Based on the logs below, the taskTracker is unable to connect to the
> jobTracker, so the jobTracker deletes the job after 20 minutes.
> One example:
> task_200708210155_0003_m_000000_0 node1 KILLED 0.00%
> 21-Aug-2007 09:39:09 Lost task tracker <-- Please note the time
> Counters:
> Map-Reduce Framework
> Map input records 28,861
> Map output records 1,349,114
> Map input bytes 200,018,562
> Map output bytes 714,878,712
> Node 1 task tracker logs:
> 2007-08-21 09:08:51,109 INFO org.apache.hadoop.mapred.TaskTracker: Task
> task_200708210155_0003_m_000000_0 is done. <-- Please note the time
> .
> .
> .
> 2007-08-21 09:08:52,212 INFO org.mortbay.http.SocketListener: LOW ON THREADS
> ((40-40+0)<1) on [EMAIL PROTECTED]:50060
> 2007-08-21 09:08:52,217 WARN org.mortbay.http.SocketListener: OUT OF THREADS:
> [EMAIL PROTECTED]:50060
> .
> .
> .
> 2007-08-21 09:18:53,877 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
> exception: java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:472)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
> at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
> at
> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:941)
> at
> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:840)
> at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1227)
> at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1911)
> .
> .
> .
> 2007-08-21 09:47:45,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending
> 'status' to 'wm501219' with reponseId '5247
> 2007-08-21 09:47:46,023 INFO org.apache.hadoop.mapred.TaskTracker: Recieved
> RenitTrackerAction from JobTracker
> 2007-08-21 09:47:46,041 INFO org.apache.hadoop.mapred.TaskRunner:
> task_200708210155_0003_m_000000_0 done; removing files.
> 2007-08-21 09:47:46,240 INFO org.apache.hadoop.mapred.TaskRunner:
> task_200708210155_0003_m_002237_0 done; removing files.
> Tasktracker is pretty active otherwise:
> tracker_wm511293.inktomisearch.com:50050 wm511293.inktomisearch.com
> 1 6 3
> JobTracker logs:
> 2007-08-21 09:01:11,951 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> 'task_200708210155_0003_m_000000_0' to tip tip_200708210155_0003_m_000000,
> for tracker 'tracker_wm511293.inktomisearch.com:50050'
> .
> 2007-08-21 09:06:27,745 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> 'task_200708210155_0003_m_000000_1' to tip tip_200708210155_0003_m_000000,
> for tracker 'tracker_wm511783.inktomisearch.com:50050'
> .
> 2007-08-21 09:08:51,212 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'task_200708210155_0003_m_000000_0' has completed
> tip_200708210155_0003_m_000000 successfully.
> 2007-08-21 09:08:51,213 INFO org.apache.hadoop.mapred.TaskInProgress: Task
> 'task_200708210155_0003_m_000000_0' has completed succesfully
> .
> 2007-08-21 09:11:27,227 INFO org.apache.hadoop.mapred.TaskInProgress: Already
> complete TIP tip_200708210155_0003_m_000000 has completed task
> task_200708210155_0003_m_000000_1
> .
> 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from task_200708210155_0003_m_000000_0: Lost task tracker
> 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Task
> 'task_200708210155_0003_m_000000_0' has been lost.
> .
> 2007-08-21 09:39:09,348 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'task_200708210155_0003_m_000000_0' from
> 'tracker_wm511293.inktomisearch.com:50050'
> .
> 2007-08-21 09:47:20,855 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from task_200708210155_0003_m_000000_1: Lost task tracker
> 2007-08-21 09:47:20,855 WARN org.apache.hadoop.mapred.TaskInProgress:
> Recieved duplicate status update of 'KILLED' for
> 'task_200708210155_0003_m_000000_1' of TIP 'tip_200708210155_0003_m_000000'
> Notes:
> 1. I do not see the taskTracker dying during that period.
> 2. Is the retry logic not accurate/aggressive enough? (Did something change
> recently? This behavior is more evident in 0.15.)
> 3. Are there inconsistencies with the jobTracker logs? Is lost-task-tracker
> detection flawed?
> 4. TaskTracker:
> CPU usage: 9:10-9:20 50%
> 9:20-9:40 0%
> Network Usage: 6M incl dfs operations
> 5. JobTracker
> CPU usage: Avg: 9%
> Network Usage: Negligible
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.