Large # of tasks failing at one time can effectively hang the JobTracker
------------------------------------------------------------------------

                 Key: HADOOP-3120
                 URL: https://issues.apache.org/jira/browse/HADOOP-3120
             Project: Hadoop Core
          Issue Type: Bug
         Environment: Linux/Hadoop-15.3
            Reporter: Pete Wyckoff
            Priority: Minor


We think that JobTracker.removeMarkedTasks does so much logging when this 
happens (i.e., logging thousands of failed tasks per cycle) that nothing else 
can proceed, since it is called from a synchronized method. By the next cycle, 
the next wave of jobs has failed and we again have tens of thousands of 
failures to log, and so on.

At least, that is what we observed: a continual printing of those failures, 
with nothing else happening. The original jobs may ultimately have failed, but 
new jobs come in and perpetuate the problem.

This has happened to us a number of times, and since we commented out the 
LOG.info call in that method we have not had any problems, although thousands 
and thousands of task failures are hopefully not that common.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
