Large # of tasks failing at one time can effectively hang the JobTracker
------------------------------------------------------------------------
Key: HADOOP-3120
URL: https://issues.apache.org/jira/browse/HADOOP-3120
Project: Hadoop Core
Issue Type: Bug
Environment: Linux/Hadoop-15.3
Reporter: Pete Wyckoff
Priority: Minor
We think that JobTracker.removeMarkedTasks does so much logging when this
happens (i.e., logging thousands of failed tasks per cycle) that nothing else
can proceed (since it is called from a synchronized method). By the next
cycle, the next wave of jobs has failed and we again have tens of thousands of
failures to log, and so on.
At least, the above is what we observed: a continual printing of those
failures and nothing else happening. The original jobs may ultimately fail,
but new jobs keep coming in and perpetuate the problem.
This has happened to us a number of times, and since we commented out the
log.info call in that method we haven't had any problems, although thousands
and thousands of task failures are hopefully not that common.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.