#1 Check that the CPU fan is working.  A hot CPU can give flaky errors, especially
during high CPU load.
#2 Run memtest on the machine.  You might have a bad memory stick that is
getting hit (though I would expect memory errors to look a bit more random).
I've used memtest86 before to find such problems.
http://www.memtest86.com
#3 Check the disk.  Hopefully smartctl is on your system, which will tell you
whether any disk errors are occurring.  Otherwise use the manufacturer's disk
testing tool.  (Example commands for all three checks are below.)
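
For reference, here's roughly what I'd run on the suspect node.  This is only
a sketch: it assumes lm-sensors, memtester and smartmontools are installed,
and /dev/sda is just a placeholder for whatever disk the node actually uses.

    # CPU temperatures and fan speeds (needs lm-sensors configured)
    sensors

    # Quick in-OS memory test (1 GB, 3 passes); booting memtest86 from
    # CD/USB is still the more thorough option
    sudo memtester 1024M 3

    # SMART health summary, full attribute dump, and a long self-test
    # (/dev/sda is an example; substitute the node's actual data disk)
    sudo smartctl -H /dev/sda
    sudo smartctl -a /dev/sda
    sudo smartctl -t long /dev/sda

Reallocated or pending sectors in the smartctl output, or a failed self-test,
would point straight at the disk.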

Or...
If you can, move the CPU and/or memory and/or disk between two machines and 
see if the problem migrates to the other machine.  I'd probably do all 3 at 
once just to confirm it's one of them, then move them back one at a time.



Michael D. Black
Senior Scientist
Northrop Grumman Information Systems
Advanced Analytics Directorate



-----Original Message-----
From: Jim Twensky [mailto:jim.twen...@gmail.com]
Sent: Thursday, December 23, 2010 4:37 PM
To: core-u...@hadoop.apache.org
Subject: EXTERNAL:Tasktracker failing and getting black listed

Hi,

I have a 16+1 node Hadoop cluster where all tasktrackers (and
datanodes) are connected to the same switch and share the exact same
hardware and software configuration. When I run a Hadoop job, one of
the tasktrackers always produces one of these two errors ONLY during
the reduce tasks and eventually gets blacklisted.

---------------------------------------------------------------------------------
org.apache.hadoop.fs.ChecksumException: Checksum Error
        at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
        at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
        at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
        at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
        at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
        at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:374)
        at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
        at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
        at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:111)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:86)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:173)
        at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1214)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1500)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:512)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:585)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
---------------------------------------------------------------------------------

or

---------------------------------------------------------------------------------
java.lang.RuntimeException: next value iterator failed
        at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:160)
        at src.expinions.PhraseGen.ReduceClass.reduce(ReduceClass.java:17)
        at src.expinions.PhraseGen.ReduceClass.reduce(ReduceClass.java:10)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
        at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1214)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1500)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:512)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:585)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
        at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
        at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
        at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
        at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
        at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
        at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
        at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
        at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
        at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:111)
        at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:157)
---------------------------------------------------------------------------------

It is always the same node, and it runs the map tasks without any
problems. I double-checked the available disk space and other
settings and couldn't find anything different. I also tried different
jobs and different inputs, but the result is always the same.

Any ideas?
