Colin, how about writing a streaming mapper that simply runs md5sum on each file it gets as input? Run that along with the identity reducer, and you should be able to tell pretty quickly whether there's an HDFS corruption issue.
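Something along these lines should do it (an untested sketch; the streaming jar path and the input/output directories below are just placeholders for whatever your setup actually uses):

  bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
      -input /user/colin/gzfiles \
      -output /user/colin/md5check \
      -mapper /usr/bin/md5sum \
      -reducer org.apache.hadoop.mapred.lib.IdentityReducer

Since gzip isn't splittable, each map task gets exactly one .gz file, so any task that dies with the "incorrect data check" error narrows things down to a single file. The checksums from the tasks that do succeed can be compared against "gunzip -c file.gz | md5sum" of your local copies, since streaming hands the mapper the decompressed lines rather than the raw .gz bytes. If you want the file name printed next to each checksum, swap in a small wrapper script for the mapper that echoes the map_input_file environment variable along with the md5 (if I remember right, streaming exposes the job config to the mapper as environment variables with the dots replaced by underscores).

For a quick spot-check of a single file, you can also do something like

  bin/hadoop dfs -cat /path/in/hdfs/file.gz | md5sum

and compare that against md5sum of the same .gz on your local disk.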
Norbert

On Tue, Apr 8, 2008 at 5:50 PM, Colin Freas <[EMAIL PROTECTED]> wrote:
> so, in an attempt to track down this problem, i've stripped out most of the
> files for input, trying to identify which ones are causing the problem.
>
> i've narrowed it down, but i can't pinpoint it. i keep getting these
> incorrect data check errors below, but the .gz files test fine with gzip.
>
> is there some way to run an md5 or something on the files in hdfs and
> compare it to the checksum of the files on my local machine?
>
> i've looked around the lists and through the various options to send to
> .../bin/hadoop, but nothing is jumping out at me.
>
> this is particularly frustrating because it's causing my jobs to fail,
> rather than skipping the problematic input files. i've also looked through
> the conf file and don't see anything similar about skipping bad files
> without killing the job.
>
> -colin
>
>
> On Tue, Apr 8, 2008 at 11:53 AM, Colin Freas <[EMAIL PROTECTED]> wrote:
>
> > running a job on my 5 node cluster, i get these intermittent exceptions in
> > my logs:
> >
> > java.io.IOException: incorrect data check
> >     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
> >     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
> >     at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
> >     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
> >     at java.io.InputStream.read(InputStream.java:89)
> >     at org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
> >     at org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
> >     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
> >     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
> >     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
> >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
> >     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
> >
> > they occur across all the nodes, but i can't figure out which file is
> > causing the problem. i'm working on the assumption it's a specific file
> > because it's precisely the same error that occurs on each node. i've
> > scoured the logs and can't find any reference to which file caused the
> > hiccup. but this is causing the job to fail. other files are processed
> > without a problem. the files are 720 .gz files, ~100mb each. other files
> > are processed on each node without a problem. i'm in the middle of testing
> > the .gz files, but i don't think the problem is necessarily in the source
> > data, as much as in when i copied it into hdfs.
> >
> > so my questions are these:
> > is this a known issue?
> > is there some way to determine which file or files are causing these
> > exceptions?
> > is there a way to run something like "gzip -t blah.gz" on the file in
> > hdfs? or maybe a checksum?
> > is there a reason other than a corrupt datafile that would be causing
> > this?
> > in the original mapreduce paper, they talk about a mechanism to skip
> > records that cause problems. is there a way to have hadoop skip these
> > problematic files and the associated records and continue with the job?
> >
> >
> > thanks,
> > colin
> >
>