Hi all,

My MR job (the consumer pipeline) uses SequenceFileInputFormat as the input format in MultipleInputs:

    for (FileStatus input : inputs) {
        MultipleInputs.addInputPath(job, input.getPath(),
                SequenceFileInputFormat.class, MyMapper.class);
    }
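For context, the input files come from a separate generator process that writes them with SequenceFile.Writer. Roughly, it does something like the following (a minimal sketch; the real writer is buried in a library class, so the class name, output path, and record below are just placeholders):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class Generator {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path outPath = new Path(args[0]); // placeholder output path

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, outPath, Text.class, BytesWritable.class);
            // Appends are buffered in memory; until enough data accumulates
            // or the writer is closed, a concurrent reader may see a
            // zero-length (or header-only) file on disk.
            writer.append(new Text("key"), new BytesWritable(new byte[] { 1, 2, 3 }));
            writer.close(); // only now is everything flushed to the file
        }
    }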
My application fails in the following situation: the generator (using SequenceFile.Writer) has just created a zero-size file and keeps appending key-value pairs to it, but the content is not yet large enough for anything to be flushed to the file (not even a first block). If the consumer pipeline kicks off at that moment and tries to consume the file, it treats it as a corrupted file and fails with:

java.io.EOFException: null
    at java.io.DataInputStream.readFully(DataInputStream.java:197) ~[na:1.7.0_60-ea]
    at java.io.DataInputStream.readFully(DataInputStream.java:169) ~[na:1.7.0_60-ea]
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146) ~[hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) [hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1832) [hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1752) [hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773) [hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:762) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235) [hadoop-mapreduce-client-common-2.2.0.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_60-ea]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_60-ea]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60-ea]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60-ea]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_60-ea]

All of this code is inside library classes, so there is not much I can do from within my MR job. Is there a way to skip a single *corrupted* SequenceFile? (The only workaround I could come up with myself is the pre-check sketched in the P.S. below.)

One more observation: after the job fails, when I open the input file in vim, it SEEMS to have the proper header (the SEQ magic, key/value class names, etc.), so I am not sure which part is actually corrupted. Maybe it is just a matter of timing, i.e., at the moment the read happened the header was not there yet. NOT SURE this will help, but here is the header (plus perhaps a little content) of the "corrupted" file:

SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable ^@^@^@^@^@^@ù<9a>ñ> <æfá#¬6<94>IÇ^@^@^@<8c>^@^@^@%$........

And here is an empty sequence file, which the consumer handles fine:

SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable ^@^@^@^@^@^@<86>bÍI§ï8<97>ê=E^OÝ¢>^D

Any ideas? Thanks in advance.

Johnny
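P.S. Here is the pre-check I mentioned above, as a rough sketch only (it assumes a FileSystem fs and Configuration conf in scope, plus the same job/inputs/MyMapper as in my driver): probe each candidate file with a SequenceFile.Reader and skip the ones that cannot even be opened.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

    public class Driver {
        static void addReadableInputs(Job job, FileSystem fs, Configuration conf,
                                      FileStatus[] inputs) {
            for (FileStatus input : inputs) {
                try {
                    // A half-written file with an incomplete header fails here
                    // with EOFException, instead of later inside a map task.
                    SequenceFile.Reader probe =
                            new SequenceFile.Reader(fs, input.getPath(), conf);
                    probe.close();
                } catch (IOException e) {
                    System.err.println("Skipping unreadable input: " + input.getPath());
                    continue;
                }
                MultipleInputs.addInputPath(job, input.getPath(),
                        SequenceFileInputFormat.class, MyMapper.class);
            }
        }
    }

Of course this is racy as well (the file can still be appended to between the probe and the map task), so a reader-side fix would be nicer.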