Hi all,

My MR job (the consumer pipeline) uses SequenceFileInputFormat as the input format in MultipleInputs:

    for (FileStatus input : inputs) {
        MultipleInputs.addInputPath(job, input.getPath(),
                SequenceFileInputFormat.class, MyMapper.class);
    }
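For context, the input files come from a separate generator process that writes them with SequenceFile.Writer. Roughly, it does something like the following (a minimal sketch; the real writer is buried in a library class, so the class name, output path, and record below are just placeholders):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class Generator {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path outPath = new Path(args[0]); // placeholder output path

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, outPath, Text.class, BytesWritable.class);
            // Appends are buffered in memory; until enough data accumulates
            // or the writer is closed, a concurrent reader may see a
            // zero-length (or header-only) file on disk.
            writer.append(new Text("key"), new BytesWritable(new byte[] { 1, 2, 3 }));
            writer.close(); // only now is everything flushed to the file
        }
    }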
My application fails in the following situation: the generator (using SequenceFile.Writer) has just created a zero-size file and keeps appending key-value pairs to it, but the content is not yet large enough for anything to be flushed to the file (not even a first block). If the consumer pipeline kicks off at that moment and tries to consume the file, it treats it as a corrupted file and fails with:

java.io.EOFException: null
    at java.io.DataInputStream.readFully(DataInputStream.java:197) ~[na:1.7.0_60-ea]
    at java.io.DataInputStream.readFully(DataInputStream.java:169) ~[na:1.7.0_60-ea]
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146) ~[hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) [hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1832) [hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1752) [hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773) [hadoop-common-2.2.0.jar:na]
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:762) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339) [hadoop-mapreduce-client-core-2.2.0.jar:na]
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235) [hadoop-mapreduce-client-common-2.2.0.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_60-ea]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_60-ea]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60-ea]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60-ea]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_60-ea]

All of this code is inside library classes, so there is not much I can do from within my MR job. Is there a way to skip a single *corrupted* SequenceFile? (The only workaround I could come up with myself is the pre-check sketched in the P.S. below.)

One more observation: after the job fails, when I open the input file in vim, it SEEMS to have the proper header (the SEQ magic, key/value class names, etc.), so I am not sure which part is actually corrupted. Maybe it is just a matter of timing, i.e., at the moment the read happened the header was not there yet. NOT SURE this will help, but here is the header (plus perhaps a little content) of the "corrupted" file:

SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable ^@^@^@^@^@^@ù<9a>ñ> <æfá#¬6<94>IÇ^@^@^@<8c>^@^@^@%$........

And here is an empty sequence file, which the consumer handles fine:

SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable ^@^@^@^@^@^@<86>bÍI§ï8<97>ê=E^OÝ¢>^D

Any ideas? Thanks in advance.

Johnny
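P.S. Here is the pre-check I mentioned above, as a rough sketch only (it assumes a FileSystem fs and Configuration conf in scope, plus the same job/inputs/MyMapper as in my driver): probe each candidate file with a SequenceFile.Reader and skip the ones that cannot even be opened.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

    public class Driver {
        static void addReadableInputs(Job job, FileSystem fs, Configuration conf,
                                      FileStatus[] inputs) {
            for (FileStatus input : inputs) {
                try {
                    // A half-written file with an incomplete header fails here
                    // with EOFException, instead of later inside a map task.
                    SequenceFile.Reader probe =
                            new SequenceFile.Reader(fs, input.getPath(), conf);
                    probe.close();
                } catch (IOException e) {
                    System.err.println("Skipping unreadable input: " + input.getPath());
                    continue;
                }
                MultipleInputs.addInputPath(job, input.getPath(),
                        SequenceFileInputFormat.class, MyMapper.class);
            }
        }
    }

Of course this is racy as well (the file can still be appended to between the probe and the map task), so a reader-side fix would be nicer.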