Hi,

For some reason the fetcher sometimes produces corrupt, unreadable segments. It then exits with exceptions like "problem advancing post", "negative array size exception", etc.:
java.lang.RuntimeException: problem advancing post rec#702
        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1225)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:250)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:246)
        at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1431)
        at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1392)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:520)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:197)
        at org.apache.hadoop.io.Text.readString(Text.java:402)
        at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
        at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
        at org.apache.nutch.parse.ParseImpl.readFields(ParseImpl.java:70)
        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1282)
        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1222)
        ... 7 more

2013-05-26 22:41:41,344 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1520)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1556)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1529)

These errors then produce the following exception when trying to index:
java.io.IOException: IO error in map input file file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-00000
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:242)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-00000 at 2620416
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:219)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1992)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2124)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
        ... 5 more

Is there any way we can debug this? The errors are usually related to Nutch reading metadata, but since we cannot read the metadata, I cannot know what data is causing the issue :) Any hints to share on how to tackle these issues?

Markus
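P.S. One thing I was considering is walking the broken part file with a plain SequenceFile.Reader to see exactly which record it dies on, roughly like the sketch below. It is untested and written against the Hadoop 1.x API; the class name and the checksum-skipping part are just my guesses, and the Nutch jars would need to be on the classpath so the CrawlDatum value class can be loaded:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SegmentScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    // Skip the .crc verification so we can read up to (and past) the checksum error
    fs.setVerifyChecksum(false);
    Path part = new Path(args[0]); // e.g. crawl/segments/20130526223014/crawl_parse/part-00000

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

    long records = 0;
    long lastGoodPos = reader.getPosition();
    try {
      while (reader.next(key, value)) {
        records++;
        lastGoodPos = reader.getPosition();
      }
      System.out.println("Read " + records + " records without error");
    } catch (Exception e) {
      // Report the last record that deserialized cleanly before the failure
      System.err.println("Failed after record " + records
          + " (last good key: " + key + ", file position: " + lastGoodPos + ")");
      e.printStackTrace();
    } finally {
      reader.close();
    }
  }
}

Not sure whether that is the right approach, or whether bin/nutch readseg would just hit the same exception.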