(nutch-nightly, hadoop 0.9.1)
I hit this during a nightly crawl that adds about 40K pages to a
~150K-page nutch db. The crawl has run fine for the past five nights
with the same settings and script.
The error happened during the nutch mergesegs step of the re-crawl
cycle. mergesegs crashes with:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:547)
        at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:595)
And then the following re-crawl commands (invert, index, dedup) fail,
leaving me with a corrupt webdb/index.
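For what it's worth, a guard like the one below would at least keep a
failed merge from dragging the later steps down with it. This is only a
sketch: run_steps and the step labels are my own illustration, the real
command lines would come from the re-crawl script, and it does nothing
about the checksum error itself.

```python
import subprocess


def run_steps(steps, runner=subprocess.call):
    """Run (name, argv) steps in order and stop at the first failure.

    `runner` is called with argv and must return the exit status;
    it defaults to subprocess.call. Returns a tuple
    (names_of_completed_steps, name_of_failed_step_or_None).
    """
    completed = []
    for name, argv in steps:
        if runner(argv) != 0:
            # Do not run the remaining steps against a half-merged segment.
            return completed, name
        completed.append(name)
    return completed, None
```

The idea is to pass the mergesegs, invert, index and dedup command
lines as steps, and only swap the new index into place when the second
return value is None.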
The error below is in my hadoop log.
The file indicated (bad_files/data.-931801681) is a 255MB binary file
-- running strings on it shows a lot of URIs. There's also a
2MB .data.crc-931801681 file, all binary.
Any idea how this happened, or how to avoid it?
2007-01-15 01:56:52,303 INFO mapred.MapTask - opened part-0.out
2007-01-15 01:56:52,696 WARN dfs.DistributedFileSystem - Moving bad file /array/nutch-nightly/crawl/segments/20070114192132/content/part-00000/data to /array/nutch-nightly/bad_files/data.-931801681
2007-01-15 01:56:52,739 WARN mapred.LocalJobRunner - job_u5iokg
org.apache.hadoop.fs.ChecksumException: Checksum error: /array/nutch-nightly/crawl/segments/20070114192132/content/part-00000/data at 2387968
        at org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:138)
        at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:114)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:189)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1280)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1191)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1237)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:71)
        at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:123)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:203)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
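
One observation on the offset in the ChecksumException: 2387968 is an
exact multiple of 512 (4664 x 512). That would fit a scheme where one
CRC is kept per fixed-size chunk of the data file in a sidecar .crc
file, and at 4 bytes per 512 it is also roughly consistent with a 2MB
.crc next to a 255MB data file. The sketch below only illustrates that
idea. The 512-byte chunk size and the use of CRC-32 are assumptions on
my part, the function names are made up, and this is not Hadoop's
actual .crc format or code.

```python
import zlib

CHUNK = 512  # assumed number of data bytes covered by each checksum


def chunk_checksums(data, chunk=CHUNK):
    """Return one CRC-32 value per `chunk`-sized slice of `data`."""
    return [
        zlib.crc32(data[i:i + chunk]) & 0xFFFFFFFF
        for i in range(0, len(data), chunk)
    ]


def first_bad_offset(data, sums, chunk=CHUNK):
    """Return the starting byte offset of the first chunk whose CRC-32
    differs from the stored value in `sums`, or None if all match."""
    for index, expected in enumerate(sums):
        start = index * chunk
        actual = zlib.crc32(data[start:start + chunk]) & 0xFFFFFFFF
        if actual != expected:
            return start
    return None


# Usage: flip one bit in the third chunk and locate the damage.
good = bytes(range(256)) * 8            # 2048 bytes, four chunks
sums = chunk_checksums(good)
damaged = bytearray(good)
damaged[1030] ^= 0x01                   # byte 1030 lies in chunk 1024..1535
print(first_bad_offset(bytes(damaged), sums))  # prints 1024
```

Under that assumption the reported position is where the failing chunk
starts, so the damaged bytes would be somewhere in the 512 bytes from
that offset, not necessarily at the offset itself.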
--
http://variogr.am/