Nutch does not seem to handle a large number of URLs well: it uses several tens of gigabytes of disk space (about 80 GB) just to fetch, merge segments, and index roughly 30-40k URLs. Is that normal?
Please let me know where the problem is.
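For reference, after fetching I run roughly the following sequence (a sketch of my script; the exact arguments may differ slightly):

  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
  # (then I replace the old segments in crawl/segments with the merged one)
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*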
I see this in hadoop.log:
2009-03-05 01:39:10,227 INFO crawl.CrawlDb - CrawlDb update: done
2009-03-05 01:39:12,150 INFO segment.SegmentMerger - Merging 2 segments to crawl/MERGEDsegments/20090305013912
2009-03-05 01:39:12,160 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20090304102421
2009-03-05 01:39:12,190 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20090304165203
2009-03-05 01:39:12,195 INFO segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
2009-03-05 01:39:12,250 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-03-05 15:52:23,822 WARN mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
2009-03-05 15:52:28,831 INFO crawl.LinkDb - LinkDb: starting
2009-03-05 15:52:28,832 INFO crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2009-03-05 15:52:28,832 INFO crawl.LinkDb - LinkDb: URL normalize: true
2009-03-05 15:52:28,833 INFO crawl.LinkDb - LinkDb: URL filter: true
2009-03-05 15:52:28,873 INFO crawl.LinkDb - LinkDb: adding segment: crawl/segments/20090305013912
2009-03-05 15:52:32,950 WARN mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/fss/nutch/crawl/segments/20090305013912/parse_data/part-00000/data at 6814720
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:211)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:238)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:194)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
2009-03-05 15:52:33,937 FATAL crawl.LinkDb - LinkDb: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
2009-03-05 15:52:34,808 INFO indexer.Indexer - Indexer: starting
2009-03-05 15:52:34,822 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2009-03-05 15:52:34,822 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2009-03-05 15:52:34,822 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20090305013912
2009-03-05 15:52:36,797 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-03-05 15:52:36,898 WARN mime.MimeTypesReader - Not a <mime-info/> configuration document
2009-03-05 15:52:36,898 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2009-03-05 15:52:37,620 WARN mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/fss/nutch/crawl/segments/20090305013912/crawl_fetch/part-00000/data at 1047552
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:211)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:238)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:194)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
2009-03-05 15:52:37,627 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)
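Everything runs through the LocalJobRunner in a single JVM, so I wonder whether the OutOfMemoryError would go away if I simply gave the JVM a bigger heap before re-running the merge, e.g. (if I read bin/nutch correctly, NUTCH_HEAPSIZE is in MB and defaults to 1000):

  export NUTCH_HEAPSIZE=2000
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments

Am I right that the later ChecksumExceptions on that segment are just fallout from the failed merge, so it should be deleted and re-merged? Even so, 80 GB of disk for 30-40k URLs seems like a lot to me.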