Hi All,

I had a big segment (size = 25 GB). Using the mergesegs utility with slice=20000, I divided it into around 400 smaller segments. I then re-parsed all of the segments (using the parse command), because we have made changes to the Nutch parsing modules. Parsing completed successfully for all segments, and the linkdb was also generated successfully.
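
For reference, the merge and re-parse steps looked roughly like this (the paths here are placeholders for our actual directories):

   bin/nutch mergesegs crawl/segments_sliced -dir crawl/segments -slice 20000
   # then, for each of the ~400 resulting slices:
   bin/nutch parse crawl/segments_sliced/<segment>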

I have the following questions.

1. Do I need to run updatedb on the re-parsed segments again? When I run the updatedb command on these segments, the job fails with the checksum exception shown below.
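
For reference, the invocation is roughly as follows (the crawldb path is a placeholder for ours):

   bin/nutch updatedb crawl/crawldb -dir crawl/segments_sliced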
----------------------------------------------------------------------------------------
2009-08-17 20:09:33,679 WARN  fs.FSInputChecker - Problem reading checksum file: java.io.EOFException. Ignoring.
2009-08-17 20:09:33,700 WARN  mapred.LocalJobRunner - job_fmwtmv
java.lang.RuntimeException: Summer buffer overflow b.len=4096, off=0, summed=3584, read=4096, bytesPerSum=1, inSum=512
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:201)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1525)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1436)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1482)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:73)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Caused by: java.lang.ArrayIndexOutOfBoundsException
        at java.util.zip.CRC32.update(CRC32.java:43)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:199)
        ... 16 more
2009-08-17 20:09:33,749 FATAL crawl.CrawlDb - CrawlDb update: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:97)
        at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:199)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:152)
-----------------------------------------------------------------------------------------------------------------------------
2. When I run the index command on the segments, crawldb, and linkdb, I get a "Java heap space" error, whereas with the single big segment and the same Java heap configuration we were able to index successfully. Are we doing something wrong? We would be thankful if somebody could give us some pointers on these problems.
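
For reference, the indexing invocation and heap setting look roughly like this (the paths and the heap value are placeholders for our actual setup; we run in local mode, so the JVM heap comes from NUTCH_HEAPSIZE in bin/nutch):

   export NUTCH_HEAPSIZE=1000   # MB; same value as for the single big segment
   bin/nutch index crawl/index crawl/crawldb crawl/linkdb crawl/segments_sliced/*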
--------------------------------------------------------------------------------
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.Text.writeString(Text.java:399)
        at org.apache.nutch.metadata.Metadata.write(Metadata.java:225)
        at org.apache.nutch.parse.ParseData.write(ParseData.java:165)
        at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:154)
        at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:65)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:315)
        at org.apache.nutch.indexer.Indexer.map(Indexer.java:362)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2009-08-23 23:19:28,569 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:329)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:351)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:334)
----------------------------------------------------------------------------------------------------------------
--
Thanks and Regards,
Vishal Vachhani