Hi Vishal,

I got the same problem while running updatedb and invertlinks. Have you found a solution to the problem? Please let me know if you get the solution.
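For context, the commands I am running look roughly like the following (the crawl directory names here are just placeholders, not our actual paths):

  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments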
Thank You,
Srinivas

On Mon, Aug 24, 2009 at 2:00 PM, vishal vachhani <[email protected]> wrote:

> Hi All,
> I had a big segment (size = 25 GB). Using the "mergesegs" utility with
> slice=20000, I have divided the segment into around 400 small segments. I
> re-parsed all the segments (using the parse command) because we have made
> changes to the parsing modules of Nutch. Parsing completed successfully for
> all segments. The linkdb was also generated successfully.
>
> I have the following questions.
>
> 1. Do I need to run "updatedb" on the re-parsed segments again? When I run
> the updatedb command on these segments, I get the following exception.
>
> ----------------------------------------------------------------------------------------
> 2009-08-17 20:09:33,679 WARN fs.FSInputChecker - Problem reading checksum
> file: java.io.EOFException. Ignoring.
> 2009-08-17 20:09:33,700 WARN mapred.LocalJobRunner - job_fmwtmv
> java.lang.RuntimeException: Summer buffer overflow b.len=4096, off=0,
> summed=3584, read=4096, bytesPerSum=1, inSum=512
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:201)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
>         at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1525)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1436)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1482)
>         at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:73)
>         at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>         at java.util.zip.CRC32.update(CRC32.java:43)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:199)
>         ... 16 more
> 2009-08-17 20:09:33,749 FATAL crawl.CrawlDb - CrawlDb update:
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:97)
>         at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:199)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:152)
> -----------------------------------------------------------------------------------------------------------------------------
>
> 2. When I run the "index" command on the segments, crawldb and linkdb, I
> get a "Java heap space" error, while with the single big segment and the
> same Java heap configuration we were able to index the segments. Are we
> doing something wrong? We would be thankful if somebody could give us some
> pointers on these problems.
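>
> For reference, the indexing invocation looks roughly like the following
> (the directory names here are placeholders rather than our actual paths):
>
>   bin/nutch index crawl/index crawl/crawldb crawl/linkdb crawl/segments/*
>
> where crawl/segments/* expands to the ~400 sliced segments.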
>
> --------------------------------------------------------------------------------
>
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2786)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.io.Text.writeString(Text.java:399)
>         at org.apache.nutch.metadata.Metadata.write(Metadata.java:225)
>         at org.apache.nutch.parse.ParseData.write(ParseData.java:165)
>         at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:154)
>         at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:65)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:315)
>         at org.apache.nutch.indexer.Indexer.map(Indexer.java:362)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> 2009-08-23 23:19:28,569 FATAL indexer.Indexer - Indexer:
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:329)
>         at org.apache.nutch.indexer.Indexer.run(Indexer.java:351)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.indexer.Indexer.main(Indexer.java:334)
>
> ----------------------------------------------------------------------------------------------------------------
>
> --
> Thanks and Regards,
> Vishal Vachhani
> --
> http://cheyuta.wordpress.com
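By the way, on the heap space part: as far as I understand, since these jobs
run under LocalJobRunner the heap is that of the client JVM, which bin/nutch
sizes from the NUTCH_HEAPSIZE environment variable (in MB). Something like the
following (just a sketch, I have not confirmed that it resolves the error):

  export NUTCH_HEAPSIZE=2000
  bin/nutch index crawl/index crawl/crawldb crawl/linkdb crawl/segments/*

On a real cluster, mapred.child.java.opts in hadoop-site.xml would be the
setting to raise instead. Is that what you are adjusting on your side?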
