Hi Vishal,

I got the same problem while running updatedb and invertlinks. Have you found a solution to the problem? Please let me know if you get the solution.
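For context, the commands I am running look roughly like the following (the crawl directory names here are just placeholders, not our actual paths):

  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments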
Thank You,
Srinivas

On Mon, Aug 24, 2009 at 2:00 PM, vishal vachhani <[email protected]> wrote:

> Hi All,
> I had a big segment (size = 25 GB). Using the "mergesegs" utility with
> slice=20000, I have divided the segment into around 400 small segments. I
> re-parsed all the segments (using the parse command) because we have made
> changes to the parsing modules of Nutch. Parsing completed successfully for
> all segments. The linkdb was also generated successfully.
>
> I have the following questions.
>
> 1. Do I need to run "updatedb" on the re-parsed segments again? When I run
> the updatedb command on these segments, I get the following exception.
>
> ----------------------------------------------------------------------------------------
> 2009-08-17 20:09:33,679 WARN fs.FSInputChecker - Problem reading checksum
> file: java.io.EOFException. Ignoring.
> 2009-08-17 20:09:33,700 WARN mapred.LocalJobRunner - job_fmwtmv
> java.lang.RuntimeException: Summer buffer overflow b.len=4096, off=0,
> summed=3584, read=4096, bytesPerSum=1, inSum=512
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:201)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
>         at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1525)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1436)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1482)
>         at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:73)
>         at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>         at java.util.zip.CRC32.update(CRC32.java:43)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:199)
>         ... 16 more
> 2009-08-17 20:09:33,749 FATAL crawl.CrawlDb - CrawlDb update:
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:97)
>         at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:199)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:152)
> -----------------------------------------------------------------------------------------------------------------------------
>
> 2. When I run the "index" command on the segments, crawldb and linkdb, I
> get a "Java heap space" error, while with the single big segment and the
> same Java heap configuration we were able to index the segments. Are we
> doing something wrong? We would be thankful if somebody could give us some
> pointers on these problems.
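>
> For reference, the indexing invocation looks roughly like the following
> (the directory names here are placeholders rather than our actual paths):
>
>   bin/nutch index crawl/index crawl/crawldb crawl/linkdb crawl/segments/*
>
> where crawl/segments/* expands to the ~400 sliced segments.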
>
> --------------------------------------------------------------------------------
>
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2786)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.io.Text.writeString(Text.java:399)
>         at org.apache.nutch.metadata.Metadata.write(Metadata.java:225)
>         at org.apache.nutch.parse.ParseData.write(ParseData.java:165)
>         at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:154)
>         at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:65)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:315)
>         at org.apache.nutch.indexer.Indexer.map(Indexer.java:362)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> 2009-08-23 23:19:28,569 FATAL indexer.Indexer - Indexer:
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:329)
>         at org.apache.nutch.indexer.Indexer.run(Indexer.java:351)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.indexer.Indexer.main(Indexer.java:334)
>
> ----------------------------------------------------------------------------------------------------------------
>
> --
> Thanks and Regards,
> Vishal Vachhani
> --
> http://cheyuta.wordpress.com
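By the way, on the heap space part: as far as I understand, since these jobs
run under LocalJobRunner the heap is that of the client JVM, which bin/nutch
sizes from the NUTCH_HEAPSIZE environment variable (in MB). Something like the
following (just a sketch, I have not confirmed that it resolves the error):

  export NUTCH_HEAPSIZE=2000
  bin/nutch index crawl/index crawl/crawldb crawl/linkdb crawl/segments/*

On a real cluster, mapred.child.java.opts in hadoop-site.xml would be the
setting to raise instead. Is that what you are adjusting on your side?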
