I have no idea. But since you are not able to dump it, the problem must lie in the segment itself, not with updatedb!
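If the segment really is corrupt, one possible workaround (not a fix for the
root cause) is to move it out of the segments directory and let those URLs be
fetched again later: since updatedb failed, nothing from this segment ever
reached the crawldb, so a later generate cycle should select the same URLs
again. A rough sketch, using the paths from this thread (the "hadoop dfs"
syntax varies a bit across old Hadoop versions):

  # Move the corrupt segment aside so updatedb and any future
  # segment merges never see it.
  bin/hadoop dfs -mkdir /user/nutch/crawl/bad_segments
  bin/hadoop dfs -mv /user/nutch/crawl/segments/20090901230006 \
                     /user/nutch/crawl/bad_segments/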
On Wed, Sep 2, 2009 at 4:02 PM, zzeran <[email protected]> wrote:
>
> Hi Vishal,
>
> Thanks for your help.
>
> I've tried dumping the segment like you suggested, and indeed I got the
> following error message:
>
> SegmentReader: dump segment: /user/nutch/crawl/segments/20090901230006
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:1...)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:193...)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:206...)
>         at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> [the same stack trace was printed twice more]
>
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>         at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
>         at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
>
> So according to what you've said, the segment got corrupted. Is this
> really the case? Why was it corrupted? Is there any way I can avoid it?
>
> Thanks,
> Eran
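To see whether this is an isolated bad segment or something systematic, it
may be worth trying to dump every segment, not just the latest one. A rough
sketch, assuming the paths from this thread; the "dfs -ls" parsing is
illustrative, since the listing format differs across old Hadoop releases:

  # Try to dump each segment in turn; a corrupt one should fail
  # with an EOFException like the one above.
  for seg in $(bin/hadoop dfs -ls /user/nutch/crawl/segments \
               | awk '{print $NF}' | grep '/segments/'); do
    echo "checking $seg"
    bin/nutch readseg -dump "$seg" \
      "/user/nutch/tmp/dump_$(basename "$seg")" \
      || echo "POSSIBLY CORRUPT: $seg"
  done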
> vishal vachhani wrote:
> >
> > I have also seen this exception, though I have not been able to figure
> > out exactly why it happens. When I dump my segment using "readseg", it
> > also throws an exception, so I suspect that my segment got corrupted.
> > Please try to dump your segments and check whether the dump succeeds.
> >
> > Let me know if you are able to solve the problem.
> >
> > On Wed, Sep 2, 2009 at 2:23 PM, zzeran <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I'm new to Nutch and so far very impressed!
> >>
> >> I've been investigating Nutch for the past two weeks and have now
> >> started fetching pages from the internet (I'm doing a specific crawl
> >> on a few selected domains).
> >>
> >> I'm running Nutch with Hadoop on two machines running Ubuntu 9.04,
> >> using the DFS.
> >>
> >> I've tried running Nutch several times, but every time it seems to
> >> crash after 4-5 hours of crawling (even after I formatted the DFS
> >> and restarted the crawl).
> >>
> >> I've created a "loop" in a shell script that executes all the
> >> crawling phases, roughly as sketched below.
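> >> A simplified sketch of the loop (the paths, iteration count, and
> >> segment selection here are illustrative, not my exact script):
> >>
> >>   for i in 1 2 3 4 5; do
> >>     bin/nutch generate crawl/crawldb crawl/segments -topN 10000
> >>     # pick the newest segment; the "dfs -ls" output format varies
> >>     # across Hadoop versions, so the awk field may need adjusting
> >>     seg=$(bin/hadoop dfs -ls crawl/segments | awk '{print $NF}' | sort | tail -1)
> >>     bin/nutch fetch "$seg"
> >>     bin/nutch updatedb crawl/crawldb "$seg"
> >>   done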
> >>
> >> On the loop iteration that caused the crash (after 4-5 hours) I get
> >> the following:
> >>
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: starting
> >> Generator: segment: crawl/segments/20090901230006
> >> Generator: filtering: true
> >> Generator: topN: 10000
> >> Generator: Partitioning selected urls by host, for politeness.
> >> Generator: done.
> >> processing segment /user/nutch/crawl/segments/20090901230006
> >> Fetcher: starting
> >> Fetcher: segment: /user/nutch/crawl/segments/20090901230006
> >> Fetcher: done
> >> CrawlDb update: starting
> >> CrawlDb update: db: crawl/crawldb
> >> CrawlDb update: segments: [/user/nutch/crawl/segments/20090901230006]
> >> CrawlDb update: additions allowed: true
> >> CrawlDb update: URL normalizing: true
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> java.io.EOFException
> >>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
> >>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
> >>         at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
> >>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
> >>
> >> [the same stack trace was printed twice more]
> >>
> >> CrawlDb update: java.io.IOException: Job failed!
> >>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
> >>         at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
> >>         at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>         at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> Eran
> >
> > --
> > Thanks and Regards,
> > Vishal Vachhani

--
Thanks and Regards,
Vishal Vachhani
