Hi Vishal,
Thanks for your help.
I've tried dumping the segment like you suggested, and indeed I got the
error below. For reference, I ran the dump with roughly the following
command (from memory; the exact invocation may have differed slightly):
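
# "segdump" is just a local output directory name I made up for the dump
bin/nutch readseg -dump /user/nutch/crawl/segments/20090901230006 segdump
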
SegmentReader: dump segment: /user/nutch/crawl/segments/20090901230006
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:1)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:193
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:206
at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
So according to what you've said, the segment got corrupted. Is this really
the case? Why was it corrupted? Is there any way I can avoid it?
Thanks,
Eran
vishal vachhani wrote:
>
> I have also seen this exception, but I haven't been able to figure out
> exactly why it happens. When I dump my segment using "readseg", it also
> throws an exception, so I suspect that my segment got corrupted. Please
> try to dump your segments with something along these lines and check
> whether the dump succeeds:
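>
> # generic form -- substitute your actual segment and output paths
> bin/nutch readseg -dump <segment_dir> <output_dir>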
>
> Let me know if you are able to solve the problem.
>
>
> On Wed, Sep 2, 2009 at 2:23 PM, zzeran <[email protected]> wrote:
>
>>
>> Hi,
>>
>> I'm new to Nutch and so far very impressed!
>>
>> I've been investigating Nutch for the past two weeks and now I've started
>> fetching pages from the internet (I'm doing a specific crawl on a few
>> selected domains).
>>
>> I'm running Nutch on Hadoop across two machines, using the DFS, on
>> Ubuntu 9.04.
>>
>> I've tried running Nutch several times, but every time it crashes after
>> 4-5 hours of crawling (even after I formatted the DFS and restarted the
>> crawl).
>>
>> I've created a "loop" in a shell script that executes all the crawling
>> phases; a simplified sketch (from memory -- details may differ) follows:
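>>
>> #!/bin/bash
>> # One generate/fetch/updatedb cycle per pass. The segment-capture details
>> # below are reconstructed from memory and may differ from my real script.
>> while true; do
>>   # generate a fetch list; tee keeps the Generator output visible while
>>   # we grab the new segment path from its "Generator: segment:" log line
>>   segment=`bin/nutch generate crawl/crawldb crawl/segments -topN 10000 2>&1 | tee /dev/stderr | grep 'Generator: segment:' | awk '{print $3}'`
>>   echo "processing segment $segment"
>>   bin/nutch fetch $segment
>>   bin/nutch updatedb crawl/crawldb $segment
>> done
>>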
>> On the iteration that crashed (after 4-5 hours), the output was the
>> following:
>>
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl/segments/20090901230006
>> Generator: filtering: true
>> Generator: topN: 10000
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> processing segment /user/nutch/crawl/segments/20090901230006
>> Fetcher: starting
>> Fetcher: segment: /user/nutch/crawl/segments/20090901230006
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawl/crawldb
>> CrawlDb update: segments: [/user/nutch/crawl/segments/20090901230006]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> java.io.EOFException
>> at java.io.DataInputStream.readFully(DataInputStream.java:197)
>> at java.io.DataInputStream.readFully(DataInputStream.java:169)
>> at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
>> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>> at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>> at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
>> at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>
>> CrawlDb update: java.io.IOException: Job failed!
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>> at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
>> at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
>>
>>
>> Any ideas?
>>
>> Thanks,
>> Eran
>
>
> --
> Thanks and Regards,
> Vishal Vachhani
>
>