I have no idea. But since you are not able to dump it, the problem must lie in the segment itself, not with updatedb!
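If the segment really is corrupt, one possible workaround (not a fix for the
root cause) is to move it out of the segments directory and let those URLs be
fetched again later: since updatedb failed, nothing from this segment ever
reached the crawldb, so a later generate cycle should select the same URLs
again. A rough sketch, using the paths from this thread (the "hadoop dfs"
syntax varies a bit across old Hadoop versions):

  # Move the corrupt segment aside so updatedb and any future
  # segment merges never see it.
  bin/hadoop dfs -mkdir /user/nutch/crawl/bad_segments
  bin/hadoop dfs -mv /user/nutch/crawl/segments/20090901230006 \
                     /user/nutch/crawl/bad_segments/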
On Wed, Sep 2, 2009 at 4:02 PM, zzeran <[email protected]> wrote:
>
> Hi Vishal,
>
> Thanks for your help.
>
> I've tried dumping the segment like you suggested, and indeed I got the
> following error message:
>
> SegmentReader: dump segment: /user/nutch/crawl/segments/20090901230006
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:1...)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:193...)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:206...)
>         at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> [the same stack trace was printed twice more]
>
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>         at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
>         at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
>
> So according to what you've said, the segment got corrupted. Is this
> really the case? Why was it corrupted? Is there any way I can avoid it?
>
> Thanks,
> Eran
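To see whether this is an isolated bad segment or something systematic, it
may be worth trying to dump every segment, not just the latest one. A rough
sketch, assuming the paths from this thread; the "dfs -ls" parsing is
illustrative, since the listing format differs across old Hadoop releases:

  # Try to dump each segment in turn; a corrupt one should fail
  # with an EOFException like the one above.
  for seg in $(bin/hadoop dfs -ls /user/nutch/crawl/segments \
               | awk '{print $NF}' | grep '/segments/'); do
    echo "checking $seg"
    bin/nutch readseg -dump "$seg" \
      "/user/nutch/tmp/dump_$(basename "$seg")" \
      || echo "POSSIBLY CORRUPT: $seg"
  done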
> vishal vachhani wrote:
> >
> > I have also seen this exception, though I have not been able to figure
> > out exactly why it happens. When I dump my segment using "readseg", it
> > also throws an exception, so I suspect that my segment got corrupted.
> > Please try to dump your segments and check whether the dump succeeds.
> >
> > Let me know if you are able to solve the problem.
> >
> > On Wed, Sep 2, 2009 at 2:23 PM, zzeran <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I'm new to Nutch and so far very impressed!
> >>
> >> I've been investigating Nutch for the past two weeks and have now
> >> started fetching pages from the internet (I'm doing a specific crawl
> >> on a few selected domains).
> >>
> >> I'm running Nutch with Hadoop on two machines running Ubuntu 9.04,
> >> using the DFS.
> >>
> >> I've tried running Nutch several times, but every time it seems to
> >> crash after 4-5 hours of crawling (even after I formatted the DFS
> >> and restarted the crawl).
> >>
> >> I've created a "loop" in a shell script that executes all the
> >> crawling phases, roughly as sketched below.
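> >> A simplified sketch of the loop (the paths, iteration count, and
> >> segment selection here are illustrative, not my exact script):
> >>
> >>   for i in 1 2 3 4 5; do
> >>     bin/nutch generate crawl/crawldb crawl/segments -topN 10000
> >>     # pick the newest segment; the "dfs -ls" output format varies
> >>     # across Hadoop versions, so the awk field may need adjusting
> >>     seg=$(bin/hadoop dfs -ls crawl/segments | awk '{print $NF}' | sort | tail -1)
> >>     bin/nutch fetch "$seg"
> >>     bin/nutch updatedb crawl/crawldb "$seg"
> >>   done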
> >>
> >> On the loop iteration that caused the crash (after 4-5 hours) I get
> >> the following:
> >>
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: starting
> >> Generator: segment: crawl/segments/20090901230006
> >> Generator: filtering: true
> >> Generator: topN: 10000
> >> Generator: Partitioning selected urls by host, for politeness.
> >> Generator: done.
> >> processing segment /user/nutch/crawl/segments/20090901230006
> >> Fetcher: starting
> >> Fetcher: segment: /user/nutch/crawl/segments/20090901230006
> >> Fetcher: done
> >> CrawlDb update: starting
> >> CrawlDb update: db: crawl/crawldb
> >> CrawlDb update: segments: [/user/nutch/crawl/segments/20090901230006]
> >> CrawlDb update: additions allowed: true
> >> CrawlDb update: URL normalizing: true
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> java.io.EOFException
> >>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
> >>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
> >>         at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
> >>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
> >>
> >> [the same stack trace was printed twice more]
> >>
> >> CrawlDb update: java.io.IOException: Job failed!
> >>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
> >>         at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
> >>         at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>         at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> Eran
> >
> > --
> > Thanks and Regards,
> > Vishal Vachhani

--
Thanks and Regards,
Vishal Vachhani
