I guess the problem lies in the Configuration which I create with NutchConfiguration.create(), because Nutch runs the DeleteDuplicates class on the indexes anyway after finishing a crawl, right? What is really odd to me is that the number of documents reported by LUKE 0.7 differs from the number Nutch-nightly reports at the end of the crawl. I am referring to the number of documents merged at the end of each crawl. Does anybody have an idea what could cause this inconsistency?
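One possible reading of the two counts, which is only an assumption on my part and not something I have confirmed against this index: Lucene distinguishes between maxDoc (all allocated document slots, including ones flagged as deleted) and numDocs (live documents only). If Nutch reports the former and LUKE the latter, the gap of 78 would simply be documents already marked as deleted. The sketch below models that arithmetic with a plain java.util.BitSet; the class name DocCountSketch and the helper numDocs are hypothetical and do not touch the Lucene or Nutch APIs.

```java
import java.util.BitSet;

// Hypothetical stand-alone model; it does not read a real Lucene index.
public class DocCountSketch {

    /** Live documents = all allocated slots minus slots flagged as deleted. */
    static int numDocs(int maxDoc, BitSet deleted) {
        return maxDoc - deleted.cardinality();
    }

    public static void main(String[] args) {
        int maxDoc = 551;                 // count claimed at the end of the crawl
        BitSet deleted = new BitSet(maxDoc);
        deleted.set(0, 78);               // assumption: 78 slots flagged as deleted

        // Prints: maxDoc=551 numDocs=473
        System.out.println("maxDoc=" + maxDoc + " numDocs=" + numDocs(maxDoc, deleted));
    }
}
```

If this reading holds, 550 would be the highest document number in a 551-slot index, which is the index the exception quoted below complains about. Whether that is the actual cause I cannot say.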

Tim Benke wrote:
> Hello,
>
> I downloaded nutch-2007-03-27_06-52-06 and crawling works fine. I get
> an error when trying to run DeleteDuplicates directly in Eclipse. The
> corresponding "crawl1\\index" opens fine in LUKE 0.7 and queries also
> work. When trying to run it with args "crawl1\\indexes". output in
> hadoop.log is:
>
> 2007-03-27 23:14:33,151 INFO indexer.DeleteDuplicates - Dedup: starting
> 2007-03-27 23:14:33,198 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
> 2007-03-27 23:14:33,792 WARN mapred.LocalJobRunner - job_uyjjzt
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
>     at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>     at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
>     at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
>     at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> 2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>     at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>     at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
>     at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>     at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)
>
> Another thing I don't understand is that after crawling nutch claims
> 551 documents while LUKE states the index has only 473 documents.
>
> thanks in advance,
>
> Tim Benke

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
