Hello,

I downloaded nutch-2007-03-27_06-52-06 and crawling works fine. However, I get an error when I try to run DeleteDuplicates directly in Eclipse with the argument "crawl1\\indexes". The corresponding "crawl1\\index" opens fine in LUKE 0.7, and queries against it also work. The output in hadoop.log is:

2007-03-27 23:14:33,151 INFO indexer.DeleteDuplicates - Dedup: starting
2007-03-27 23:14:33,198 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
2007-03-27 23:14:33,792 WARN mapred.LocalJobRunner - job_uyjjzt
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
    at org.apache.lucene.util.BitVector.get(BitVector.java:72)
    at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)

Another thing I don't understand is that after crawling nutch claims 551 documents while LUKE states the index has only 473 documents.

thanks in advance,
Tim Benke

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
