I guess the problem lies in the Configuration, which I create with 
NutchConfiguration.create(), because Nutch runs the DeleteDuplicates 
class on the indexes anyway after finishing a crawl, right?
What is really odd to me is that the number of documents reported by 
LUKE 0.7 differs from the number Nutch-nightly reports at the end of 
the crawl. I am referring to the number of documents merged at the end 
of each crawl.
Does anybody have an idea what could cause this inconsistency?
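
One possible explanation for the differing counts, which is only an 
assumption and not something confirmed here: a Lucene index distinguishes 
between the total number of document slots (IndexReader.maxDoc()) and the 
number of documents not marked as deleted (IndexReader.numDocs()). If one 
tool reports the first and the other the second, the numbers differ by the 
number of deleted documents. The sketch below only models that arithmetic 
with a plain java.util.BitSet; the class name DocCountSketch and the figure 
of 78 deletions are hypothetical, chosen so that the result matches the 551 
and 473 from my first mail.

```java
import java.util.BitSet;

public class DocCountSketch {
    // Live documents = all document slots minus those flagged as deleted.
    // This mirrors the relationship between maxDoc and numDocs in Lucene,
    // but it does not touch a real index.
    static int numDocs(int maxDoc, BitSet deleted) {
        return maxDoc - deleted.cardinality();
    }

    public static void main(String[] args) {
        int maxDoc = 551;                  // count reported after the crawl
        BitSet deleted = new BitSet(maxDoc);
        deleted.set(0, 78);                // hypothetical: 78 slots marked deleted

        System.out.println("maxDoc  = " + maxDoc);
        System.out.println("numDocs = " + numDocs(maxDoc, deleted));
    }
}
```

Whether this is what actually happens here would have to be checked against 
the real index, for example by comparing both values in LUKE.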

Tim Benke wrote:
> Hello,
>
> I downloaded nutch-2007-03-27_06-52-06 and crawling works fine. I get 
> an error when trying to run DeleteDuplicates directly in Eclipse. The 
> corresponding "crawl1\\index" opens fine in LUKE 0.7 and queries also 
> work. When I run it with args "crawl1\\indexes", the output in 
> hadoop.log is:
>
> 2007-03-27 23:14:33,151 INFO  indexer.DeleteDuplicates - Dedup: starting
> 2007-03-27 23:14:33,198 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
> 2007-03-27 23:14:33,792 WARN  mapred.LocalJobRunner - job_uyjjzt
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
>   at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>   at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
>   at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
>   at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> 2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>   at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
>   at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>   at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)
>
> Another thing I don't understand is that after crawling nutch claims 
> 551 documents while LUKE states the index has only 473 documents.
>
> thanks in advance,
>
> Tim Benke


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
