Hello,

We are using Nutch 0.9 on a Linux server to index websites of a specific
top-level domain. Unfortunately, for the past few days the dedup phase (which
runs just after the index phase) has been failing. We didn't change anything
and don't really understand why this is happening now. The only clue we can
find in the Hadoop logfile is the following:

2009-04-09 22:57:17,084 INFO  indexer.DeleteDuplicates - Dedup: starting
2009-04-09 22:57:17,152 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: /mnt/crawl_new/NEWindexes
2009-04-09 23:07:57,376 WARN  mapred.LocalJobRunner - job_ocfvcm
java.io.IOException: Lock obtain timed out: l...@file:/mnt/crawl_new/NEWindexes/part-00000/write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:69)
        at org.apache.lucene.index.IndexReader.aquireWriteLock(IndexReader.java:526)
        at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:551)
        at org.apache.nutch.indexer.DeleteDuplicates.reduce(DeleteDuplicates.java:378)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
2009-04-09 23:07:57,445 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:482)
        at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)
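Our best guess so far is that a stale write.lock from an earlier run (perhaps
one that was killed or crashed) is still sitting in the part-00000 index and
blocking the dedup job. In case it helps, here is a minimal sketch of what we
plan to run to check for and clear such a lock, assuming the Lucene 2.x API
that Nutch 0.9 ships with (the index path is taken from the log above; the
class name is our own):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ClearStaleLock {
    public static void main(String[] args) throws Exception {
        // The part index named in the log above; create=false so the
        // existing index is opened, never overwritten.
        Directory dir =
            FSDirectory.getDirectory("/mnt/crawl_new/NEWindexes/part-00000", false);
        if (IndexReader.isLocked(dir)) {
            // Forcibly removes write.lock -- only safe when no other
            // process is currently writing to this index.
            IndexReader.unlock(dir);
            System.out.println("Removed stale write.lock");
        } else {
            System.out.println("Index is not locked");
        }
        dir.close();
    }
}

We have not run this yet, since we would first like to understand why the
lock is being left behind at all.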

Does anyone have an idea what is going wrong here?

Many thanks for any help.
Regards
