Hello,
We are using Nutch 0.9 on a Linux server to index websites of a specific top-level domain. Unfortunately, for the past few days the dedup phase (which runs just after the index phase) has stopped working. We didn't change anything and don't really understand why this is happening now. The only thing we can see in the hadoop logfile are the following lines:
2009-04-09 22:57:17,084 INFO  indexer.DeleteDuplicates - Dedup: starting
2009-04-09 22:57:17,152 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: /mnt/crawl_new/NEWindexes
2009-04-09 23:07:57,376 WARN  mapred.LocalJobRunner - job_ocfvcm
java.io.IOException: Lock obtain timed out: l...@file:/mnt/crawl_new/NEWindexes/part-00000/write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:69)
        at org.apache.lucene.index.IndexReader.aquireWriteLock(IndexReader.java:526)
        at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:551)
        at org.apache.nutch.indexer.DeleteDuplicates.reduce(DeleteDuplicates.java:378)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
2009-04-09 23:07:57,445 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:482)
        at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)
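In case it helps with diagnosis: the exception comes from Lucene's file-based write.lock under part-00000. Below is a minimal sketch (assuming the Lucene 2.x API that Nutch 0.9 bundles, and the index path taken from the log above; class and variable names are just illustrative) of how one could check whether that lock is still held before running dedup again:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CheckWriteLock {
        public static void main(String[] args) throws Exception {
            // One of the part indexes named in the log above
            Directory dir = FSDirectory.getDirectory("/mnt/crawl_new/NEWindexes/part-00000");
            if (IndexReader.isLocked(dir)) {
                System.out.println("write.lock is still held - a previous index/dedup job may not have released it");
                // IndexReader.unlock(dir);  // only safe if no other process is writing to this index
            } else {
                System.out.println("no lock held - the timeout probably comes from a concurrent writer");
            }
            dir.close();
        }
    }

Forcibly unlocking would of course only be safe if no other indexing process is still running against that directory.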
Does anyone have an idea what is going wrong here?
Many thanks for the help
Regards