I got this exception a lot, too. I haven't tested the patch by Andrzej
yet; instead I just wrapped the doc.add() lines in the indexer reduce
function in a try-catch block. That way the indexing finishes even when
a null value turns up, and I can see in the log file which documents
haven't been indexed.
Wouldn't it be a good idea to catch every exception that affects only
one document in loops like this? At least I don't like it when an
indexing process dies after a few hours because a single document
triggers such an exception.
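A minimal sketch of that per-document guard, in plain Java rather than
the actual Nutch/Lucene API (indexDocument and the field-map shape are
hypothetical stand-ins for the real doc.add() calls): one bad document
is logged and skipped instead of killing the whole reduce.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexGuard {

    // Hypothetical per-document indexing step. Throws on a null field
    // value, mimicking the check in Lucene's Field constructor that
    // produces "value cannot be null".
    static void indexDocument(Map<String, String> doc) {
        for (Map.Entry<String, String> e : doc.entrySet()) {
            if (e.getValue() == null) {
                throw new NullPointerException(
                        "value cannot be null: " + e.getKey());
            }
        }
        // ... real code would build Lucene Fields and add the document here
    }

    // Wrap each document individually so one failure does not abort
    // the loop; collect messages for the log and carry on.
    static List<String> indexAll(List<Map<String, String>> docs) {
        List<String> skipped = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            try {
                indexDocument(docs.get(i));
            } catch (RuntimeException ex) {
                skipped.add("doc " + i + ": " + ex.getMessage());
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();

        Map<String, String> good = new HashMap<>();
        good.put("url", "http://example.com/a");
        good.put("title", "A");
        docs.add(good);

        Map<String, String> bad = new HashMap<>();
        bad.put("url", "http://example.com/b");
        bad.put("title", null); // this one would have crashed the job
        docs.add(bad);

        List<String> skipped = indexAll(docs);
        System.out.println("skipped " + skipped.size() + " document(s)");
        for (String s : skipped) {
            System.out.println(s);
        }
    }
}
```

The trade-off is that silently skipping documents can hide real bugs,
so the catch should be narrow (here RuntimeException, not Throwable)
and every skip should be logged.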
best regards,
Dominik
Byron Miller wrote:
060111 103432 reduce > reduce
060111 103432 Optimizing index.
060111 103433 closing > reduce
060111 103434 closing > reduce
060111 103435 closing > reduce
java.lang.NullPointerException: value cannot be null
    at org.apache.lucene.document.Field.<init>(Field.java:469)
    at org.apache.lucene.document.Field.<init>(Field.java:412)
    at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
    at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
    at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
[EMAIL PROTECTED]:/data/nutch/trunk$
Pulled today's build and got the above error. No problems with
running out of disk space or anything like that. This is a single
instance on local file systems.
Any way to recover the crawl / finish the reduce job from where it
failed?