Hi,

A very similar exception occurs while indexing a page that has no body content (and sometimes no title):
051223 194717 Optimizing index.
java.lang.NullPointerException
        at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
        at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
        at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)

Looking into the source code of BasicIndexingFilter, it is trying to do:

    doc.add(Field.UnStored("content", parse.getText()));

I guess adding a null check on the parse object, if (parse != null), should
solve the problem. I can confirm it works when tested locally.

Thanks,
P

--- Lukas Vlcek <[EMAIL PROTECTED]> wrote:

> Hi,
> I am facing this error as well. I have now located one particular
> document which is causing it (an MS Word document which can't be
> properly parsed by the parser). I have sent it to Andrzej in a
> separate email. Let's see if that helps...
> Lukas
>
> On 1/11/06, Dominik Friedrich <[EMAIL PROTECTED]> wrote:
> > I get this exception a lot, too. I haven't tested Andrzej's patch
> > yet, but instead I just put the doc.add() lines in the indexer's
> > reduce function in a try-catch block. This way the indexing
> > finishes even with a null value, and I can see in the log file
> > which documents haven't been indexed.
> >
> > Wouldn't it be a good idea to catch every exception that affects
> > only one document in loops like this? At least I don't like it
> > when an indexing process dies after a few hours because one
> > document triggers such an exception.
> >
> > best regards,
> > Dominik
> >
> > Byron Miller wrote:
> > > 060111 103432 reduce > reduce
> > > 060111 103432 Optimizing index.
> > > 060111 103433 closing > reduce
> > > 060111 103434 closing > reduce
> > > 060111 103435 closing > reduce
> > > java.lang.NullPointerException: value cannot be null
> > >         at org.apache.lucene.document.Field.<init>(Field.java:469)
> > >         at org.apache.lucene.document.Field.<init>(Field.java:412)
> > >         at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
> > >         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
> > >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> > >         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> > > Exception in thread "main" java.io.IOException: Job failed!
> > >         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> > >         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
> > >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
> > > [EMAIL PROTECTED]:/data/nutch/trunk$
> > >
> > > I pulled today's build and got the above error. There were no
> > > problems with running out of disk space or anything like that.
> > > This is a single instance on local file systems.
> > >
> > > Is there any way to recover the crawl / finish the reduce job
> > > from where it failed?
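For reference, the null guard proposed at the top of the thread might look roughly like the following. This is only a sketch using simplified stand-in Parse and Document types (not the real Nutch/Lucene classes, whose exact signatures vary by version); the point is just that the filter skips the field instead of passing null into the Field constructor:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Nutch's Parse object; getText() can return null when
// parsing failed or the page had no body content.
class Parse {
    private final String text;
    Parse(String text) { this.text = text; }
    String getText() { return text; }
}

// Stand-in for the Lucene Document; like Lucene's Field constructor,
// add() rejects null values.
class Document {
    final List<String> fields = new ArrayList<>();
    void add(String name, String value) {
        if (value == null) {
            throw new NullPointerException("value cannot be null");
        }
        fields.add(name + "=" + value);
    }
}

public class BasicIndexingFilterSketch {
    // Guarded version of the filter logic: only add the content field
    // when both the parse object and its text are non-null.
    static Document filter(Document doc, Parse parse) {
        if (parse != null && parse.getText() != null) {
            doc.add("content", parse.getText());
        }
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(filter(new Document(), new Parse("hello")).fields.size()); // 1
        System.out.println(filter(new Document(), new Parse(null)).fields.size());    // 0
        System.out.println(filter(new Document(), null).fields.size());               // 0
    }
}
```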
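Dominik's per-document try-catch idea can be sketched the same way. Again, the types here are simplified stand-ins and not Nutch's real Indexer.reduce() signature; the idea is that a failure on one document is logged and skipped rather than aborting the whole job:

```java
import java.util.Arrays;
import java.util.List;

public class TolerantReduceSketch {
    static int indexed = 0;
    static int skipped = 0;

    // Stand-in for the doc.add() calls that can throw on null values.
    static void addDoc(String text) {
        if (text == null) {
            throw new NullPointerException("value cannot be null");
        }
        indexed++;
    }

    // Simplified reduce loop: catch per-document failures inside the loop
    // so one bad record cannot kill the task after hours of work.
    static void reduce(List<String> docs) {
        for (String text : docs) {
            try {
                addDoc(text);
            } catch (RuntimeException e) {
                // Log and continue; the log then shows exactly which
                // documents were not indexed.
                skipped++;
                System.err.println("skipping unindexable document: " + e);
            }
        }
    }

    public static void main(String[] args) {
        reduce(Arrays.asList("a", null, "b"));
        System.out.println(indexed + " indexed, " + skipped + " skipped"); // 2 indexed, 1 skipped
    }
}
```

The trade-off is that swallowing exceptions can hide systemic problems, so logging each skipped document (as Dominik does) is what keeps the failure visible.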