Hi,

   A very similar exception occurs while indexing a
page that has no body content (and sometimes no
title).

051223 194717 Optimizing index.
java.lang.NullPointerException
        at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
        at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
        at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
        at

 Looking into the source code of BasicIndexingFilter,
the failing line is:

doc.add(Field.UnStored("content", parse.getText()));

I guess adding a null check on the parse object,
e.g. if (parse != null), should solve the problem.
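
Something along these lines (just a rough sketch from memory, so the
exact code around line 75 may look a bit different in your checkout):

    // In BasicIndexingFilter.filter(): only add the content field
    // when the parse text is actually there, instead of passing null
    // to Lucene's Field constructor.
    if (parse != null && parse.getText() != null) {
      doc.add(Field.UnStored("content", parse.getText()));
    }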

I can confirm the fix works when tested locally.
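
As for Dominik's try-catch idea below, something like this around the
doc.add() calls in Indexer.reduce() would keep a single bad document
from killing the whole job (again just a sketch -- the variable names
are made up, and the real reduce() builds the document from several
fields):

    try {
      // build and collect the Lucene document as reduce() does today
      doc.add(Field.UnStored("content", parse.getText()));
    } catch (Exception e) {
      // log which document was skipped and carry on with the rest;
      // assumes the java.util.logging LOG these classes already use
      LOG.warning("Skipping unindexable document " + key + ": " + e);
      return;
    }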

Thanks
P




--- Lukas Vlcek <[EMAIL PROTECTED]> wrote:

> Hi,
> I am facing this error as well. I have now located one particular
> document which is causing it (an MS Word document which can't be
> properly parsed by the parser). I have sent it to Andrzej in a
> separate email. Let's see if that helps...
> Lukas
> 
> On 1/11/06, Dominik Friedrich <[EMAIL PROTECTED]> wrote:
> > I got this exception a lot, too. I haven't tested the patch by
> > Andrzej yet; instead I just put the doc.add() lines in the indexer
> > reduce function in a try-catch block. This way the indexing
> > finishes even with a null value, and I can see in the log file
> > which documents haven't been indexed.
> >
> > Wouldn't it be a good idea to catch every exception that only
> > affects one document in loops like this? At least I don't like it
> > when an indexing process dies after a few hours because one
> > document triggers such an exception.
> >
> > best regards,
> > Dominik
> >
> > Byron Miller wrote:
> > > 060111 103432 reduce > reduce
> > > 060111 103432 Optimizing index.
> > > 060111 103433 closing > reduce
> > > 060111 103434 closing > reduce
> > > 060111 103435 closing > reduce
> > > java.lang.NullPointerException: value cannot be null
> > >         at org.apache.lucene.document.Field.<init>(Field.java:469)
> > >         at org.apache.lucene.document.Field.<init>(Field.java:412)
> > >         at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
> > >         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
> > >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> > >         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> > > Exception in thread "main" java.io.IOException: Job failed!
> > >         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> > >         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
> > >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
> > > [EMAIL PROTECTED]:/data/nutch/trunk$
> > >
> > >
> > > Pulled today's build and got the above error. No problems
> > > running out of disk space or anything like that. This is a
> > > single instance, local file systems.
> > >
> > > Any way to recover the crawl / finish the reduce job from where
> > > it failed?
> > >
> > >
> > >
> >
> >
> >
> 

