Hi,

   You are right, the Parse object is not null even though the
page has no content and no title.

   Could it be the FetcherOutput object?
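
Since Parse itself is non-null, a plain if (parse != null) check alone
probably wouldn't help; the guard would also have to treat empty text
as missing. A rough, untested sketch of what I mean inside
BasicIndexingFilter.filter() (the ParseData.getTitle() part is my
assumption about where the title comes from):

    // Untested sketch: skip a field when its value is null OR empty,
    // since the Parse object itself can be non-null here.
    String text = parse.getText();
    if (text != null && text.length() > 0) {
      doc.add(Field.UnStored("content", text));
    }
    String title = parse.getData().getTitle();
    if (title != null && title.length() > 0) {
      doc.add(Field.Text("title", title));
    }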

     
P

--- Lukas Vlcek <[EMAIL PROTECTED]> wrote:

> Hi,
> I think this issue may be more complex. If I remember my test
> correctly, the parse object was not null. Also, parse.getText() was
> not null (it just contained an empty String).
> If a document is not parsed correctly, then an "empty" parse is
> returned instead (parseStatus.getEmptyParse()), which should be OK,
> but I didn't have a chance to check whether this can cause any
> trouble during index optimization.
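> If that empty parse is the culprit, checking the parse status before
> adding any fields might catch it. Just a sketch (untested; it assumes
> parse.getData().getStatus() is accessible at that point):
>
>     // Sketch: an "empty" parse carries a failure status and an empty
>     // (not null) text, so test the status rather than the text.
>     ParseStatus status = parse.getData().getStatus();
>     if (status != null && !status.isSuccess()) {
>       return doc;   // leave this document without content/title fields
>     }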
> Lukas
> 
> On 1/12/06, Pashabhai <[EMAIL PROTECTED]>
> wrote:
> > Hi,
> >
> >    A very similar exception occurs while indexing a
> > page which does not have body content (and sometimes
> > no title).
> >
> > 051223 194717 Optimizing index.
> > java.lang.NullPointerException
> >         at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
> >         at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
> >         at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
> >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> >         at
> >
> >  Looking into the source code of BasicIndexingFilter,
> > it is trying to do:
> >
> >     doc.add(Field.UnStored("content", parse.getText()));
> >
> > I guess adding a null check on the parse object,
> > if (parse != null), should solve the problem.
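> > Something like this (an untested sketch of the patch I have in
> > mind, around BasicIndexingFilter.java:75):
> >
> >     // Sketch: only add the content field when the parse and its
> >     // text are actually there.
> >     if (parse != null && parse.getText() != null) {
> >       doc.add(Field.UnStored("content", parse.getText()));
> >     }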
> >
> > I will confirm once I have tested it locally.
> >
> > Thanks
> > P
> >
> >
> >
> >
> > --- Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > > I am facing this error as well. Now I have located one
> > > particular document which is causing it (it is an MS Word
> > > document which can't be properly parsed by the parser). I have
> > > sent it to Andrzej in a separate email. Let's
> > > see if that helps...
> > > Lukas
> > >
> > > On 1/11/06, Dominik Friedrich
> > > <[EMAIL PROTECTED]> wrote:
> > > > I got this exception a lot, too. I haven't tested the patch by
> > > > Andrzej yet, but instead I just put the doc.add() lines in the
> > > > indexer reduce function in a try-catch block. This way the
> > > > indexing finishes even with a null value, and I can see in the
> > > > log file which documents haven't been indexed.
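> > > > Roughly like this (a sketch of the idea, not my exact code;
> > > > "key" and "LOG" stand in for whatever the reduce method has in
> > > > scope):
> > > >
> > > >     // Sketch: one bad document should not kill the whole job;
> > > >     // log it and move on to the next one.
> > > >     try {
> > > >       doc.add(Field.UnStored("content", parse.getText()));
> > > >     } catch (Exception e) {
> > > >       LOG.warning("skipping document " + key + ": " + e);
> > > >     }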
> > > >
> > > > Wouldn't it be a good idea to catch every exception that only
> > > > affects one document in loops like this? At least I don't like
> > > > it when an indexing process dies after a few hours because one
> > > > document triggers such an exception.
> > > >
> > > > best regards,
> > > > Dominik
> > > >
> > > > Byron Miller wrote:
> > > > > 060111 103432 reduce > reduce
> > > > > 060111 103432 Optimizing index.
> > > > > 060111 103433 closing > reduce
> > > > > 060111 103434 closing > reduce
> > > > > 060111 103435 closing > reduce
> > > > > java.lang.NullPointerException: value cannot be null
> > > > >         at org.apache.lucene.document.Field.<init>(Field.java:469)
> > > > >         at org.apache.lucene.document.Field.<init>(Field.java:412)
> > > > >         at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
> > > > >         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
> > > > >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> > > > >         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> > > > > Exception in thread "main" java.io.IOException: Job failed!
> > > > >         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> > > > >         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
> > > > >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
> > > > > [EMAIL PROTECTED]:/data/nutch/trunk$
> > > > >
> > > > >
> > > > > Pulled today's build and got the above error. No problems
> > > > > running out of disk space or anything like that. This
> > > > > is a single instance on local file systems.
> > > > >
> > > > > Any way to recover the crawl / finish the reduce job from
> > > > > where it failed?
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> 

