Hi,

You are right, the Parse object is not null even though the page has no content or title.
Could it be the FetcherOutput object ???

P

--- Lukas Vlcek <[EMAIL PROTECTED]> wrote:

> Hi,
> I think this issue can be more complex. If I remember my test
> correctly then the parse object was not null. Also parse.getText() was
> not null (it just contained an empty String).
> If a document is not parsed correctly then an "empty" parse is returned
> instead: parseStatus.getEmptyParse(); which should be OK, but I didn't
> have a chance to check if this can cause any trouble during index
> optimization.
> Lukas
>
> On 1/12/06, Pashabhai <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > A very similar exception occurs while indexing a page which does not
> > have body content (and sometimes no title).
> >
> > 051223 194717 Optimizing index.
> > java.lang.NullPointerException
> >     at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
> >     at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
> >     at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
> >     at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> >     at ...
> >
> > Looking into the source code of BasicIndexingFilter, it is trying to
> >     doc.add(Field.UnStored("content", parse.getText()));
> >
> > I guess adding a check for null on the parse object, if (parse != null),
> > should solve the problem.
> >
> > Can confirm when tested locally.
> >
> > Thanks
> > P
> >
> > --- Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > > I am facing this error as well. Now I have located one particular
> > > document which is causing it (it is an MS Word document which can't
> > > be properly parsed by the parser). I have sent it to Andrzej in a
> > > separate email. Let's see if that helps...
> > > Lukas
> > >
> > > On 1/11/06, Dominik Friedrich <[EMAIL PROTECTED]> wrote:
> > > > I got this exception a lot, too. I haven't tested the patch by
> > > > Andrzej yet, but instead I just put the doc.add() lines in the
> > > > indexer reduce function in a try-catch block. This way the indexing
> > > > finishes even with a null value and I can see which documents
> > > > haven't been indexed in the log file.
> > > >
> > > > Wouldn't it be a good idea to catch every exception that only
> > > > affects one document in loops like this? At least I don't like it
> > > > if an indexing process dies after a few hours because one document
> > > > triggers such an exception.
> > > >
> > > > best regards,
> > > > Dominik
> > > >
> > > > Byron Miller wrote:
> > > > > 060111 103432 reduce > reduce
> > > > > 060111 103432 Optimizing index.
> > > > > 060111 103433 closing > reduce
> > > > > 060111 103434 closing > reduce
> > > > > 060111 103435 closing > reduce
> > > > > java.lang.NullPointerException: value cannot be null
> > > > >     at org.apache.lucene.document.Field.<init>(Field.java:469)
> > > > >     at org.apache.lucene.document.Field.<init>(Field.java:412)
> > > > >     at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
> > > > >     at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
> > > > >     at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> > > > >     at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> > > > > Exception in thread "main" java.io.IOException: Job failed!
> > > > >     at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> > > > >     at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
> > > > >     at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
> > > > > [EMAIL PROTECTED]:/data/nutch/trunk$
> > > > >
> > > > > Pulled today's build and got the above error. No problems with
> > > > > running out of disk space or anything like that. This is a single
> > > > > instance, local file systems.
> > > > >
> > > > > Any way to recover the crawl / finish the reduce job from where
> > > > > it failed?
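For illustration, here is a minimal sketch of the guard Pashabhai suggests above, combined with Lukas's point that parse.getText() may come back as an empty String rather than null. Only the doc.add(Field.UnStored("content", parse.getText())) line comes from the quoted BasicIndexingFilter code; the class name and the addContent() helper and its signature are assumptions made for the sake of a self-contained example, and the Field factory methods are the Lucene 1.x API that appears in the stack traces above.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.nutch.parse.Parse;

    public class SafeContentField {

        // Add the "content" field only when the parse produced usable text,
        // so Lucene's "value cannot be null" check in Field is never hit.
        public static void addContent(Document doc, Parse parse) {
            if (parse == null) {
                return;                     // nothing was parsed at all
            }
            String text = parse.getText();  // may legitimately be an empty String
            if (text == null) {
                text = "";                  // never hand Lucene a null value
            }
            doc.add(Field.UnStored("content", text));
        }
    }

Indexing an empty string keeps the document in the index (so its URL and anchors remain searchable) while avoiding the NullPointerException during the reduce/optimize step.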
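Similarly, a rough sketch of the per-document try/catch that Dominik describes: wrap the doc.add() calls for one record so a NullPointerException is logged and that record is skipped, instead of aborting the whole reduce. The TolerantDocBuilder class, the field names, and the System.err logging are placeholders, not the actual Indexer.reduce() code.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class TolerantDocBuilder {

        // Build the Lucene document for one record; log and return null instead
        // of letting a NullPointerException kill the whole reduce task.
        public static Document buildOrSkip(String url, String title, String content) {
            try {
                Document doc = new Document();
                doc.add(Field.UnIndexed("url", url));
                doc.add(Field.Text("title", title));
                doc.add(Field.UnStored("content", content));
                return doc;
            } catch (RuntimeException e) {
                System.err.println("Skipping " + url + ": " + e);
                return null;                // caller simply moves on to the next record
            }
        }
    }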