Re: Problem with latest SVN during reduce phase
Hi,
I am facing this error as well. I have now located one particular document which is causing it (an MS Word document which can't be properly parsed by the parser). I have sent it to Andrzej in a separate email. Let's see if that helps...
Lukas

On 1/11/06, Dominik Friedrich <[EMAIL PROTECTED]> wrote:
> I got this exception a lot, too. I haven't tested the patch by Andrzej
> yet, but instead I just put the doc.add() lines in the indexer reduce
> function in a try-catch block. This way the indexing finishes even with
> a null value, and I can see which documents haven't been indexed in the
> log file.
>
> Wouldn't it be a good idea to catch every exception that only affects
> one document in loops like this? At least I don't like it if an indexing
> process dies after a few hours because one document triggers such an
> exception.
>
> best regards,
> Dominik
>
> Byron Miller wrote:
> > 060111 103432 reduce > reduce
> > 060111 103432 Optimizing index.
> > 060111 103433 closing > reduce
> > 060111 103434 closing > reduce
> > 060111 103435 closing > reduce
> > java.lang.NullPointerException: value cannot be null
> >         at org.apache.lucene.document.Field.<init>(Field.java:469)
> >         at org.apache.lucene.document.Field.<init>(Field.java:412)
> >         at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
> >         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
> >         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
> >         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> >         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
> > [EMAIL PROTECTED]:/data/nutch/trunk$
> >
> > Pulled today's build and got the above error. No problems
> > running out of disk space or anything like that. This
> > is a single instance, local file systems.
> >
> > Any way to recover the crawl/finish the reduce job from
> > where it failed?
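A sketch of Dominik's workaround: wrapping the doc.add() calls in Indexer.reduce() in a try-catch so that one bad document is skipped instead of failing the whole job. The variable names here (doc, parse, key, LOG) are assumptions based on the stack trace above, not the actual trunk code.

    // Inside Indexer.reduce(), where the Lucene Document is assembled.
    // Variable names are assumptions based on the stack trace above.
    try {
      doc.add(Field.UnIndexed("segment", segmentName));
      doc.add(Field.UnStored("content", parse.getText()));
    } catch (RuntimeException e) {
      // Log the offending key and skip the document rather than let
      // one null field value kill the whole reduce task.
      LOG.warning("Skipping document " + key + ": " + e);
      return;
    }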
Re: Problem with latest SVN during reduce phase
Hi,
A very similar exception occurs while indexing a page which does not have body content (and sometimes no title).

051223 194717 Optimizing index.
java.lang.NullPointerException
        at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
        at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
        at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)

Looking into the source code of BasicIndexingFilter, it is trying to do:
doc.add(Field.UnStored("content", parse.getText()));

I guess adding a null check on the parse object, if (parse != null), should solve the problem. I can confirm once I have tested locally.

Thanks
P

--- Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> Hi,
> I am facing this error as well. I have now located one particular
> document which is causing it (an MS Word document which can't be
> properly parsed by the parser). I have sent it to Andrzej in a
> separate email. Let's see if that helps...
> Lukas
> [...]
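The guard suggested above might look like the following in BasicIndexingFilter.filter(). Whether this matches line 75 in trunk exactly, and the extra check on the text value, are assumptions.

    // Sketch of the suggested guard around the doc.add() call in
    // BasicIndexingFilter.filter(). The null check on the text itself
    // is an extra precaution beyond the parse check suggested above.
    if (parse != null && parse.getText() != null) {
      doc.add(Field.UnStored("content", parse.getText()));
    }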
Re: Problem with latest SVN during reduce phase
Hi,
I think this issue may be more complex. If I remember my test correctly, the parse object was not null. Also, parse.getText() was not null (it just contained an empty String).
If a document is not parsed correctly, then an "empty" parse is returned instead (parseStatus.getEmptyParse()), which should be OK, but I didn't have a chance to check whether this can cause any trouble during index optimization.
Lukas

On 1/12/06, Pashabhai <[EMAIL PROTECTED]> wrote:
> Hi,
>
> A very similar exception occurs while indexing a page which does not
> have body content (and sometimes no title).
>
> Looking into the source code of BasicIndexingFilter, it is trying to do:
> doc.add(Field.UnStored("content", parse.getText()));
>
> I guess adding a null check on the parse object, if (parse != null),
> should solve the problem. I can confirm once I have tested locally.
> [...]
Speed up searching
Dear Developers,
I think this great improvement is missing from the latest Nutch/Lucene nightly build:
http://issues.apache.org/jira/browse/LUCENE-443
Best Regards,
Ferenc
NutchQuery adding non required Terms
Hi,
I would love to build a Nutch Query object via the API rather than the QueryParser. In my case I need the complete set of boolean operators in the query: required (AND), non-required (OR), and prohibited (NOT) terms.
I notice that in general it should be possible to add such a clause to the Query object, since the basic query filter just copies the isRequired and isProhibited parameters. However, the clauses ArrayList is private, and there is no method in the Nutch Query object that allows adding custom terms or clauses with isRequired and isProhibited set.
Did I miss something in general to be able to support non-required terms in Nutch? Would people agree to add a little method that allows adding terms with these parameters?
Thanks for any comments.
Stefan
Re: NutchQuery adding non required Terms
Stefan Groschupf wrote:
> Did I miss something in general to be able to support non-required
> terms in Nutch?

I left OR and nesting out of the API to simplify what query filters have to process. Nutch's query features are approximately what Google supported for its first three years. (Google did not add OR until 2000, I think.)

If we permit optional clauses then we need to make sure that each query filter can handle them correctly. For example, the query "+A +B" is translated by query-basic into something like:

+(title:a OR content:a OR anchors:a OR url:a OR host:a)
+(title:b OR content:b OR anchors:b OR url:b OR host:b)
title:"a b"~999 content:"a b"~999 anchors:"a b"~999 url:"a b"~999 host:"a b"~999

The query "+A B" (where B is optional) should remove the plus in the second line above. So it should not be too hard to change query-basic to be able to handle optional terms in the default field. Perhaps that's the only query filter that would need to be updated. And it looks like LuceneQueryOptimizer already checks that filterized clauses are required.

It would be good to have some unit tests for query filtering.

Doug
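In Lucene terms, the change described above amounts to flipping the "required" flag on the second top-level clause. A minimal sketch, with the field expansion cut down to two fields for brevity, using the Lucene 1.x BooleanQuery.add(query, required, prohibited) signature:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class OptionalClauseSketch {
      public static void main(String[] args) {
        // Expansion of term "a" across fields (trimmed to two here).
        BooleanQuery expandA = new BooleanQuery();
        expandA.add(new TermQuery(new Term("title", "a")), false, false);
        expandA.add(new TermQuery(new Term("content", "a")), false, false);

        // Same expansion for term "b".
        BooleanQuery expandB = new BooleanQuery();
        expandB.add(new TermQuery(new Term("title", "b")), false, false);
        expandB.add(new TermQuery(new Term("content", "b")), false, false);

        BooleanQuery q = new BooleanQuery();
        q.add(expandA, true, false);   // +A: required clause
        q.add(expandB, false, false);  // B: optional -- the plus is dropped
        System.out.println(q);
      }
    }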
quit the mailing list
Hi,
How can I unsubscribe from the nutch-dev mailing list?
Thank you!
Sue
MapReduce and segment merging
Is it possible to merge segments in the map reduce version of Nutch?
Re: MapReduce and segment merging
Mike Alulin wrote:
> Is it possible to merge segments in the map reduce version of Nutch?

Not yet.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: MapReduce and segment merging
Then how do people use the new version if they need, let's say, daily crawls of new/updated pages? I crawl updated pages every 24 hours, and if I do not merge the segments I will soon have hundreds of them. What is the best solution in this case?

A full recrawl is not a good option, as I have millions of documents and I DO know which of them were updated without requesting them.

Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Mike Alulin wrote:
> > Is it possible to merge segments in the map reduce version of Nutch?
>
> Not yet.
Where is org.apache.nutch.protocol.http.api.HttpBase?
Hi Guys,
I updated the source code from the SVN head revision just now. However, I cannot find the org.apache.nutch.protocol.http.api.HttpBase class. Did you miss it?
Thanks
/Jack

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: MapReduce and segment merging
Mike Alulin wrote:
> Then how do people use the new version if they need, let's say, daily
> crawls of new/updated pages? I crawl updated pages every 24 hours, and
> if I do not merge the segments I will soon have hundreds of them. What
> is the best solution in this case?
>
> A full recrawl is not a good option, as I have millions of documents
> and I DO know which of them were updated without requesting them.

This is a development version, nobody said it's feature complete. Patience, my friend... or spend some effort to improve it. ;-)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
[jira] Created: (NUTCH-172) Segment merger
Segment merger
--------------

         Key: NUTCH-172
         URL: http://issues.apache.org/jira/browse/NUTCH-172
     Project: Nutch
        Type: New Feature
    Versions: 0.8-dev
 Environment: Any
    Reporter: Mike Alulin

The map reduce version is missing segment merging, which can be very important when one wants to have frequent crawls of updated pages only.
RE: MapReduce and segment merging
Could you also just copy segments out of NDFS to local, perform the merges locally, then copy the segments back into NDFS?

DaveG

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 12, 2006 2:14 PM
To: nutch-dev@lucene.apache.org
Subject: Re: MapReduce and segment merging

This is a development version, nobody said it's feature complete.
Patience, my friend... or spend some effort to improve it. ;-)
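Dave's round trip could be scripted roughly as follows. This is illustration only: the NutchFileSystem method names used here (get, copyToLocalFile, copyFromLocalFile) are assumptions modeled on the local-file API, and the merge step is left abstract since the mapred branch has no segment merge tool yet (see NUTCH-172).

    import java.io.File;
    import org.apache.nutch.fs.NutchFileSystem;

    public class SegmentRoundTrip {
      public static void main(String[] args) throws Exception {
        // Method names on NutchFileSystem are assumptions.
        NutchFileSystem ndfs = NutchFileSystem.get();
        // 1. copy segments out of NDFS to local disk
        ndfs.copyToLocalFile(new File("/user/crawl/segments"),
                             new File("/tmp/segments"));
        // 2. merge the local segments here with whatever tool becomes
        //    available (none exists yet in the mapred branch)
        // 3. copy the merged segment back into NDFS
        ndfs.copyFromLocalFile(new File("/tmp/merged"),
                               new File("/user/crawl/segments-merged"));
      }
    }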
Re: Where is org.apache.nutch.protocol.http.api.HttpBase?
I guess it is in: src/plugin/lib-http/

On 12.01.2006, at 18:06, Jack Tang wrote:

> Hi Guys,
> I updated the source code from the SVN head revision just now.
> However, I cannot find the org.apache.nutch.protocol.http.api.HttpBase
> class. Did you miss it?
> Thanks
> /Jack

---------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
RE: MapReduce and segment merging
I was thinking that Nutch needs some sort of workflow manager. This way you could build jobs from specific workflows and hopefully recover jobs based upon the portion of the workflow where they are stuck (or restart a job if it failed, if processing time > x hours, and other such workflow process rules).

Something like that could also send notifications when jobs are done, trigger other events, and provide a management interface showing what your cluster is up to, or allow configuration types to be defined per batch job/workflow process "in process". For example, if I'm building a blog index I may want more, smaller segments based upon daily fetches, while for other jobs I may want fewer, larger segments.

Does something like that make much sense for where the mapred branch is going? Is "workflow" the right term for such a beast?

-byron

--- "Goldschmidt, Dave" <[EMAIL PROTECTED]> wrote:
> Could you also just copy segments out of NDFS to local, perform the
> merges locally, then copy the segments back into NDFS?
>
> DaveG
> [...]
Re: NutchQuery adding non required Terms
Thanks for the hint.
I would love to add non-required terms and nesting to the Query object API, and I will also provide some unit tests. But since I'm not a javacc geek, it will only extend the Java API, not the query parser. Would such an extension be welcome?
Stefan

On 12.01.2006, at 18:29, Doug Cutting wrote:

> Stefan Groschupf wrote:
> > Did I miss something in general to be able to support non-required
> > terms in Nutch?
>
> I left OR and nesting out of the API to simplify what query filters
> have to process. Nutch's query features are approximately what Google
> supported for its first three years. (Google did not add OR until
> 2000, I think.)
> [...]

---------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Re: NutchQuery adding non required Terms
Stefan Groschupf wrote:
> I would love to add non-required terms and nesting to the Query object
> API, and I will also provide some unit tests. But since I'm not a
> javacc geek, it will only extend the Java API, not the query parser.
> Would such an extension be welcome?

I think we should start with just adding non-required terms, and leave nesting as a subsequent step. I also agree that we can leave this out of the query parser as a start.

Doug
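What the agreed first step might look like on org.apache.nutch.searcher.Query, mirroring the existing addRequiredTerm(); the Clause constructor arguments shown here are assumptions about the internal API, not its actual signatures.

    // Hypothetical addition to org.apache.nutch.searcher.Query. The
    // Clause constructor arguments are assumptions -- the real
    // internal API may differ.
    public void addNonRequiredTerm(String term) {
      // neither required nor prohibited: an optional (OR) clause
      clauses.add(new Clause(new Term(term), false, false));
    }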
[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-----------------------------
    Attachment: build.xml.patch
                urlfilter-whitelist.tar.gz

THIS REPLACES THE PREVIOUS TARBALL. SEE THE INCLUDED README.txt FOR USAGE GUIDELINES.

Place both of these files into ~nutch/src/plugin, then:
- untar the tarball
- apply the patch to ~nutch/src/plugin/build.xml to permit urlfilter-whitelist to be built

Next, cd ~nutch and build ("ant"). A JUnit test is included; it will be run automatically by "ant test-plugins". Then follow the instructions in ~nutch/src/plugin/urlfilter-whitelist/README.txt.

> Efficient site-specific crawling for a large number of sites
> -------------------------------------------------------------
>
>          Key: NUTCH-87
>          URL: http://issues.apache.org/jira/browse/NUTCH-87
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>  Environment: cross-platform
>     Reporter: AJ Chen
>  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch,
>               urlfilter-whitelist.tar.gz
>
> There is a gap between whole-web crawling and single (or handful) site
> crawling. Many applications actually fall in this gap, and usually
> require crawling a large number of selected sites, say 10 domains. The
> current CrawlTool is designed for a handful of sites. So, this request
> calls for a new feature or improvement on CrawlTool so that the "nutch
> crawl" command can efficiently deal with a large number of sites. One
> requirement is to add or change the smallest amount of code so that
> this feature can be implemented sooner rather than later.
>
> There is a discussion about adding a URLFilter to implement this
> requested feature; see the following thread:
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
>
> The idea is to use a hashtable in a URLFilter for looking up the regex
> for any given domain. A hashtable will be much faster than the list
> implementation currently used in RegexURLFilter. Fortunately, Matt
> Kangas has implemented such an idea before for his own application and
> is willing to make it available for adaptation to Nutch. I'll be happy
> to help him in this regard.
>
> But before we do it, we would like to hear more discussions or comments
> about this approach or other approaches. Particularly, let us know what
> the potential downside will be for hashtable lookup in a new URLFilter
> plugin.
> AJ Chen
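The hashtable-lookup idea is easy to sketch against the URLFilter interface. This is only an illustrative outline of the approach, not Matt's actual plugin (see the attached tarball for that); the class name, the host-only matching, and the way the whitelist gets populated are all assumptions here.

    import java.net.URL;
    import java.util.HashSet;
    import org.apache.nutch.net.URLFilter;

    // Illustrative outline only -- not the attached plugin. Hosts are
    // looked up in a HashSet (O(1)) instead of scanning a regex list.
    public class WhitelistURLFilter implements URLFilter {

      private HashSet whitelist = new HashSet();  // host names, loaded from config

      public String filter(String urlString) {
        try {
          String host = new URL(urlString).getHost().toLowerCase();
          // Returning the URL accepts it; returning null rejects it.
          return whitelist.contains(host) ? urlString : null;
        } catch (Exception e) {
          return null;  // malformed URLs are rejected
        }
      }
    }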
[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites
[ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12362584 ]

Matt Kangas commented on NUTCH-87:
----------------------------------
JIRA-87-whitelistfilter.tar.gz is OBSOLETE. Use the newer tarball + patch file instead.
Nutch/Lucene Document Model
Hi all,
I just got my hands dirty with Nutch recently, especially in extending its functionality. I learned that Nutch/Lucene implements its document retrieval model with a TF vector-based approach. I wonder whether other document models, such as the fuzzy set or probabilistic models, are implemented in Nutch/Lucene. The objective of proposing and implementing a number of document models is to enable us to further improve document ranking in Nutch. Please understand that I am not questioning the current Nutch document ranking efficiency; I would just like to see more options in Nutch, especially in how documents are modelled and how well the models perform. Is it worthwhile to move in this direction? Please comment.

Bong Chih How
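For anyone who wants to experiment along these lines, the usual Lucene hook for swapping scoring behavior is the Similarity class. Below is a minimal sketch, assuming the Lucene 1.x Similarity extension point; the binary-TF choice is only an example of departing from the default sqrt(freq) model, not a recommendation of a particular document model.

    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Similarity;

    // Minimal sketch: replace the default TF component with a binary
    // one. This only tweaks one factor of the vector-space model; a
    // true probabilistic model would need deeper changes.
    public class BinaryTfSimilarity extends DefaultSimilarity {
      public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
      }
    }

    // Installed process-wide before indexing and searching:
    // Similarity.setDefault(new BinaryTfSimilarity());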
Re: Problem with latest SVN during reduce phase
Hi,
You are right, the Parse object is not null even when the page has no content and title. Could it be the FetcherOutput object?

P

--- Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> Hi,
> I think this issue may be more complex. If I remember my test
> correctly, the parse object was not null. Also, parse.getText() was
> not null (it just contained an empty String).
> If a document is not parsed correctly, then an "empty" parse is
> returned instead (parseStatus.getEmptyParse()), which should be OK,
> but I didn't have a chance to check whether this can cause any
> trouble during index optimization.
> Lukas
> [...]
java.io.EOFException ... at org.apache.nutch.ndfs.DataNode$DataXceiver.run...
Hi,
I am running mapreduce with 3 machines: one namenode and two datanodes. I am using the latest revision of Nutch 0.8, revision number 368582, and Java version jdk1.5.0_06.

I tried a very simple thing on all three machines, moving a file from local to NDFS:

bin/nutch ndfs -put tmp /user/rafi/tmp10

and I got the same message from all the machines:

060113 011422 Recovered from failed datanode connection

Looking at the log files of the datanodes, I got this message:

060112 212301 39 DataXCeiver
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:380)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:432)
        at java.lang.Thread.run(Thread.java:595)

Am I missing something?

Thanks,
Rafi