the last few lines of hadoop.log:

2012-11-25 16:30:30,021 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-11-25 16:30:30,026 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2012-11-25 16:30:30,218 WARN mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        ... 10 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 13 more
Caused by: java.lang.NullPointerException
        at java.io.Reader.<init>(Reader.java:78)
        at java.io.BufferedReader.<init>(BufferedReader.java:94)
        at java.io.BufferedReader.<init>(BufferedReader.java:109)
        at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
        at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
        at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:57)
        at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
        ... 18 more
2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
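[Editor's note: the root-cause NullPointerException at Reader.&lt;init&gt; in the trace above means readRules() was handed a null Reader, i.e. the file named by urlfilter.regex.file could not be opened as a configuration resource. A minimal Python sketch, not Nutch's actual Java code, of the same rule-file parsing with an explicit check for a missing file; the file path is a placeholder:]

```python
import os
import re

def read_rules(path):
    """Parse Nutch-style regex URL filter rules: each non-comment line is
    '+' (accept) or '-' (reject) followed by a regular expression.
    Unlike the Java code in the trace, fail loudly if the file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError("rules file not found: " + path)
    rules = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            sign, pattern = line[0], line[1:]
            if sign in "+-":
                rules.append((sign, re.compile(pattern)))
    return rules
```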
On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> You should provide the log output.
>
> -----Original message-----
> > From: Joe Zhang <smartag...@gmail.com>
> > Sent: Sun 25-Nov-2012 17:27
> > To: user@nutch.apache.org
> > Subject: Re: Indexing-time URL filtering again
> >
> > I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012.
> >
> > The following command
> >
> > bin/nutch solrindex -Durlfilter.regex.file=.....UrlFiltering.txt
> > http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
> > crawl/segments/* -filter
> >
> > produced the following output:
> >
> > SolrIndexer: starting at 2012-11-25 16:19:29
> > SolrIndexer: deleting gone documents: false
> > SolrIndexer: URL filtering: true
> > SolrIndexer: URL normalizing: false
> > java.io.IOException: Job failed!
> >
> > Can anybody help?
> >
> > On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang <smartag...@gmail.com> wrote:
> >
> > > How exactly do I get to trunk?
> > >
> > > I did download NUTCH-1300-1.5-1.patch, ran the patch command
> > > correctly, and re-built nutch. But the problem still persists...
> > >
> > > On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >
> > >> No, this is no bug. As I said, you need either to patch your Nutch or get
> > >> the sources from trunk. The -filter parameter is not in your version. Check
> > >> the patch manual if you don't know how it works.
> > >>
> > >> $ cd trunk ; patch -p0 < file.patch
> > >>
> > >> -----Original message-----
> > >> > From: Joe Zhang <smartag...@gmail.com>
> > >> > Sent: Sun 25-Nov-2012 08:42
> > >> > To: Markus Jelsma <markus.jel...@openindex.io>; user <user@nutch.apache.org>
> > >> > Subject: Re: Indexing-time URL filtering again
> > >> >
> > >> > This does seem like a bug. Can anybody help?
> > >> >
> > >> > On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang <smartag...@gmail.com> wrote:
> > >> >
> > >> > > Markus, could you advise? Thanks a lot!
> > >> > >
> > >> > > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang <smartag...@gmail.com> wrote:
> > >> > >
> > >> > >> I followed your instruction and applied the patch, Markus, but the
> > >> > >> problem still persists --- "-filter" is interpreted as a path by solrindex.
> > >> > >>
> > >> > >> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >> > >>
> > >> > >>> Ah, I get it now. Please use trunk or patch your version with
> > >> > >>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.
> > >> > >>>
> > >> > >>> -----Original message-----
> > >> > >>> > From: Joe Zhang <smartag...@gmail.com>
> > >> > >>> > Sent: Fri 23-Nov-2012 03:08
> > >> > >>> > To: user@nutch.apache.org
> > >> > >>> > Subject: Re: Indexing-time URL filtering again
> > >> > >>> >
> > >> > >>> > But Markus said it worked for him. I was really hoping he could send his
> > >> > >>> > command line.
> > >> > >>> >
> > >> > >>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> > >> > >>> >
> > >> > >>> > > Is this a bug?
> > >> > >>> > >
> > >> > >>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <smartag...@gmail.com> wrote:
> > >> > >>> > > > Putting -filter between crawldb and segments, I still got the same thing:
> > >> > >>> > > >
> > >> > >>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> > >> > >>> > > > Input path does not exist:
> > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> > >> > >>> > > > Input path does not exist:
> > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> > >> > >>> > > > Input path does not exist:
> > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
> > >> > >>> > > >
> > >> > >>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >> > >>> > > >
> > >> > >>> > > >> These are roughly the available parameters:
> > >> > >>> > > >>
> > >> > >>> > > >> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
> > >> > >>> > > >> [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
> > >> > >>> > > >> [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter]
> > >> > >>> > > >> [-filter] [-normalize]
> > >> > >>> > > >>
> > >> > >>> > > >> Having -filter at the end should work fine. If, for some reason, it
> > >> > >>> > > >> doesn't, put it before the segment and after the crawldb and file an
> > >> > >>> > > >> issue in Jira; it works here if I have -filter at the end.
> > >> > >>> > > >>
> > >> > >>> > > >> Cheers
> > >> > >>> > > >>
> > >> > >>> > > >> -----Original message-----
> > >> > >>> > > >> > From: Joe Zhang <smartag...@gmail.com>
> > >> > >>> > > >> > Sent: Thu 22-Nov-2012 23:05
> > >> > >>> > > >> > To: Markus Jelsma <markus.jel...@openindex.io>; user <user@nutch.apache.org>
> > >> > >>> > > >> > Subject: Re: Indexing-time URL filtering again
> > >> > >>> > > >> >
> > >> > >>> > > >> > Yes, I forgot to do that. But still, what exactly should the command look like?
> > >> > >>> > > >> >
> > >> > >>> > > >> > bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt
> > >> > >>> > > >> > http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
> > >> > >>> > > >> >
> > >> > >>> > > >> > This command would cause nutch to interpret "-filter" as a path.
> > >> > >>> > > >> >
> > >> > >>> > > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >> > >>> > > >> > Hi,
> > >> > >>> > > >> >
> > >> > >>> > > >> > I just tested a small index job that usually writes 1200 records to
> > >> > >>> > > >> > Solr. It works fine if I specify -. in a filter (index nothing) and point
> > >> > >>> > > >> > to it with -Durlfilter.regex.file=path like you do. I assume you mean by
> > >> > >>> > > >> > `it doesn't work` that it filters nothing and indexes all records from the
> > >> > >>> > > >> > segment. Did you forget the -filter parameter?
> > >> > >>> > > >> >
> > >> > >>> > > >> > Cheers
> > >> > >>> > > >> >
> > >> > >>> > > >> > -----Original message-----
> > >> > >>> > > >> > > From: Joe Zhang <smartag...@gmail.com>
> > >> > >>> > > >> > > Sent: Thu 22-Nov-2012 07:29
> > >> > >>> > > >> > > To: user <user@nutch.apache.org>
> > >> > >>> > > >> > > Subject: Indexing-time URL filtering again
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > Dear List:
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > I asked a similar question before, but I haven't solved the problem.
> > >> > >>> > > >> > > Therefore I'll try to re-ask the question more clearly and seek advice.
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> > >> > >>> > > >> > > rudimentary level.
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > The basic problem I face in crawling/indexing is that I need to control
> > >> > >>> > > >> > > which pages the crawlers should VISIT (so far through
> > >> > >>> > > >> > > nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr.
> > >> > >>> > > >> > > The latter are only a SUBSET of the former, and they are giving me a
> > >> > >>> > > >> > > headache.
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > A real-life example would be: when we crawl CNN.com, we only want to index
> > >> > >>> > > >> > > "real content" pages such as
> > >> > >>> > > >> > > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
> > >> > >>> > > >> > > When we start the crawling from the root, we can't specify tight
> > >> > >>> > > >> > > patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*)
> > >> > >>> > > >> > > in nutch/conf/regex-urlfilter.txt, because the pages on the path between
> > >> > >>> > > >> > > the root and the content pages do not satisfy such patterns. Putting such
> > >> > >>> > > >> > > patterns in nutch/conf/regex-urlfilter.txt would severely jeopardize the
> > >> > >>> > > >> > > coverage of the crawl.
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > The closest solution I've got so far (courtesy of Markus) was this:
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > but unfortunately I haven't been able to make it work for me. The content
> > >> > >>> > > >> > > of the urlfilter.regex.file is what I thought "correct" --- something like
> > >> > >>> > > >> > > the following:
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> > >> > >>> > > >> > > -.
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > Everything seems quite straightforward.
> > >> > >>> > > >> > > Am I doing anything wrong here? Can anyone advise? I'd greatly appreciate it.
> > >> > >>> > > >> > >
> > >> > >>> > > >> > > Joe
> > >> > >>> > >
> > >> > >>> > > --
> > >> > >>> > > Lewis
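[Editor's note: the two-rule filter file quoted in the thread (accept the dated CNN pattern, then "-." to reject everything else) relies on first-matching-rule-wins semantics. A rough Python approximation, not Nutch's actual implementation, to sanity-check that pattern against example URLs; the rules are copied verbatim from the thread, including the unescaped dot in cnn.com:]

```python
import re

# First matching rule decides accept/reject; this approximates Nutch's
# RegexURLFilterBase behavior with unanchored matching (an assumption).
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*")),
    ("-", re.compile(r".")),  # the "-." rule: reject anything not accepted above
]

def accept(url):
    for sign, rx in RULES:
        if rx.search(url):
            return sign == "+"
    return False  # no rule matched: the URL is dropped

# A dated article URL passes the first rule; the site root falls through
# to "-." and is rejected, which matches the behavior described in the thread.
print(accept("http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1"))  # True
print(accept("http://www.cnn.com/"))  # False
```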