No, this is not a bug. As I said, you need either to patch your Nutch or get the sources from trunk; the -filter parameter simply is not in your version. Check the patch manual if you don't know how it works:

$ cd trunk ; patch -p0 < file.patch
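Roughly, the whole patch-and-rebuild cycle looks like the sketch below. The patch file name and the directory are placeholders, not taken from this thread (grab the attachment from the NUTCH-1300 issue page and adjust the paths), and it assumes a 1.x source checkout that you rebuild with ant:

$ cd apache-nutch-1.5.1                    # top of the *source* tree (placeholder path), not the binary runtime
$ patch -p0 --dry-run < NUTCH-1300.patch   # first check that every hunk applies cleanly
$ patch -p0 < NUTCH-1300.patch             # apply for real
$ ant runtime                              # rebuild; the patched scripts and job jar end up under runtime/local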
-----Original message-----
> From: Joe Zhang <smartag...@gmail.com>
> Sent: Sun 25-Nov-2012 08:42
> To: Markus Jelsma <markus.jel...@openindex.io>; user <user@nutch.apache.org>
> Subject: Re: Indexing-time URL filtering again
>
> This does seem a bug. Can anybody help?
>
> On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang <smartag...@gmail.com> wrote:
>
>> Markus, could you advise? Thanks a lot!
>>
>> On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang <smartag...@gmail.com> wrote:
>>
>>> I followed your instruction and applied the patch, Markus, but the problem
>>> still persists --- "-filter" is interpreted as a path by solrindex.
>>>
>>> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>
>>>> Ah, I get it now. Please use trunk or patch your version with
>>>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.
>>>>
>>>> -----Original message-----
>>>>> From: Joe Zhang <smartag...@gmail.com>
>>>>> Sent: Fri 23-Nov-2012 03:08
>>>>> To: user@nutch.apache.org
>>>>> Subject: Re: Indexing-time URL filtering again
>>>>>
>>>>> But Markus said it worked for him. I was really hoping he could send his
>>>>> command line.
>>>>>
>>>>> On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>>>>>
>>>>>> Is this a bug?
>>>>>>
>>>>>> On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <smartag...@gmail.com> wrote:
>>>>>>
>>>>>>> Putting -filter between crawldb and segments, I still got the same thing:
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>>>>>>> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
>>>>>>> Input path does not exist:
>>>>>>> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
>>>>>>> Input path does not exist:
>>>>>>> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
>>>>>>> Input path does not exist:
>>>>>>> file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
>>>>>>>
>>>>>>> On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>>
>>>>>>>> These are roughly the available parameters:
>>>>>>>>
>>>>>>>> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
>>>>>>>> [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
>>>>>>>> [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter]
>>>>>>>> [-filter] [-normalize]
>>>>>>>>
>>>>>>>> Having -filter at the end should work fine. If, for some reason, it doesn't,
>>>>>>>> put it before the segment and after the crawldb and file an issue in Jira;
>>>>>>>> it works here if I have -filter at the end.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> -----Original message-----
>>>>>>>>> From: Joe Zhang <smartag...@gmail.com>
>>>>>>>>> Sent: Thu 22-Nov-2012 23:05
>>>>>>>>> To: Markus Jelsma <markus.jel...@openindex.io>; user <user@nutch.apache.org>
>>>>>>>>> Subject: Re: Indexing-time URL filtering again
>>>>>>>>>
>>>>>>>>> Yes, I forgot to do that. But still, what exactly should the command look like?
>>>>>>>>>
>>>>>>>>> bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt
>>>>>>>>> http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
>>>>>>>>>
>>>>>>>>> This command would cause nutch to interpret "-filter" as a path.
>>>>>>>>>
>>>>>>>>> On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I just tested a small index job that usually writes 1200 records to Solr.
>>>>>>>>>> It works fine if I specify -. in a filter (index nothing) and point to it
>>>>>>>>>> with -Durlfilter.regex.file=path like you do. I assume you mean by
>>>>>>>>>> `it doesn't work` that it filters nothing and indexes all records from the
>>>>>>>>>> segment. Did you forget the -filter parameter?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> -----Original message-----
>>>>>>>>>>> From: Joe Zhang <smartag...@gmail.com>
>>>>>>>>>>> Sent: Thu 22-Nov-2012 07:29
>>>>>>>>>>> To: user <user@nutch.apache.org>
>>>>>>>>>>> Subject: Indexing-time URL filtering again
>>>>>>>>>>>
>>>>>>>>>>> Dear List:
>>>>>>>>>>>
>>>>>>>>>>> I asked a similar question before, but I haven't solved the problem.
>>>>>>>>>>> Therefore I'll try to re-ask the question more clearly and seek advice.
>>>>>>>>>>>
>>>>>>>>>>> I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
>>>>>>>>>>> rudimentary level.
>>>>>>>>>>>
>>>>>>>>>>> The basic problem I face in crawling/indexing is that I need to control
>>>>>>>>>>> which pages the crawlers should VISIT (so far through
>>>>>>>>>>> nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr.
>>>>>>>>>>> The latter are only a SUBSET of the former, and they are giving me a
>>>>>>>>>>> headache.
>>>>>>>>>>>
>>>>>>>>>>> A real-life example would be: when we crawl CNN.com, we only want to
>>>>>>>>>>> index "real content" pages such as
>>>>>>>>>>> http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
>>>>>>>>>>> When we start the crawling from the root, we can't specify tight
>>>>>>>>>>> patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*)
>>>>>>>>>>> in nutch/conf/regex-urlfilter.txt, because the pages on the path between
>>>>>>>>>>> root and content pages do not satisfy such patterns. Putting such patterns
>>>>>>>>>>> in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage
>>>>>>>>>>> of the crawl.
>>>>>>>>>>>
>>>>>>>>>>> The closest solution I've got so far (courtesy of Markus) was this:
>>>>>>>>>>>
>>>>>>>>>>> nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>>>>>>>>>>>
>>>>>>>>>>> but unfortunately I haven't been able to make it work for me. The content
>>>>>>>>>>> of the urlfilter.regex.file is what I thought "correct" --- something like
>>>>>>>>>>> the following:
>>>>>>>>>>>
>>>>>>>>>>> +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
>>>>>>>>>>> -.
>>>>>>>>>>>
>>>>>>>>>>> Everything seems quite straightforward. Am I doing anything wrong here?
>>>>>>>>>>> Can anyone advise? I'd greatly appreciate it.
>>>>>>>>>>>
>>>>>>>>>>> Joe
>>>>>>
>>>>>> --
>>>>>> Lewis
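For reference, a sketch of what the end-to-end indexing call can look like once you are on a build that understands -filter. The file and directory names below are placeholders, not taken from this thread; the filter file is the same two-line pattern Joe describes above, and the option order follows the usage Markus posted:

# index-urlfilter.txt (placeholder name): accept CNN article URLs, reject everything else
+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.

# point urlfilter.regex.file at that file and switch filtering on with -filter
$ bin/nutch solrindex -Durlfilter.regex.file=/path/to/index-urlfilter.txt \
    http://localhost:8983/solr/ crawl/crawldb -dir crawl/segments -filter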