Yes, I forgot to do that. But still, what exactly should the command look like?
bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter

This command would cause nutch to interpret "-filter" as a path.

On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi,
>
> I just tested a small index job that usually writes 1200 records to Solr.
> It works fine if i specify -. in a filter (index nothing) and point to it
> with -Durlfilter.regex.file=path like you do. I assume you mean by `it
> doesn't work` that it filters nothing and indexes all records from the
> segment. Did you forget the -filter parameter?
>
> Cheers
>
> -----Original message-----
> > From:Joe Zhang <smartag...@gmail.com>
> > Sent: Thu 22-Nov-2012 07:29
> > To: user <user@nutch.apache.org>
> > Subject: Indexing-time URL filtering again
> >
> > Dear List:
> >
> > I asked a similar question before, but I haven't solved the problem.
> > Therefore I try to re-ask the question more clearly and seek advice.
> >
> > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> > rudimentary level.
> >
> > The basic problem I face in crawling/indexing is that I need to control
> > which pages the crawlers should VISIT (so far through
> > nutch/conf/regex-urlfilter.txt)
> > and which pages are INDEXED by Solr. The latter are only a SUBSET of the
> > former, and they are giving me headache.
> >
> > A real-life example would be: when we crawl CNN.com, we only want to index
> > "real content" pages such as
> > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
> > When we start the crawling from the root, we can't specify tight
> > patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*)
> > in nutch/conf/regex-urlfilter.txt, because the pages on the path between
> > root and content pages do not satisfy such patterns. Putting such patterns
> > in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage
> > of the crawl.
> >
> > The closest solution I've got so far (courtesy of Markus) was this:
> >
> > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
> >
> > but unfortunately I haven't been able to make it work for me. The content
> > of the urlfilter.regex.file is what I thought "correct" --- something like
> > the following:
> >
> > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> > -.
> >
> > Everything seems quite straightforward. Am I doing anything wrong here? Can
> > anyone advise? I'd greatly appreciate.
> >
> > Joe
> >
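
For checking the two rules quoted above outside of Nutch, here is a minimal Python sketch. It assumes the regex URL filter tries rules top to bottom, lets the first pattern found anywhere in the URL decide (+ keeps the URL, - drops it), and drops a URL that matches no rule. It is only an approximation for testing patterns offline, not Nutch code: parse_rules, is_indexed and the second sample URL are hypothetical, and Python's regex dialect is standing in for Java's.

```python
import re

# The two rules from the filter file quoted in the original message.
RULES_TEXT = r"""
+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_indexed(url, rules):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped


rules = parse_rules(RULES_TEXT)

# Content page taken from the original message.
content_url = "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1"
# Hypothetical section page on the path between the root and content pages.
section_url = "http://www.cnn.com/US/"

print(is_indexed(content_url, rules))  # True
print(is_indexed(section_url, rules))  # False
```

If the rules behave this way in the sketch but every record from the segment is still indexed, the rules file itself is probably not the problem, and the question is whether the filter is being applied at index time at all.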