These are roughly the available parameters:

Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
       [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
       [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter]
       [-filter] [-normalize]

Having -filter at the end should work fine. If, for some reason, it doesn't
work, put it before the segment and after the crawldb, and file an issue in
Jira; it works here when I have -filter at the end.
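
For example, a rough sketch with placeholder paths (substitute your own Solr
URL, filter file, crawldb and segments):

bin/nutch solrindex -Durlfilter.regex.file=/path/to/UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb crawl/segments/* -filter

and, if the trailing -filter is misparsed, the same command with -filter moved
between the crawldb and the segments:

bin/nutch solrindex -Durlfilter.regex.file=/path/to/UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb -filter crawl/segments/*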

Cheers
 
-----Original message-----
> From:Joe Zhang <smartag...@gmail.com>
> Sent: Thu 22-Nov-2012 23:05
> To: Markus Jelsma <markus.jel...@openindex.io>; user <user@nutch.apache.org>
> Subject: Re: Indexing-time URL filtering again
> 
> Yes, I forgot to do that. But still, what exactly should the command look 
> like?
>  
> bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
> 
> This command causes nutch to interpret "-filter" as a path.
> 
> On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hi,
> 
> I just tested a small index job that usually writes 1200 records to Solr. It
> works fine if I specify -. in a filter (index nothing) and point to it with
> -Durlfilter.regex.file=path like you do. I assume that by `it doesn't work`
> you mean it filters nothing and indexes all records from the segment. Did
> you forget the -filter parameter?
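> 
> As a rough illustration of that test (paths are placeholders), the filter
> file contained nothing but the reject-everything rule, and the command was
> the usual one:
> 
>   echo '-.' > /tmp/index-nothing.txt
>   bin/nutch solrindex -Durlfilter.regex.file=/tmp/index-nothing.txt http://localhost:8983/solr/ crawl/crawldb crawl/segments/* -filter
> 
> With -filter present nothing gets indexed; without it, all 1200 records end
> up in Solr.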
> 
> Cheers
> 
> -----Original message-----
> > From:Joe Zhang <smartag...@gmail.com>
> > Sent: Thu 22-Nov-2012 07:29
> > To: user <user@nutch.apache.org>
> > Subject: Indexing-time URL filtering again
> >
> > Dear List:
> >
> > I asked a similar question before, but I haven't solved the problem, so
> > I'll try to restate it more clearly here and ask for advice.
> >
> > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at a
> > rudimentary level.
> >
> > The basic problem I face in crawling/indexing is that I need to control
> > which pages the crawlers should VISIT (so far through
> > nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr. The
> > latter are only a SUBSET of the former, and they are what is giving me a
> > headache.
> >
> > A real-life example: when we crawl CNN.com, we only want to index "real
> > content" pages such as
> > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
> > When we start the crawl from the root, we can't specify tight patterns
> > (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*) in
> > nutch/conf/regex-urlfilter.txt, because the pages on the path between the
> > root and the content pages do not satisfy such patterns. Putting such
> > patterns in nutch/conf/regex-urlfilter.txt would severely jeopardize the
> > coverage of the crawl.
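> >
> > (To make the split concrete, a rough, purely illustrative sketch: keep the
> > crawl-time filter in nutch/conf/regex-urlfilter.txt broad, e.g.
> >
> >   +^http://([a-z0-9]*\.)*cnn\.com/
> >   -.
> >
> > so that section and hub pages are still fetched and their links followed,
> > and put the tight date-based pattern only in a separate file used at
> > indexing time, as described below.)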
> >
> > The closest solution I've got so far (courtesy of Markus) was this:
> >
> > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
> >
> > but unfortunately I haven't been able to make it work for me. The content
> > of the urlfilter.regex.file is what I thought was "correct" --- something
> > like the following:
> >
> > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> > -.
> >
> > Everything seems quite straightforward. Am I doing anything wrong here? Can
> > anyone advise? I'd greatly appreciate it.
> >
> > Joe
> >
> 
> 
