Re: Indexing-time URL filtering again

Joe Zhang Thu, 22 Nov 2012 14:00:24 -0800

Yes, I forgot to do that. But still, what exactly should the command look
like?


bin/nutch solrindex  -Durlfilter.regex.file=....UrlFiltering.txt
http://localhost:8983/solr/ .../crawldb/ ..../segments/*  -filter
This command would cause Nutch to interpret "-filter" as a path.

On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> Hi,
>
> I just tested a small index job that usually writes 1200 records to Solr.
> It works fine if I specify -. in a filter (index nothing) and point to it
> with -Durlfilter.regex.file=path like you do.  I assume you mean by `it
> doesn't work` that it filters nothing and indexes all records from the
> segment. Did you forget the -filter parameter?
>
> Cheers
>
> -----Original message-----
> > From: Joe Zhang <smartag...@gmail.com>
> > Sent: Thu 22-Nov-2012 07:29
> > To: user <user@nutch.apache.org>
> > Subject: Indexing-time URL filtering again
> >
> > Dear List:
> >
> > I asked a similar question before, but I haven't solved the problem.
> > Therefore I am trying to re-ask the question more clearly and seek advice.
> >
> > I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the
> > rudimentary level.
> >
> > The basic problem I face in crawling/indexing is that I need to control
> > which pages the crawlers should VISIT (so far through
> > nutch/conf/regex-urlfilter.txt)
> > and which pages are INDEXED by Solr. The latter are only a SUBSET of the
> > former, and they are giving me a headache.
> >
> > A real-life example would be: when we crawl CNN.com, we only want to index
> > "real content" pages such as
> > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
> > When we start the crawling from the root, we can't specify tight
> > patterns (e.g.,
> > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*) in
> > nutch/conf/regex-urlfilter.txt, because the pages on the path between
> > root and content pages do not satisfy such patterns. Putting such patterns
> > in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage
> > of the crawl.
> >
> > The closest solution I've got so far (courtesy of Markus) was this:
> >
> > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
> >
> > but unfortunately I haven't been able to make it work for me. The content
> > of the urlfilter.regex.file is what I thought "correct" --- something like
> > the following:
> >
> > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> > -.
> >
> > Everything seems quite straightforward. Am I doing anything wrong here? Can
> > anyone advise? I'd greatly appreciate it.
> >
> > Joe
> >
>
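
One way to rule out the filter file itself as the culprit is to evaluate the two quoted rules outside Nutch. The sketch below is only an approximation, under stated assumptions: it assumes the regex URL filter applies rules top to bottom, that the first rule whose pattern is found anywhere in the URL decides the outcome ("+" keeps, "-" drops), and that a URL matching no rule is dropped. Python's re module stands in for Java's regex engine here, which should behave the same for these particular patterns. The helper names parse_rules and accepts are made up for illustration and are not part of Nutch.

```python
import re

# The two rules quoted in the original message, copied verbatim.
RULES_TEXT = r"""
+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs.

    Hypothetical helper: blank lines and '#' comments are skipped.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped


rules = parse_rules(RULES_TEXT)

article = "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1"
front_page = "http://www.cnn.com/"

print(accepts(rules, article))     # True: the dated-article rule matches first
print(accepts(rules, front_page))  # False: only the catch-all "-." rule matches
```

If the rules behave as expected in a check like this, the filter file is probably fine and the remaining question is whether the indexing job applies URL filters at all, which is what the -filter discussion above is about.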
