The problem is that,

- if you write a regex such as +^http://([a-z0-9]*\.)*mysite.com, you'll end
up indexing all the pages along the way, not just the leaf pages.
- if you write a specific regex for
http://www.mysite.com/level1pattern/level2pattern/pagepattern.html and you
start crawling at mysite.com, you'll get zero results, because the start page
itself does not match.
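
To make the two cases concrete, here is a rough Python sketch. It is not Nutch
code: the accepts() helper and the sample URLs are made up for illustration,
and it assumes the filter applies the first rule whose pattern is found in the
URL and drops URLs that match no rule.

```python
import re

def accepts(rules, url):
    """Hypothetical stand-in for a URL filter: first matching rule decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped

# The two rule sets from the bullets above.
broad = [("+", r"^http://([a-z0-9]*\.)*mysite.com")]
narrow = [("+", r"^http://www\.mysite\.com/level1pattern/level2pattern/pagepattern\.html")]

# Hypothetical URLs: the start page, an intermediate page, and a leaf page.
seed = "http://www.mysite.com/"
hub = "http://www.mysite.com/level1pattern/"
leaf = "http://www.mysite.com/level1pattern/level2pattern/pagepattern.html"

broad_result = {u: accepts(broad, u) for u in (seed, hub, leaf)}
narrow_result = {u: accepts(narrow, u) for u in (seed, hub, leaf)}

for url in (seed, hub, leaf):
    print(url, "broad:", broad_result[url], "narrow:", narrow_result[url])
```

With the broad rule every page passes, including the intermediate ones. With
the narrow rule the start page is rejected, so the crawl never gets far enough
to discover the leaf pages.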

On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> -----Original message-----
> > From: Joe Zhang <smartag...@gmail.com>
> > Sent: Fri 02-Nov-2012 10:04
> > To: user@nutch.apache.org
> > Subject: URL filtering: crawling time vs. indexing time
> >
> > I feel like this is a trivial question, but I just can't get my head
> > around it.
> >
> > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> > rudimentary level.
> >
> > If my understanding is correct, the regexes in
> > nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
> > URLs to visit or not in the crawling process.
>
> Yes.
>
> >
> > On the other hand, it doesn't seem artificial for us to only want certain
> > pages to be indexed. I was hoping to write some regular expressions as well
> > in some config file, but I just can't find the right place. My hunch tells
> > me that such things should not require into-the-box coding. Can anybody
> > help?
>
> What exactly do you want? Add your custom regular expressions? The
> regex-urlfilter.txt is the place to write them to.
>
> >
> > Again, the scenario is really rather generic. Let's say we want to crawl
> > http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and
> > unnecessary file types etc., but only expect to index pages with URLs like:
> > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
>
> To do this you must simply make sure your regular expressions can do this.
>
> >
> > Am I too naive to expect zero Java coding in this case?
>
> No, you can achieve almost all kinds of exotic filtering with just the URL
> filters and the regular expressions.
>
> Cheers
> >
>
