The problem is this:

- If you write a broad regex such as +^http://([a-z0-9]*\.)*mysite.com,
  you'll end up indexing all the pages on the way, not just the leaf pages.
- If you write a specific regex for
  http://www.mysite.com/level1pattern/level2pattern/pagepattern.html
  and you start crawling at mysite.com, you'll get zero results, because
  the start URL itself does not match.
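
To make the two cases above concrete, here is a rough Python sketch of how I understand the filter to behave. It is not Nutch code: make_filter and the sample URLs are made up for illustration, and it assumes the rules in regex-urlfilter.txt are tried in order, the first matching rule decides, the match is unanchored unless the pattern says otherwise, and a URL that matches no rule is dropped.

```python
import re


def make_filter(rules):
    """Build an accept/reject function from '+regex' / '-regex' lines.

    Hypothetical helper mimicking the assumed first-match-wins behaviour
    of regex-urlfilter.txt; not part of Nutch.
    """
    compiled = [(line[0] == "+", re.compile(line[1:])) for line in rules]

    def accepts(url):
        for allow, pattern in compiled:
            if pattern.search(url):  # unanchored search (assumption)
                return allow
        return False  # no rule matched: URL is dropped (assumption)

    return accepts


# Case 1: the broad rule from the message above.
broad = make_filter([r"+^http://([a-z0-9]*\.)*mysite.com"])

# Case 2: a rule that only accepts the leaf page pattern.
narrow = make_filter(
    [r"+^http://www\.mysite\.com/level1pattern/level2pattern/pagepattern\.html"]
)

urls = [
    "http://www.mysite.com/",               # seed
    "http://www.mysite.com/level1pattern/", # intermediate page
    "http://www.mysite.com/level1pattern/level2pattern/pagepattern.html",
]

for url in urls:
    print(url, "broad:", broad(url), "narrow:", narrow(url))
```

With the broad rule every URL passes, so the intermediate pages get fetched and indexed too. With the narrow rule the seed is rejected, so under these assumptions the crawl never reaches the leaf page at all.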

On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> -----Original message-----
> > From: Joe Zhang <smartag...@gmail.com>
> > Sent: Fri 02-Nov-2012 10:04
> > To: user@nutch.apache.org
> > Subject: URL filtering: crawling time vs. indexing time
> >
> > I feel like this is a trivial question, but I just can't get my head
> > around it.
> >
> > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> > rudimentary level.
> >
> > If my understanding is correct, the regex-es in
> > nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
> > URLs to visit or not in the crawling process.
>
> Yes.
>
> > On the other hand, it doesn't seem artificial for us to only want certain
> > pages to be indexed. I was hoping to write some regular expressions as
> > well in some config file, but I just can't find the right place. My hunch
> > tells me that such things should not require into-the-box coding. Can
> > anybody help?
>
> What exactly do you want? Add your custom regular expressions? The
> regex-urlfilter.txt is the place to write them to.
>
> > Again, the scenario is really rather generic. Let's say we want to crawl
> > http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops
> > and unnecessary file types etc., but only expect to index pages with URLs
> > like:
> > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
>
> To do this you must simply make sure your regular expressions can do this.
>
> > Am I too naive to expect zero Java coding in this case?
>
> No, you can achieve almost all kinds of exotic filtering with just the URL
> filters and the regular expressions.
>
> Cheers