I feel like this is a trivial question, but I just can't get my head around it.
I'm using Nutch 1.5.1 and Solr 3.6.1 together, and things work fine at the rudimentary level. If my understanding is correct, the regexes in nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which URLs the crawler does or does not visit. On the other hand, it doesn't seem unusual to want only certain pages to be indexed. I was hoping I could write some regular expressions for that as well in some config file, but I just can't find the right place. My hunch is that this shouldn't require digging into the code. Can anybody help?

Again, the scenario is really rather generic. Let's say we want to crawl http://www.mysite.com. We can use regex-urlfilter.txt to skip loops, unnecessary file types, etc., but we only want to index pages with URLs like http://www.mysite.com/level1pattern/level2pattern/pagepattern.html. Am I too naive to expect zero Java coding in this case?
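To make the distinction concrete, here is roughly what my crawl-time filter looks like, plus the kind of rule I'd like to express somewhere else so it applies only at indexing time (the mysite.com patterns are just placeholders for my real site):

```
# conf/regex-urlfilter.txt -- crawl-time filtering (this part works today)

# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):

# skip URLs containing characters that usually indicate loops or query pages
-[?*!@=]

# follow everything else under the site so the crawler can reach deep pages
+^http://www\.mysite\.com/

# What I'd like to say in some OTHER config, applied only when indexing:
# "index only pages matching this pattern"
# +^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$
```

The problem is that if I put the narrow `+` rule into regex-urlfilter.txt itself, the crawler never follows the intermediate pages and so never discovers the deep ones I actually want.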