I feel like this is a trivial question, but I just can't get my head around it.
I'm using Nutch 1.5.1 and Solr 3.6.1 together, and things work fine at the rudimentary level. If my understanding is correct, the regexes in nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which URLs the crawler does or does not visit. On the other hand, it doesn't seem unusual to want only certain pages to be indexed. I was hoping I could write some regular expressions for that as well in some config file, but I just can't find the right place. My hunch is that this shouldn't require digging into the code. Can anybody help?

Again, the scenario is really rather generic. Let's say we want to crawl http://www.mysite.com. We can use regex-urlfilter.txt to skip loops, unnecessary file types, etc., but we only want to index pages with URLs like http://www.mysite.com/level1pattern/level2pattern/pagepattern.html. Am I too naive to expect zero Java coding in this case?
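To make the distinction concrete, here is roughly what my crawl-time filter looks like, plus the kind of rule I'd like to express somewhere else so it applies only at indexing time (the mysite.com patterns are just placeholders for my real site):

```
# conf/regex-urlfilter.txt -- crawl-time filtering (this part works today)

# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):

# skip URLs containing characters that usually indicate loops or query pages
-[?*!@=]

# follow everything else under the site so the crawler can reach deep pages
+^http://www\.mysite\.com/

# What I'd like to say in some OTHER config, applied only when indexing:
# "index only pages matching this pattern"
# +^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$
```

The problem is that if I put the narrow `+` rule into regex-urlfilter.txt itself, the crawler never follows the intermediate pages and so never discovers the deep ones I actually want.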