You still have several possibilities here:

1) Find a way to seed the crawl with the URLs of the pages that contain the links to the leaf pages (sometimes this is possible with a simple loop).

2) Create a regex for each step of the path down to the leaf pages, so the crawl is limited to the necessary pages only. Use the $ sign at the end of your regexes to limit how much they match; unanchored, a pattern like +^http://([a-z0-9]*\.)*mysite.com matches every URL on the site.

Sketches of both approaches follow below.
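For 2), a minimal regex-urlfilter.txt sketch. It assumes the site really nests as in the example discussed below (http://www.mysite.com/level1pattern/level2pattern/pagepattern.html); the "level1pattern" etc. names are placeholders from the thread, not real paths. Remember that Nutch applies the first matching rule, and that patterns match anywhere in the URL unless you anchor them with ^ and $:

    # allow the seed / home page only at the root
    +^http://www\.mysite\.com/?$
    # allow each intermediate step so the crawler can reach the leaves
    +^http://www\.mysite\.com/level1pattern/?$
    +^http://www\.mysite\.com/level1pattern/level2pattern/?$
    # allow the leaf pages themselves
    +^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$
    # reject everything else (must come last: first match wins)
    -.

The final -. replaces the default catch-all +. that ships with Nutch. Without the $ anchor, the first rule alone would match every URL on the site, which is exactly the over-matching problem described in the quoted message below.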
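For 1), the "simple loop" can be any script that writes the listing pages (the pages that link directly to the leaves) into your seed file. A sketch in Python, assuming, purely for illustration, that the listing pages are paginated with a ?page=N parameter — adapt it to the real URL layout:

    # generate_seeds.py - write listing-page URLs into Nutch's seed file.
    # The ?page=N pagination scheme is a made-up example.
    with open("urls/seed.txt", "w") as f:
        for n in range(1, 51):
            f.write("http://www.mysite.com/level1pattern/level2pattern/?page=%d\n" % n)

With seeds like these you can crawl with a small depth (e.g. -depth 2), so only the listing pages and the leaf pages they link to are ever fetched.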
On 2 Nov 2012, at 17:22, Joe Zhang <smartag...@gmail.com> wrote:

> The problem is that:
>
> - if you write a regex such as +^http://([a-z0-9]*\.)*mysite.com, you'll end
> up indexing all the pages on the way, not just the leaf pages.
> - if you write a specific regex for
> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you
> start crawling at mysite.com, you'll get zero results, as there is no match.
>
> On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
>
>> -----Original message-----
>>> From: Joe Zhang <smartag...@gmail.com>
>>> Sent: Fri 02-Nov-2012 10:04
>>> To: user@nutch.apache.org
>>> Subject: URL filtering: crawling time vs. indexing time
>>>
>>> I feel like this is a trivial question, but I just can't get my head
>>> around it.
>>>
>>> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the
>>> rudimentary level.
>>>
>>> If my understanding is correct, the regexes in
>>> nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
>>> URLs to visit or not during the crawling process.
>>
>> Yes.
>>
>>> On the other hand, it doesn't seem unusual to want only certain
>>> pages to be indexed. I was hoping to write some regular expressions as well
>>> in some config file, but I just can't find the right place. My hunch tells
>>> me that such things should not require into-the-box coding. Can anybody
>>> help?
>>
>> What exactly do you want? To add your custom regular expressions? The
>> regex-urlfilter.txt is the place to write them.
>>
>>> Again, the scenario is really rather generic. Let's say we want to crawl
>>> http://www.mysite.com. We can use regex-urlfilter.txt to skip loops and
>>> unnecessary file types etc., but we only expect to index pages with URLs like:
>>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
>>
>> To do this you must simply make sure your regular expressions can do this.
>>
>>> Am I too naive to expect zero Java coding in this case?
>>
>> No, you can achieve almost all kinds of exotic filtering with just the URL
>> filters and the regular expressions.
>>
>> Cheers
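P.S. Before launching a long crawl, it is worth sanity-checking the filter rules from the command line. Nutch 1.x ships a URLFilterChecker tool for this; the invocation below is from memory, so verify the class name and options against your version:

    echo "http://www.mysite.com/level1pattern/level2pattern/page1.html" \
      | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

It reads URLs from stdin, runs each one through the configured filter chain, and prints it back prefixed with + (accepted) or - (rejected).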