Hi Vince, have you tried this property?
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks leading from a page to external hosts
    will be ignored. This is an effective way to limit the crawl to include
    only initially injected hosts, without creating complex URLFilters.
    </description>
  </property>

HTH,
Renaud

Vince Filby wrote:
> Hello,
>
> Is there a way to tell Nutch to only follow links within the domain it is
> currently crawling?
>
> What I would like to do is pass in a list of URLs, and Nutch should ignore
> all outbound links from any domain other than the domain that the link
> comes from. Let's say that I am crawling www.test1.com; I should only
> follow links to www.test1.com.
>
> I realize that I can do this with the regex filter *if* I add a regex rule
> for *each* site that I want to crawl, but this solution doesn't scale well
> for my project. I have also read of a database-backed URL filter that
> maintains a list of accepted URLs in a database. This also doesn't fit
> well, since I don't want to maintain both the crawl list and the
> accepted-domain database. I can, but it is rather clunky.
>
> I have poked around the source, and it looks like the URL filtering
> mechanism only passes in the link URL and returns a URL. So it appears
> that this is not really possible at the code level without source
> modifications. I would just like to confirm that I am not missing anything
> obvious before I start reworking the code.
>
> Cheers,
> Vince

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
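P.S. For anyone hitting this thread later: the property above is the default from nutch-default.xml (note the default value is false). A minimal sketch of how an override might look in conf/nutch-site.xml, assuming the standard Nutch configuration layout, with the value flipped to true to get the restrict-to-injected-hosts behavior:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml.
     With db.ignore.external.links set to true, outlinks to external
     hosts are dropped, so the crawl stays on the initially injected
     hosts without any per-site URLFilter rules. -->
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```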
