Hi,

I would like to prevent Nutch 2.1 from crawling links outside the injected URL. For example, I would like it to crawl:

  www.apache.org        (the injected URL)
  http://apache.org/foundation/
  http://projects.apache.org/

and not:

  www.youtube.com

How can this be achieved? The following does not seem to work. This is what my NUTCH_HOME/conf/nutch-default.xml looks like:

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to
  include only initially injected hosts, without creating complex
  URLFilters.
  </description>
</property>

I guess my question is: what is defined as a "host"? What I see happening is that it does not fetch all links within the site (www.apache.org/foundation etc.) but does start fetching outlink content (facebook.com, youtube.com, etc.).

Regards,
Bart
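If those two properties are not the right mechanism, would a URL filter be the way to go instead? The description of db.ignore.external.links mentions URLFilters as the alternative. Something like the following in conf/regex-urlfilter.txt is what I have in mind (just a sketch; the exact patterns are my own guess, not something I have verified):

# accept only URLs on apache.org or its subdomains
+^https?://([a-z0-9-]+\.)*apache\.org/
# reject everything else (rules are applied top to bottom,
# so this catch-all must come last)
-.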

