Hi,

I would like to prevent Nutch 2.1 from crawling links outside the injected 
URL's host. For example, I would like it to crawl:

www.apache.org: injected URL
http://apache.org/foundation/
http://projects.apache.org/ 

And not: www.youtube.com

How can this be achieved? The following does not seem to work.

This is what my NUTCH_HOME/conf/nutch-default.xml looks like:

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
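As an aside, since the description above mentions URLFilters: an alternative I have been looking at is restricting the crawl in conf/regex-urlfilter.txt. This is just a sketch, assuming the urlfilter-regex plugin is enabled and apache.org is the only injected host:

# Accept only URLs on the injected host (and its subdomains)
+^https?://([a-z0-9-]+\.)*apache\.org/
# Reject everything else
-.

But I would prefer the db.ignore.external.links property if it can be made to work, since it avoids maintaining per-host filter rules.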

I guess my question is: what exactly counts as a "host" here? What I see happening 
is that it does not fetch all links within the site (www.apache.org/foundation etc.) 
but starts fetching outlink contents (facebook.com, youtube.com etc.).

Regards, Bart
