Hi Bart, you can do this by setting up the regex-urlfilter.txt file; see the tutorial and FAQ entries at [0] and [1].
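For example, a minimal regex-urlfilter.txt along the lines of the FAQ entry in [1] might look like this (using apache.org from your seed as the example domain; adjust it to your own, and note that the default file ends with a catch-all "+." line that you must remove or replace):

```
# Skip URLs containing certain characters that are likely CGI cruft.
-[?*!@=]

# Accept apache.org and its subdomains.
+^http://([a-z0-9]*\.)*apache.org/

# Reject everything else (replaces the default "+." accept-all rule).
-.
```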
For the db.ignore.internal.links property: if true, links that point to a page on the same domain will NOT be stored. For example, if www.domain.com/a.html links to www.domain.com/b.html, that link is ignored. This significantly decreases the number of links stored in the link database.

For the db.ignore.external.links property: use this if you would like to restrict a crawl to the domains of the seed URLs without using the urlfilter-regex plugin. This property looks like it will do the trick.

For a simple URL like http://www.example.com, the host is "www.example.com". This is specified in RFC 1738 [2]; see section 3.1 on "Common Internet Scheme Syntax".

[0] http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
[1] http://wiki.apache.org/nutch/FAQ#Is_it_possible_to_fetch_only_pages_from_some_specific_domains.3F
[2] http://www.ietf.org/rfc/rfc1738.txt

On Sun, Feb 24, 2013 at 5:16 AM, jazz <[email protected]> wrote:
> Hi,
>
> I would like to prevent nutch 2.1 from crawling links outside the
> injected URL. So, I would like it to crawl, for example:
>
> www.apache.org: injected URL
> http://apache.org/foundation/
> http://projects.apache.org/
>
> And not: www.youtube.com
>
> How can this be achieved? This does not seem to work:
>
> This is how my NUTCH_HOME/conf/nutch-default.xml looks:
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>true</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored. This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> I guess my question is: what is defined by "host"? What I see happening is
> that it does not fetch all links within the site
> (www.apache.org/foundation etc.) but starts fetching outlink contents
> (facebook.com, youtube.com etc.)
>
> Regards, Bart

--
Don't Grow Old, Grow Up... :-)

