Hi Bart

You can do this by editing the regex-urlfilter.txt file; see this
tutorial and the FAQ entry [0][1].
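For example, to restrict the crawl to apache.org, the filter rules in
conf/regex-urlfilter.txt could look like this (a sketch for your seeds;
the default file ends with a catch-all "+." rule, which you would replace):

```
# accept any URL on apache.org or one of its subdomains
+^https?://([a-z0-9-]+\.)*apache\.org/
# reject everything else
-.
```

Rules are applied top to bottom, and the first matching "+"/"-" prefix wins.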

As for the db.ignore.internal.links property: if it is true, Nutch will NOT
store links that point to the same domain, for example a link on the page
www.domain.com/a.html that points to www.domain.com/b.html. This
significantly decreases the number of links stored in the link database.

As for the db.ignore.external.links property: if you would like to restrict
a crawl to the domains of your seed URLs without using the urlfilter-regex
plugin, this property will do the trick.
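One practical note: nutch-default.xml is normally left untouched; overrides
usually go into conf/nutch-site.xml, which takes precedence. A minimal
override would look like this:

```
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```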

For a simple URL like http://www.example.com, the host is "www.example.com".
This is specified in RFC 1738, section 3.1, "Common Internet Scheme
Syntax" [2].
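To see exactly what counts as the host, you can check with java.net.URL
(a small standalone sketch, not Nutch code). Note that www.apache.org and
projects.apache.org are different hosts even though they share the
apache.org domain, which is likely why the property alone did not behave
as you expected:

```java
import java.net.URL;

public class HostDemo {
    public static void main(String[] args) throws Exception {
        // Per RFC 1738, the host is the part of the URL between "//"
        // and the next "/" (or ":port").
        URL u = new URL("http://www.example.com/a.html");
        System.out.println(u.getHost());   // www.example.com

        // A subdomain is a DIFFERENT host than the parent domain:
        System.out.println(new URL("http://projects.apache.org/").getHost());
        System.out.println(new URL("http://www.apache.org/").getHost());
    }
}
```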



[0] http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
[1]
http://wiki.apache.org/nutch/FAQ#Is_it_possible_to_fetch_only_pages_from_some_specific_domains.3F
[2] http://www.ietf.org/rfc/rfc1738.txt


On Sun, Feb 24, 2013 at 5:16 AM, jazz <[email protected]> wrote:

> Hi,
>
> I would like to prevent that nutch 2.1 is crawling links outside the
> injected URL. So, I would like it to crawl for example:
>
> www.apache.org: injected URL
> http://apache.org/foundation/
> http://projects.apache.org/
>
> And not: www.youtube.com
>
> How can this be achieved? This does not seem to work:
>
> This is how my NUTCH_HOME/conf/nutch-default.xml looks like
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>true</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> I guess my question is what is defined by "host"? What I see happening is
> that is does not fetch all links within the site (
> www.apache.org/foundation etc) but starts fetching outlink contents (
> facebook.com, youtube.com etc)
>
> Regards, Bart




-- 
Don't Grow Old, Grow Up... :-)
