hi Vince,

have you tried this property? It defaults to false, so you would need to set it to true in your conf/nutch-site.xml:

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
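For reference, a minimal override in conf/nutch-site.xml might look like this (a sketch — the property name is taken from nutch-default.xml as quoted above, only the value is flipped to true):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Override the shipped default (false) so that outlinks
       pointing to external hosts are dropped during updatedb. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```

One caveat: if I remember right, the comparison is done per host (so links from www.test1.com to forum.test1.com would also be dropped), which may or may not match what you mean by "domain".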

HTH,
Renaud



Vince Filby wrote:
> Hello,
>
> Is there a way to tell Nutch to only follow links within the domain it is
> currently crawling?
>
> What I would like to do is pass in a list of URLs, and have Nutch ignore
> any outbound link that points to a domain other than the one the link
> comes from.  Let's say I am crawling www.test1.com; I should only follow
> links to www.test1.com.
>
> I realize that I can do this with the regex URL filter *if* I add a regex
> rule for *each* site that I want to crawl, but that solution doesn't scale
> well for my project.  I have also read of a db-based URL filter that
> maintains a list of accepted URLs in a database.  That doesn't fit well
> either, since I don't want to maintain both the crawl list and the
> accepted-domain database.  I can, but it is rather clunky.
>
> I have poked around the source, and it looks like the URL filtering
> mechanism is only passed the link URL and returns a URL.  So it appears
> this is not really possible at the code level without source
> modifications.  I would just like to confirm that I am not missing
> anything obvious before I start reworking the code.
>
> Cheers,
> Vince
>
>   


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
