[ 
http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12363807 ] 

Philippe EUGENE commented on NUTCH-173:
---------------------------------------

I have more than 5,000 hosts in my directory, and I'm not sure how crawl 
performance holds up with more than 5,000 rules.
It's easier for me to just manage a boolean value in the Nutch conf.
I know this is not the natural way to crawl with Nutch, but it could be 
of interest to some Nutch users.
The most important problem: scoring from external links is affected by this 
patch.
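
To illustrate the boolean approach, here is a minimal sketch of a per-host 
check. It is not the code from patch.txt or patch08.txt: the class 
ExternalLinkFilter and the methods filterOutlinks and hostOf are hypothetical 
names, and I assume that "external" means the outlink's host differs from the 
host of the page it was found on.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not taken from the attached patches.
public class ExternalLinkFilter {

    // Returns the outlinks to keep for a page fetched from fromUrl.
    // When ignoreExternalLinks is false, every outlink is kept (default behavior).
    // When it is true, only outlinks on the same host as fromUrl are kept.
    static List<String> filterOutlinks(String fromUrl, List<String> outlinks,
                                       boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            return new ArrayList<>(outlinks);
        }
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            // One string comparison per link, regardless of how many hosts were injected.
            if (fromHost != null && fromHost.equals(hostOf(link))) {
                kept.add(link);
            }
        }
        return kept;
    }

    // Extracts the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

With a check of this kind the cost per outlink does not grow with the number 
of hosts in the directory, whereas a rule list with one pattern per host may 
have to be scanned for every link.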


> PerHost Crawling Policy ( crawl.ignore.external.links )
> -------------------------------------------------------
>
>          Key: NUTCH-173
>          URL: http://issues.apache.org/jira/browse/NUTCH-173
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.8-dev
>     Reporter: Philippe EUGENE
>     Priority: Minor
>  Attachments: patch.txt, patch08.txt
>
> There are two major ways to crawl in Nutch.
> Intranet crawl: forbid everything, allow a few hosts
> Whole-web crawl: allow everything, forbid a few things
> I propose a third type of crawl.
> Directory crawl: the purpose of this crawl is to manage a few thousand 
> hosts without managing rule patterns in UrlFilterRegexp.
> I made two patches, covering 0.7, 0.7.1 and 0.8-dev.
> I propose a new boolean property in nutch-site.xml: 
> crawl.ignore.external.links, with a default value of false.
> By default this new feature doesn't modify the behavior of the Nutch crawler.
> When you set this property to true, the crawler doesn't fetch links that 
> point outside the host.
> So the crawl is limited to the hosts that you inject at the beginning of the 
> crawl.
> I know there are some proposals for a new crawl policy using the CrawlDatum 
> in the 0.8-dev branch. 
> This feature could be an easier way to quickly add a new crawl feature to 
> Nutch while waiting for a better way to improve crawl policy.
> I am attaching two patches.
> Sorry for my very poor English 
> --
> Philippe

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
