[ https://issues.apache.org/jira/browse/NUTCH-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-830:
--------------------------------

    Attachment: NUTCH-830.patch

> ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-830
>                 URL: https://issues.apache.org/jira/browse/NUTCH-830
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 2.0
>
>         Attachments: NUTCH-830.patch
>
>
> The DomainURLFilter lets you specify the domains to consider for a crawl.
> This works fine but requires editing a list of domains / hosts manually. The
> patch presented here offers the same functionality through a different
> mechanism: a custom scoring filter is used to filter the outlinks.
> 1. Add a metadata entry to each URL in your seed list, e.g. '_origin_', with
> the seed URL as its value
> e.g. http://www.cnn.com/ _origin_=http://www.cnn.com/
> 2. The custom scoring filter then takes care of:
> * passing the origin metadata on to the outlinks
> * removing the outlinks that do not have the same host / domain as the
> origin
> The parameter _scoring.insite.mode_ specifies whether to restrict on the
> host or on the domain. The parameter _scoring.insite.addOriginOnInject_
> enables adding the metadata during the injection step, reusing the seed URL
> automatically.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
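
As a rough illustration of the outlink filtering described in the issue, here is a minimal standalone sketch in Java. It is not the code from NUTCH-830.patch and does not use the Nutch ScoringFilter API. The class name InSiteFilterSketch, the Mode enum and the helper methods are hypothetical, and the "domain" is approximated naively as the last two labels of the host, which is an assumption that fails for suffixes such as co.uk.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class InSiteFilterSketch {

    /** Hypothetical stand-in for the two choices behind scoring.insite.mode. */
    public enum Mode { HOST, DOMAIN }

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** Naive domain (assumption): the last two labels of the host. */
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /** The value compared between origin and outlink: host or naive domain. */
    static String keyOf(String url, Mode mode) {
        String host = hostOf(url);
        if (host == null) {
            return null;
        }
        return mode == Mode.HOST ? host : domainOf(host);
    }

    /** Keeps only the outlinks whose host / domain matches that of the origin. */
    public static List<String> keepInSite(String origin, List<String> outlinks, Mode mode) {
        List<String> kept = new ArrayList<>();
        String originKey = keyOf(origin, mode);
        if (originKey == null) {
            return kept; // unparseable origin: nothing can match
        }
        for (String outlink : outlinks) {
            if (originKey.equals(keyOf(outlink, mode))) {
                kept.add(outlink);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = List.of(
                "http://www.cnn.com/world",
                "http://edition.cnn.com/europe",
                "http://www.example.org/");
        System.out.println(keepInSite("http://www.cnn.com/", outlinks, Mode.HOST));
        System.out.println(keepInSite("http://www.cnn.com/", outlinks, Mode.DOMAIN));
    }
}
```

With the origin http://www.cnn.com/, host mode keeps only the www.cnn.com link, while domain mode also keeps the edition.cnn.com link. In both modes the example.org link is dropped. In a real crawl, the kept outlinks would additionally inherit the origin metadata so that the same check applies at the next depth.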