[ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-668:
-------------------------------

    Attachment: NUTCH-668-1-20081202.patch

Includes the DomainURLFilter and test files.  URLs can be filtered either by 
top-level domain, ignoring subdomains, or by hostname, depending on 
configuration.  A configuration file lists the valid domains, one per line.  
Those domains are used to build a set of valid domains against which URLs are 
validated at runtime.  Only URLs that match a domain in that set are 
considered valid.
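
To make the matching rule above concrete, here is a minimal standalone sketch 
of a set-based domain check.  It is not the code from the attached patch: the 
class name DomainFilterSketch, the hostOnly flag, the '#' comment handling, and 
the suffix-walking match are all assumptions made for illustration.  The only 
convention borrowed from Nutch is that filter() returns the URL when it is 
accepted and null when it is rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; NOT the DomainURLFilter class from the patch.
public class DomainFilterSketch {
    // Valid domains loaded from the configuration lines.
    private final Set<String> allowed = new HashSet<>();
    // Assumed switch: true = match full hostnames only,
    // false = also match any parent domain of the host.
    private final boolean hostOnly;

    // lines: contents of the configuration file, one domain per line.
    public DomainFilterSketch(List<String> lines, boolean hostOnly) {
        this.hostOnly = hostOnly;
        for (String line : lines) {
            String entry = line.trim().toLowerCase(Locale.ROOT);
            // Skipping blank lines and '#' comments is an assumption.
            if (!entry.isEmpty() && !entry.startsWith("#")) {
                allowed.add(entry);
            }
        }
    }

    // Returns the URL if its host matches the domain set, otherwise null.
    public String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparseable URLs are rejected
        }
        if (host == null) {
            return null; // no host component to match against
        }
        host = host.toLowerCase(Locale.ROOT);
        if (allowed.contains(host)) {
            return url;
        }
        if (hostOnly) {
            return null;
        }
        // Strip one leading label at a time: a.b.org -> b.org -> org.
        int dot = host.indexOf('.');
        while (dot >= 0) {
            host = host.substring(dot + 1);
            if (allowed.contains(host)) {
                return url;
            }
            dot = host.indexOf('.');
        }
        return null;
    }
}
```

With a domain set containing only apache.org and hostOnly set to false, this 
sketch accepts http://lucene.apache.org/nutch/ and rejects 
http://www.example.net/.  With hostOnly set to true, only URLs whose host is 
exactly apache.org are accepted.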

> Domain URL Filter
> -----------------
>
>                 Key: NUTCH-668
>                 URL: https://issues.apache.org/jira/browse/NUTCH-668
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-668-1-20081202.patch
>
>
> A URLFilter that adds the ability to filter out URLs by top-level domain or 
> by hostname.  A configuration file with a listing of URLs is used to denote 
> accepted URLs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
