[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-668:
-------------------------------

    Attachment: NUTCH-668-1-20081202.patch

Includes the DomainURLFilter and test files. URLs can be filtered either by top-level domain, ignoring subdomains, or by hostname, as set through configuration. A configuration file lists the valid domains, one per line. Those domains are used to build a set of valid domains against which URLs are validated at runtime. Only URLs that match a domain in that set are considered valid.

> Domain URL Filter
> -----------------
>
>                 Key: NUTCH-668
>                 URL: https://issues.apache.org/jira/browse/NUTCH-668
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-668-1-20081202.patch
>
>
> A URLFilter that adds the ability to filter out URLs by top-level domain or
> by hostname. A configuration file with a listing of URLs is used to denote
> accepted URLs.
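
The filtering behaviour described in the update comment could look roughly like the following. This is only a minimal sketch, not the code in the attached patch: the class name DomainFilterSketch, the '#' comment handling, and the suffix-walking match are assumptions made for illustration. It follows the Nutch URLFilter convention of returning the URL when it is accepted and null when it is rejected.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not the class shipped in NUTCH-668-1-20081202.patch.
class DomainFilterSketch {
    private final Set<String> accepted = new HashSet<>();

    // Each non-blank line of the configuration file is one accepted entry.
    // Skipping lines that start with '#' is an assumption of this sketch.
    DomainFilterSketch(List<String> configLines) {
        for (String line : configLines) {
            String entry = line.trim().toLowerCase(Locale.ROOT);
            if (!entry.isEmpty() && !entry.startsWith("#")) {
                accepted.add(entry);
            }
        }
    }

    // Returns the URL unchanged if it is accepted, or null if it is rejected.
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparseable URL
        }
        if (host == null) {
            return null; // no hostname to validate
        }
        // Try the full hostname first, then each parent suffix:
        // "a.b.example.org" -> "b.example.org" -> "example.org" -> "org".
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (accepted.contains(candidate)) {
                return url;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            candidate = candidate.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        DomainFilterSketch f =
            new DomainFilterSketch(List.of("apache.org", "www.example.com"));
        System.out.println(f.filter("http://lucene.apache.org/nutch/"));
        System.out.println(f.filter("http://www.example.com/index.html"));
        System.out.println(f.filter("http://mail.example.com/"));
    }
}
```

In this sketch an entry such as apache.org accepts every subdomain, while an entry such as www.example.com accepts only that exact host, so the third call in main is rejected. Whether the patch resolves the two cases in the same way is not stated in the message.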