[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653856#action_12653856 ]
Andrzej Bialecki commented on NUTCH-668:
-----------------------------------------

The test case contains a reference to a path on your local machine ...

Also, the issue of domain vs. subdomain vs. host matching ... I'd love to be able to specify patterns like this:

edu
example.com
blurfl.foobar.org

meaning: accept everything from the .edu TLD, everything from example.com including subdomains and hosts, and anything from blurfl.foobar.org, whether that's a hostname or a subdomain.

We could do it with a suffix tree, or by matching an increasing number of hostname elements against the HashSet, e.g. for www.blurfl.foobar.org we would check:

org - no match
foobar.org - no match
blurfl.foobar.org - match, break and return

For www.foobar.com we would check:

com - no match
foobar.com - no match
www.foobar.com - no match
return null

The price is that we need to make as many probes in the HashSet as there are domain elements, but the advantage is the increased flexibility in configuring allowed domains / hosts.

I'm also fine if you want to commit it as it is, and create an issue to enhance this plugin later.

> Domain URL Filter
> -----------------
>
>                 Key: NUTCH-668
>                 URL: https://issues.apache.org/jira/browse/NUTCH-668
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch
>
>
> A URLFilter that adds the ability to filter out URLs by top level domain or
> by hostname.  A configuration file with a listing of URLs is used to denote
> accepted urls.
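For reference, the HashSet probing idea from the comment above (check org, then foobar.org, then blurfl.foobar.org, and so on) could be sketched roughly as follows. This is only an illustration, not code from the attached patches: the class name DomainSuffixMatcher and the method match are hypothetical, and the allowed set is assumed to hold lower-case entries such as the three example patterns.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of the NUTCH-668 patch.
class DomainSuffixMatcher {

    // Assumed to contain lower-case TLDs, domains, or hostnames.
    private final Set<String> allowed;

    DomainSuffixMatcher(Set<String> allowed) {
        this.allowed = allowed;
    }

    // Probes the set with progressively longer suffixes of the host,
    // starting from the TLD. Returns the first entry that matches,
    // or null if no suffix (including the full host) is in the set.
    String match(String host) {
        String[] parts = host.toLowerCase(Locale.ROOT).split("\\.");
        String candidate = null;
        for (int i = parts.length - 1; i >= 0; i--) {
            // Prepend the next hostname element: org, foobar.org, ...
            candidate = (candidate == null) ? parts[i] : parts[i] + "." + candidate;
            if (allowed.contains(candidate)) {
                return candidate; // match, break and return
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(
                Arrays.asList("edu", "example.com", "blurfl.foobar.org"));
        DomainSuffixMatcher matcher = new DomainSuffixMatcher(allowed);

        System.out.println(matcher.match("www.blurfl.foobar.org")); // blurfl.foobar.org
        System.out.println(matcher.match("www.foobar.com"));        // null
        System.out.println(matcher.match("cs.example.edu"));        // edu
    }
}
```

The number of set lookups equals the number of dot-separated elements in the hostname, which is the cost mentioned in the comment. A suffix tree would be the alternative if that cost ever mattered.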