[ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653856#action_12653856
 ] 

Andrzej Bialecki  commented on NUTCH-668:
-----------------------------------------

The test case contains a reference to a path on your local machine ...

Also, the issue of domain vs. subdomain vs. host matching ... I'd love to be 
able to specify patterns like this:

edu
example.com
blurfl.foobar.org

meaning: accept everything from the .edu TLD, everything from example.com
including subdomains and hosts, and anything from blurfl.foobar.org, whether
that's a hostname or a subdomain.

We could do it with a suffix tree, or by probing the HashSet with progressively 
longer suffixes of the hostname, e.g. for www.blurfl.foobar.org we would check:

 org - no match
 foobar.org - no match
 blurfl.foobar.org - match, break and return

For www.foobar.com we would check:

 com - no match
 foobar.com - no match
 www.foobar.com - no match
 return null

The price is that we need to make as many probes in the HashSet as there are 
domain elements, but the advantage is the increased flexibility in configuring 
allowed domains / hosts.
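
A rough sketch of the HashSet probing described above, in case it helps the 
discussion. The class and method names (DomainSuffixProbe, findMatch) are 
made up for illustration and are not taken from the attached patches; it also 
assumes the configured entries are stored lowercase and without a leading dot.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the NUTCH-668 patch.
class DomainSuffixProbe {

    /**
     * Probes the set with progressively longer suffixes of the hostname,
     * starting from the TLD. Returns the first suffix found in the set,
     * or null if none of the suffixes is listed.
     */
    static String findMatch(String host, Set<String> allowed) {
        String[] parts = host.toLowerCase(Locale.ROOT).split("\\.");
        String suffix = null;
        // Walk from the last element (TLD) towards the first (host name).
        for (int i = parts.length - 1; i >= 0; i--) {
            suffix = (suffix == null) ? parts[i] : parts[i] + "." + suffix;
            if (allowed.contains(suffix)) {
                return suffix; // match, break and return
            }
        }
        return null; // no suffix matched
    }
}
```

With a set containing edu, example.com and blurfl.foobar.org, 
findMatch("www.blurfl.foobar.org", set) makes three probes and returns 
blurfl.foobar.org, while findMatch("www.foobar.com", set) makes three probes 
and returns null.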

I'm also fine if you want to commit it as it is, and create an issue to enhance 
this plugin later.



> Domain URL Filter
> -----------------
>
>                 Key: NUTCH-668
>                 URL: https://issues.apache.org/jira/browse/NUTCH-668
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch
>
>
> A URLFilter that adds the ability to filter out URLs by top level domain or 
> by hostname.  A configuration file with a listing of URLs is used to denote 
> accepted urls.

