[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848517#comment-13848517
 ] 

Tejas Patil commented on NUTCH-1325:
------------------------------------

Hi [~markus17],
I stopped by this Jira (after a long time!) with the intention of getting it 
to a stage where we could have it in trunk. 
You had replied to my two concerns.

For (1): 
{noformat}host_a.example.org, host_b.example.org ==> example.org{noformat}

This might *NOT* be a good idea. 
(a) The websites for, say, "cs.uci.edu" and "bio.uci.edu" might be hosted 
independently, so it can be argued that they should be treated as different hosts.
(b) I am not sure about the standards, but if something like "uci.cs.edu" is 
valid (subdomain is suffix of domain) then there would be a problem when we 
resolve both "uci.cs.edu" and "ucla.cs.edu" to "cs.edu": two unrelated sites 
would end up under a single entry.
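
To make both cases concrete, here is a minimal sketch of a naive resolver that keeps only the last two labels of a host name. The class {{NaiveDomainResolver}} and its {{toDomain}} method are hypothetical names used for illustration only; this is not the code in the patch and not how Nutch resolves domains.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch or the attached patch.
class NaiveDomainResolver {

    /** Keeps only the last two dot-separated labels of a host name. */
    static String toDomain(String host) {
        String name = host.toLowerCase(Locale.ROOT);
        String[] labels = name.split("\\.");
        if (labels.length <= 2) {
            return name; // already a bare domain (or a single label)
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        String[] hosts = {
            "host_a.example.org", "host_b.example.org", // the intended case
            "cs.uci.edu", "bio.uci.edu",                // case (a)
            "uci.cs.edu", "ucla.cs.edu"                 // case (b), assuming such names are valid
        };
        for (String host : hosts) {
            System.out.println(host + " ==> " + toDomain(host));
        }
    }
}
```

With this approach the two independently hosted departments in (a) both map to "uci.edu", and the two hosts in (b) both map to "cs.edu".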

For (2): "I use the HTTP:// scheme but not all hosts may allow that scheme. We 
have a modified domain filter that optionally takes a scheme so we can force 
HTTPS for specific domains. Those domains are filtered out because HTTP is not 
allowed."
Do you have any suggestions for working this out?
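
One possible direction, offered only as a rough sketch: keep a per-domain scheme override and fall back to a default scheme when no override matches. The class {{SchemeAwareHostUrls}} and its methods are hypothetical and are not the modified domain filter mentioned above; the lookup walks from the full host name up through its parent domains.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the modified domain filter and not part of Nutch.
class SchemeAwareHostUrls {
    private final Map<String, String> schemeByDomain = new HashMap<>();
    private final String defaultScheme;

    SchemeAwareHostUrls(String defaultScheme) {
        this.defaultScheme = defaultScheme;
    }

    /** Forces the given scheme for a domain and all hosts below it. */
    void forceScheme(String domain, String scheme) {
        schemeByDomain.put(domain.toLowerCase(Locale.ROOT), scheme);
    }

    /** Looks up the host, then each parent domain; falls back to the default scheme. */
    String schemeFor(String host) {
        String name = host.toLowerCase(Locale.ROOT);
        while (true) {
            String scheme = schemeByDomain.get(name);
            if (scheme != null) {
                return scheme;
            }
            int dot = name.indexOf('.');
            if (dot < 0) {
                return defaultScheme;
            }
            name = name.substring(dot + 1); // drop the leftmost label
        }
    }

    /** Builds the URL of the host's root page using the resolved scheme. */
    String homepage(String host) {
        return schemeFor(host) + "://" + host + "/";
    }
}
```

For example, after {{forceScheme("example.org", "https")}} on an instance created with "http" as the default, {{homepage("shop.example.org")}} yields "https://shop.example.org/" while {{homepage("example.com")}} still yields "http://example.com/".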

> HostDB for Nutch
> ----------------
>
>                 Key: NUTCH-1325
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1325
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
