[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848517#comment-13848517 ]
Tejas Patil commented on NUTCH-1325:
------------------------------------

Hi [~markus17],

I stopped by this Jira (after a long time!) with the intention of getting it to a stage where we could have it in trunk. You had replied to my two concerns.

For (1):
{noformat}host_a.example.org, host_b.example.org ==> example.org{noformat}
This might *NOT* be a good idea.
(a) The websites for, say, "cs.uci.edu" and "bio.uci.edu" might be hosted independently, so it can be argued that they should be treated as different hosts.
(b) I am not sure about the standards, but if something like "uci.cs.edu" is valid (subdomain is suffix of domain), then there would be a problem when we resolve both "uci.cs.edu" and "ucla.cs.edu" to "cs.edu".

For (2):
"I use the HTTP:// scheme but not all hosts may allow that scheme. We have a modified domain filter that optionally takes a scheme so we can force HTTPS for specific domains. Those domains are filtered out because HTTP is not allowed."
Do you have any suggestions for working this out?

> HostDB for Nutch
> ----------------
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database
> containing information on hosts.

--
This message was sent by Atlassian JIRA (v6.1.4#6159)
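
Concern (1) in the comment above can be illustrated with a minimal Java sketch. It is not code from the attached patches: the class `HostCollapseSketch` and the helpers `naiveDomain` and `groupByDomain` are hypothetical, and the "keep the last two labels" rule is only an assumed stand-in for whatever host-to-domain resolution the HostDB would use. It shows how independently hosted sites end up sharing a single key once hosts are collapsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HostCollapseSketch {

    // Hypothetical helper: keep only the last two labels of a host name.
    // Assumption for illustration only; it ignores multi-label public
    // suffixes such as "co.uk".
    static String naiveDomain(String host) {
        String lower = host.toLowerCase();
        String[] labels = lower.split("\\.");
        if (labels.length <= 2) {
            return lower;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    // Group hosts by the key they would share after collapsing.
    static Map<String, List<String>> groupByDomain(List<String> hosts) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String host : hosts) {
            groups.computeIfAbsent(naiveDomain(host), k -> new ArrayList<>()).add(host);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
                "cs.uci.edu", "bio.uci.edu", "uci.cs.edu", "ucla.cs.edu");
        // Each pair of distinct hosts is merged under one domain key.
        System.out.println(groupByDomain(hosts));
    }
}
```

With this rule, "cs.uci.edu" and "bio.uci.edu" are both stored under "uci.edu", and "uci.cs.edu" and "ucla.cs.edu" are both stored under "cs.edu", so any per-host statistics for the two sites in each pair would be mixed together.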