[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1325:
-------------------------------

    Attachment: NUTCH-1325.trunk.v2.path

Hi [~markus17],
The initial patch is good. This feature would be a good addition to Nutch :) 
I made some minor changes to it (NUTCH-1325.trunk.v2.path), mainly to make it 
work with the current trunk.

Sorry for bringing this up after an entire year. Would it be OK if I take 
this work forward?

If yes, then kindly give me more details about the items marked "TODO":
(1) The DumpHostDb class does not have a reducer, and it contains this comment:
{noformat}reduce unknown hosts to single unknown domain if possible. Enable via 
configuration
host_a.example.org,host_a.example.org ==> example.org{noformat}

In the example, both hosts are the same. Are these reductions OK?
- host_a.example.org, host_b.example.org ==> example.org
- x.xyz.org, a.abc.org ==> unknown
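
To make the question concrete, the behaviour I am asking about could be 
sketched roughly as below. This is only an illustration and not code from the 
patch: the class and method names are made up, and the "domain" is naively 
assumed to be the last two labels of the host name. A real implementation 
would need a proper public-suffix lookup.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical and not part of the patch.
class HostReduceSketch {

    // Naive assumption: the domain is the last two dot-separated labels.
    // This is wrong for suffixes such as co.uk; it only serves the example.
    static String naiveDomain(String host) {
        String lower = host.toLowerCase();
        String[] labels = lower.split("\\.");
        if (labels.length < 2) {
            return lower;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    // Returns the shared domain if every host maps to the same one,
    // otherwise the literal string "unknown".
    static String reduceHosts(List<String> hosts) {
        String common = null;
        for (String host : hosts) {
            String domain = naiveDomain(host);
            if (common == null) {
                common = domain;
            } else if (!common.equals(domain)) {
                return "unknown";
            }
        }
        return common == null ? "unknown" : common;
    }

    public static void main(String[] args) {
        System.out.println(reduceHosts(
            Arrays.asList("host_a.example.org", "host_b.example.org")));
        System.out.println(reduceHosts(
            Arrays.asList("x.xyz.org", "a.abc.org")));
    }
}
```

With this reading, the first pair above would collapse to example.org and the 
second pair would fall back to unknown.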

(2) In the UpdateHostDb class, map() method:
{noformat}TODO: fix multi redirects: host_a => host_b/page => 
host_c/page/whatever
http://www.ferienwohnung-armbruster.de/
http://www.ferienwohnung-armbruster.de/website/
http://www.ferienwohnung-armbruster.de/website/willkommen.php

We cannot reresolve redirects for host objects as CrawlDatum metadata is
not available. We also cannot reliably use the reducer in all cases since
redirects may be across hosts or even domains. The example above has
redirects that will end up in the same reducer. During that phase,
however, we do not know which URL redirects to the next URL.{noformat}

The example does not show the case where the redirects cross different 
hosts.
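
For reference, a cross-host chain such as host_a => host_b/page => 
host_c/page/whatever could be followed as sketched below, but only under the 
assumption that the source-to-target redirect pairs are available as a plain 
map. As the TODO points out, that information is not available in the reducer 
today, so this is a hypothetical illustration and not a proposed fix. The 
class and method names are made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names are hypothetical and not part of the patch.
class RedirectChainSketch {

    // Follows source -> target pairs until a URL with no further redirect
    // is reached. The "seen" set stops the walk if the chain loops.
    static String resolve(Map<String, String> redirects, String start) {
        Set<String> seen = new HashSet<>();
        String current = start;
        while (redirects.containsKey(current) && seen.add(current)) {
            current = redirects.get(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Assumed input: redirect pairs that cross hosts.
        Map<String, String> redirects = new HashMap<>();
        redirects.put("host_a", "host_b/page");
        redirects.put("host_b/page", "host_c/page/whatever");
        System.out.println(resolve(redirects, "host_a"));
    }
}
```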
                
> HostDB for Nutch
> ----------------
>
>                 Key: NUTCH-1325
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1325
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.7
>
>         Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
