[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tejas Patil updated NUTCH-1325:
-------------------------------
    Attachment: NUTCH-1325.trunk.v2.path

Hi [~markus17],

The initial patch is good, and this feature would be a good addition to Nutch :) I made some minor changes to it (NUTCH-1325.trunk.v2.path), mainly to make it work with the current trunk.

Sorry for bringing this up after an entire year. Would it be OK if I take this work forward? If yes, please give me more details about the items marked "TODO":

(1) The DumpHostDb class doesn't have a reducer, and it contains this comment:
{noformat}
reduce unknown hosts to single unknown domain if possible. Enable via configuration
host_a.example.org,host_a.example.org ==> example.org
{noformat}
In the example, both hosts are the same. Are these expected results correct?
- host_a.example.org, host_b.example.org ==> example.org
- x.xyz.org, a.abc.org ==> unknown

(2) In the UpdateHostDb class, the map() method has this comment:
{noformat}
TODO: fix multi redirects: host_a => host_b/page => host_c/page/whatever
http://www.ferienwohnung-armbruster.de/
http://www.ferienwohnung-armbruster.de/website/
http://www.ferienwohnung-armbruster.de/website/willkommen.php

We cannot reresolve redirects for host objects as CrawlDatum metadata is not available.
We also cannot reliably use the reducer in all cases since redirects may be across hosts or even domains.
The example above has redirects that will end up in the same reducer. During that phase, however, we do not know which URL redirects to the next URL.
{noformat}
The example does not show the case where the redirects cross different hosts.

> HostDB for Nutch
> ----------------
>
>                 Key: NUTCH-1325
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1325
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.7
>
>         Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database
> containing information on hosts.
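
For reference, the reduction asked about in (1) could be sketched roughly as follows. This is only an illustration of the two expected results listed above, not code from either patch: the class HostReduceSketch and its methods are hypothetical, and the domain is guessed naively from the last two labels of the host name, which is an assumption that breaks for suffixes such as co.uk.

```java
import java.util.Arrays;
import java.util.List;

public class HostReduceSketch {

  /** Naive domain guess (assumption): the last two dot-separated labels. */
  public static String naiveDomain(String host) {
    String[] labels = host.toLowerCase().split("\\.");
    if (labels.length < 2) {
      return host.toLowerCase();
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  /**
   * Returns the domain shared by all hosts, or "unknown" when the list is
   * empty or the hosts belong to different domains.
   */
  public static String reduceToDomain(List<String> hosts) {
    String common = null;
    for (String host : hosts) {
      String domain = naiveDomain(host);
      if (common == null) {
        common = domain;          // first host sets the candidate domain
      } else if (!common.equals(domain)) {
        return "unknown";         // hosts span more than one domain
      }
    }
    return common == null ? "unknown" : common;
  }

  public static void main(String[] args) {
    System.out.println(reduceToDomain(
        Arrays.asList("host_a.example.org", "host_b.example.org")));
    System.out.println(reduceToDomain(
        Arrays.asList("x.xyz.org", "a.abc.org")));
  }
}
```

A real reducer would presumably rely on a proper public-suffix lookup rather than label counting; the sketch only pins down the intended input/output behaviour.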