[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferdy Galema updated NUTCH-882:
-------------------------------

    Attachment: NUTCH-882-v3.txt
                NUTCH-882-v3.txt

New version of patch. (I am finishing this issue on behalf of Mathijs. Nevertheless, he has done much of the hard work!)

Building hostdb links (inlinks and outlinks at the host level) now works too. Use:
org.apache.nutch.host.HostDbUpdateJob -linkDb

This patch adds Host store definitions to the Gora mapping for HBase only. (Other stores can easily be added later.) It needs GORA-105, so you can only use the added functionality with a trunk version of Gora, or wait until Nutchgora updates to Gora 0.2 (which should be soon).

No tests are included yet. For now this is okay, because by default this patch does not change existing functionality. (Also, it's a bit of a pain to add tests, because the current tests depend on a valid SQLStore, but updating Gora results in a dropped SQLStore, so there is an issue that needs to be solved first. In another issue, that is.)

Will commit this in a few days.

> Design a Host table in GORA
> ---------------------------
>
>                 Key: NUTCH-882
>                 URL: https://issues.apache.org/jira/browse/NUTCH-882
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: nutchgora
>            Reporter: Julien Nioche
>             Fix For: nutchgora
>
>         Attachments: NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt,
> hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and
> domains?) would be very useful for :
> * customising the behaviour of the fetching on a host basis e.g. number of
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages
> * keeping a copy of the robots.txt and possibly use that later to filter the
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments
> are of course already welcome

--
This message is automatically generated by JIRA.