[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-882:
-------------------------------

    Attachment: NUTCH-882-v3.txt
                NUTCH-882-v3.txt

New version of the patch. (I am finishing this issue on behalf of Mathijs, 
who has done much of the hard work!)

Building hostdb links (inlinks and outlinks at the host level) works now too. 
Use:
org.apache.nutch.host.HostDbUpdateJob -linkDb
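
For readers unfamiliar with the idea, here is a rough sketch of what 
"inlinks and outlinks at the host level" means. It is not code from the 
patch: the class HostLinkSketch and its methods are hypothetical, and 
skipping links that stay within one host is an assumption made only for 
this illustration. Page-level links are rolled up to their hosts, and the 
host outlinks are then inverted to obtain host inlinks.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, NOT part of the NUTCH-882 patch.
class HostLinkSketch {

  /** Returns the lower-cased host of a URL, or null if it has none. */
  static String hostOf(String url) {
    try {
      String host = URI.create(url).getHost();
      return host == null ? null : host.toLowerCase(Locale.ROOT);
    } catch (IllegalArgumentException e) {
      return null; // malformed URL: ignore it
    }
  }

  /**
   * Rolls page-level links up to host level.
   * Each element of pageLinks is {fromUrl, toUrl}.
   * Assumption: links within a single host are skipped.
   */
  static Map<String, Set<String>> hostOutlinks(String[][] pageLinks) {
    Map<String, Set<String>> out = new TreeMap<>();
    for (String[] link : pageLinks) {
      String from = hostOf(link[0]);
      String to = hostOf(link[1]);
      if (from == null || to == null || from.equals(to)) {
        continue;
      }
      out.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
    }
    return out;
  }

  /** Inverts host-level outlinks to obtain host-level inlinks. */
  static Map<String, Set<String>> hostInlinks(Map<String, Set<String>> outlinks) {
    Map<String, Set<String>> in = new TreeMap<>();
    for (Map.Entry<String, Set<String>> e : outlinks.entrySet()) {
      for (String target : e.getValue()) {
        in.computeIfAbsent(target, k -> new TreeSet<>()).add(e.getKey());
      }
    }
    return in;
  }
}
```

The sorted maps and sets are used only to keep the output deterministic; 
the actual job works on the Gora-backed stores rather than in memory.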

This patch adds Host store definitions to the Gora mapping for HBase only 
(other stores can easily be added later on). It depends on GORA-105, so the 
added functionality can only be used with a trunk version of Gora, or once 
Nutchgora updates to Gora 0.2, which should be soon.

No tests are included yet. For now this is okay, because by default this patch 
does not change existing functionality. (It is also a bit of a pain to add 
tests: the current tests depend on a valid SQLStore, but updating Gora results 
in a dropped SQLStore, so there is an issue that needs to be solved first, in 
a separate issue.)

Will commit this in a few days.
                
> Design a Host table in GORA
> ---------------------------
>
>                 Key: NUTCH-882
>                 URL: https://issues.apache.org/jira/browse/NUTCH-882
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: nutchgora
>            Reporter: Julien Nioche
>             Fix For: nutchgora
>
>         Attachments: NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt, 
> hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for : 
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages 
> * keeping a copy of the robots.txt and possibly use that later to filter the 
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
