Hi,

HBase with multiple versions is certainly an option; however, the current HBaseStore implementation was written with a single version in mind. (I have not really tested what happens with multiple versions; I guess you get unexpected/undefined results.) The exception would be to enable multiple versions only for specific column families (for example, putting 'content' into a separate column family). Storing new content would overwrite the old content, so you would need an external process or a custom tool to retrieve earlier versions from the store. For map-like fields (inlinks, outlinks, metadata), the results with multiple versions are a lot more confusing. There is still some work to do.
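To make the versioning behaviour concrete, here is a toy Python model of a single cell that keeps a bounded number of timestamped values. It is only a sketch: it is not the HBase client API and not Nutch code, and the names VersionedCell, put, get and max_versions are made up for illustration. It also trims old versions immediately, whereas HBase itself removes excess versions during compaction.

```python
class VersionedCell:
    """Toy model of one cell that keeps at most `max_versions` values.

    Hypothetical illustration only; this is not an HBase or Nutch class.
    """

    def __init__(self, max_versions=1):
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        # A put with an existing timestamp replaces that version.
        self._versions[timestamp] = value
        # Keep only the newest `max_versions` timestamps.
        for ts in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[ts]

    def get(self, versions=1):
        # Return up to `versions` (timestamp, value) pairs, newest first.
        newest_first = sorted(self._versions.items(), reverse=True)
        return newest_first[:versions]


single = VersionedCell(max_versions=1)  # the case HBaseStore assumes
multi = VersionedCell(max_versions=3)   # a 'content' family with more versions

for ts, html in [(1, "v1"), (2, "v2"), (3, "v3"), (4, "v4")]:
    single.put(ts, html)
    multi.put(ts, html)

print(single.get(versions=10))  # [(4, 'v4')]
print(multi.get(versions=10))   # [(4, 'v4'), (3, 'v3'), (2, 'v2')]
print(multi.get())              # [(4, 'v4')]
```

A default read sees only the newest value in both cases, which is why a separate tool that asks for more versions explicitly would be needed to get at the older snapshots.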
In short, yes, HBase would work, but you would definitely have to hack a custom HBaseStore if you want to keep track of snapshots reliably.

Ferdy.

On Tue, Oct 9, 2012 at 3:30 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi James
>
> You could have a custom map reduce job to copy the documents with a custom
> ID as you just described. Another option would be to use Nutch 2 + HBase
> and set a large value of versions
> (http://hbase.apache.org/book/schema.versions.html) in the HBase schema.
>
> Julien
>
> On 9 October 2012 11:17, <j.sulli...@thomsonreuters.com> wrote:
>
> > Hi
> >
> > Rather than a wide crawl of the web keeping track of the current state of
> > sites (as I understand Nutch is currently optimized for) I am interested in
> > keeping copies of a more modest number of sites over time as they change.
> > In other words keeping copies of both the old webpages and the new pages as
> > they change. My overly optimistic wishful thinking is that I could get
> > close enough to this by simply adding the signature (TextProfileSignature
> > in particular) to the current id key. Any thoughts as to if this is
> > feasible and if so where in the codebase I should start looking in order to
> > do that? I am aware Heritrix specializes in archiving but I would really
> > like to stick with Nutch if possible unless it absolutely doesn't make
> > sense.
> >
> > Thanks
> >
> > James
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble