[jira] [Commented] (HBASE-5604) HLog replay tool that generates HFiles for use by LoadIncrementalHFiles.

stack (Commented) (JIRA) Sat, 24 Mar 2012 16:38:51 -0700

    [ https://issues.apache.org/jira/browse/HBASE-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237726#comment-13237726 ]


stack commented on HBASE-5604:
------------------------------

I like the postgres link.  We should point at it when we document this tool.

bq. This could be an M/R (with a mapper per HLog file).

You'll need a reduce step (I think you deduce this yourself above, but I'm 
saying it for completeness).  With a map-only MR job and many WALs, say 100s, 
you could end up with as many hfiles per region as there are WALs (if each WAL 
had at least one edit for that region).  You'd need a reducer to coalesce by 
region.
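
To make the file-count point concrete, here is a rough sketch in plain Python (not HBase or Hadoop code). The edit layout `(region, row, timestamp, value)` and both function names are made up for illustration; the real job would go through HFileOutputFormat.

```python
from collections import defaultdict

# Hypothetical edit layout: (region, row, timestamp, value).
# This is NOT the real WALEdit type, just enough to count output files.
wals = [
    [("r1", "a", 1, "x"), ("r2", "m", 2, "y")],
    [("r1", "b", 3, "z"), ("r2", "n", 4, "w")],
    [("r1", "a", 5, "v")],
]

def map_only_outputs(wals):
    """Model a map-only job: one output file per (WAL, region) pair."""
    files = defaultdict(list)
    for wal_index, wal in enumerate(wals):
        for region, row, ts, value in wal:
            files[(wal_index, region)].append((row, ts, value))
    return files

def reduced_outputs(wals):
    """Model a job with a reduce step: edits are grouped by region,
    so each region ends up with a single, sorted output file."""
    by_region = defaultdict(list)
    for wal in wals:
        for region, row, ts, value in wal:
            by_region[region].append((row, ts, value))
    # Sort by row ascending, then timestamp descending (newest first).
    return {
        region: sorted(edits, key=lambda e: (e[0], -e[1]))
        for region, edits in by_region.items()
    }

map_files = map_only_outputs(wals)
reduce_files = reduced_outputs(wals)
files_for_r1_map_only = sum(1 for (_, region) in map_files if region == "r1")
```

With three WALs that each touch region `r1`, the map-only model leaves three files for `r1`, while the reduced model leaves one.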

This tool would not apply the edits in the order in which we received them.  
We'd be reliant on sort order only, which should be fine I think, since that is 
what happens when edits are inserted via the memstore anyway.
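
A small sketch of why arrival order should not matter, again in plain Python with a made-up edit layout `(row, column, timestamp, value)`. It assumes cells sort by row, then column, then timestamp descending, and that a read returns the newest version. Ties on identical timestamps are not modeled here.

```python
import random

# Hypothetical edits: (row, column, timestamp, value).
edits = [
    ("row1", "cf:a", 10, "old"),
    ("row1", "cf:a", 20, "new"),
    ("row2", "cf:a", 15, "only"),
]

def sort_key(edit):
    """Row, then column, then newest timestamp first."""
    row, column, ts, _ = edit
    return (row, column, -ts)

def latest_values(edits):
    """Return the value a read would see for each (row, column):
    the first cell in sort order, i.e. the one with the highest timestamp."""
    result = {}
    for row, column, ts, value in sorted(edits, key=sort_key):
        result.setdefault((row, column), value)
    return result

# Apply the edits in arrival order, then in a shuffled order.
arrival_order = latest_values(edits)
shuffled = edits[:]
random.Random(0).shuffle(shuffled)
replay_order = latest_values(shuffled)
```

Both orders give the same visible values, because sorting discards the order in which the edits were fed in.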

bq. Hmm... Maybe this is only useful when we have a lot of logs ....maybe there 
would be no advantage here turning this into an M/R job, but maybe it should 
just be a standalone client...?

Well, even with only tens of files, you'd want to parallelize it to do the 
filtering, etc., so MR sounds right.

Or you could hack on the distributed split code to add a 'filtering' 
facility... so it dropped edits that were outside of a range -- e.g. not one of 
the specified tables or not within a time range.  The output of distributed log 
splitting is only replayed on region open, so you'd need to figure out how to 
get the region to load the edits (an MR job to write hfiles sounds far more 
straightforward by comparison).
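
The filtering itself is simple wherever it ends up living. A sketch in plain Python, with a made-up edit layout `(table, row, timestamp, value)` and a hypothetical `keep_edit` helper. Treating the time range as half-open, `[min_ts, max_ts)`, is an assumption of this sketch.

```python
# Hypothetical edits: (table, row, timestamp, value).
edits = [
    ("t1", "a", 5, "x"),
    ("t2", "b", 5, "y"),   # dropped: table not requested
    ("t1", "c", 50, "z"),  # dropped: timestamp at the exclusive upper bound
    ("t1", "d", 10, "w"),
]

def keep_edit(edit, tables, min_ts, max_ts):
    """Keep an edit only if it belongs to one of the requested tables
    and its timestamp falls in the half-open range [min_ts, max_ts)."""
    table, _row, ts, _value = edit
    return table in tables and min_ts <= ts < max_ts

def filter_edits(edits, tables, min_ts, max_ts):
    """Drop every edit that is outside the requested tables or time range."""
    return [e for e in edits if keep_edit(e, tables, min_ts, max_ts)]

kept = filter_edits(edits, tables={"t1"}, min_ts=0, max_ts=50)
```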

                
> HLog replay tool that generates HFiles for use by LoadIncrementalHFiles.
> ------------------------------------------------------------------------
>
>                 Key: HBASE-5604
>                 URL: https://issues.apache.org/jira/browse/HBASE-5604
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Lars Hofhansl
>
> Just an idea I had. Might be useful for restore of a backup using the HLogs.
> This could be an M/R (with a mapper per HLog file).
> The tool would get a timerange and a (set of) table(s). We'd pick the right 
> HLogs based on time before the M/R job is started and then have a mapper per 
> HLog file.
> The mapper would then go through the HLog, filter out all WALEdits that 
> don't fit into the time range or don't belong to any of the tables, and then 
> use HFileOutputFormat to generate HFiles.
> Would need to indicate the splits we want, probably from a live table.

