[ https://issues.apache.org/jira/browse/HBASE-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237726#comment-13237726 ]
stack commented on HBASE-5604:
------------------------------

I like the postgres link. We should point at that when we doc this tool.

bq. This could be an M/R (with a mapper per HLog file).

You'll need a reduce (I think you deduce this yourself above, but saying it for completeness). With a map-only MR job and many WALs, say hundreds, you could end up with that many hfiles per region (if each WAL had at least one edit for the region). You'd need a reducer to coalesce by region.

This tool would not apply the edits in the order in which we received them. We'd be reliant on sort order only, which should be fine I think, since this is what happens when edits are inserted via the memstore anyway.

bq. Hmm... Maybe this is only useful when we have a lot of logs ....maybe there would be no advantage here turning this into an M/R job, but maybe it should just be a standalone client...?

Well, even with only tens of files, you'd want to parallelize it to do the filtering, etc., so MR sounds right.

Or you could hack on the distributed split code to add a 'filtering' facility so that it dropped edits outside of a range, e.g. edits not in one of the specified tables or not in the time range. The output of distributed log splitting is only replayed on region open, so you'd need to figure out how to get the region to load the edits. (An MR job that writes hfiles sounds way more straightforward in comparison.)

> HLog replay tool that generates HFiles for use by LoadIncrementalHFiles.
> ------------------------------------------------------------------------
>
>                 Key: HBASE-5604
>                 URL: https://issues.apache.org/jira/browse/HBASE-5604
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Lars Hofhansl
>
> Just an idea I had. Might be useful for restore of a backup using the HLogs.
> This could be an M/R (with a mapper per HLog file).
> The tool would get a timerange and a (set of) table(s). We'd pick the right
> HLogs based on time before the M/R job is started and then have a mapper per
> HLog file.
> The mapper would then go through the HLog, filter out all WALEdits that don't
> fit into the time range or don't belong to any of the tables, and then use
> HFileOutputFormat to generate HFiles.
> Would need to indicate the splits we want, probably from a live table.
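
To make the map-then-reduce shape discussed above concrete, here is a minimal sketch in plain Python. It is not HBase code: the `Edit` record, the function names, the half-open time range, and the representation of region boundaries as a sorted list of split keys per table are all assumptions made for illustration. The map step filters each WAL by table and time range; the reduce step coalesces the surviving edits from all WALs into one sorted batch per region, so the number of output files tracks the number of regions rather than the number of WALs.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical stand-in for one WAL edit (not an HBase class)."""
    table: str
    row: bytes
    timestamp: int
    value: bytes


def map_wal(edits, tables, min_ts, max_ts):
    """Map step for one WAL: keep edits for the requested tables whose
    timestamp falls in [min_ts, max_ts) (half-open range is an assumption)."""
    for e in edits:
        if e.table in tables and min_ts <= e.timestamp < max_ts:
            yield e


def region_index(split_keys, row):
    """Index of the region containing `row`, given sorted split keys.
    Rows below the first split key land in region 0."""
    return bisect_right(split_keys, row)


def replay(wals, tables, min_ts, max_ts, split_keys):
    """Reduce step: gather mapped edits from every WAL into one batch per
    (table, region), sorted by row and then newest timestamp first.
    Ordering relies on sort order only, not on arrival order."""
    regions = defaultdict(list)
    for wal in wals:
        for e in map_wal(wal, tables, min_ts, max_ts):
            key = (e.table, region_index(split_keys[e.table], e.row))
            regions[key].append(e)
    return {
        key: sorted(batch, key=lambda e: (e.row, -e.timestamp))
        for key, batch in regions.items()
    }
```

Called with two WALs that both touch the same region, `replay` returns a single batch for that region rather than one per WAL, which is the coalescing the reducer is there for. In a real job each batch would be handed to something like HFileOutputFormat, and the split keys would be read from the live table.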