[ https://issues.apache.org/jira/browse/HBASE-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lars Hofhansl updated HBASE-3936: --------------------------------- Fix Version/s: (was: 0.94.0) 0.96.0 > Incremental bulk load support for Increments > -------------------------------------------- > > Key: HBASE-3936 > URL: https://issues.apache.org/jira/browse/HBASE-3936 > Project: HBase > Issue Type: Improvement > Reporter: Andrew Purtell > Assignee: Andrew Purtell > Fix For: 0.96.0 > > > From http://hbase.apache.org/bulk-loads.html: "The bulk load feature uses a > MapReduce job to output table data in HBase's internal data format, and then > directly loads the data files into a running cluster. Using bulk load will > use less CPU and network than going via the HBase API." > I have been working with a specific implementation of, and can envision, a > class of applications that reduce data into a large collection of counters, > perhaps building projections of the data in many dimensions in the process. > One can use Hadoop MapReduce as the engine to accomplish this for a given > data set and use LoadIncrementalHFiles to move the result into place for live > serving. MR is natural for summation over very large counter sets: emit > counter increments for the data set and projections thereof in mappers, use > combiners for partial aggregation, use reducers to do final summation into > HFiles. > However, it is not possible to then merge in a set of updates to an existing > table built in the manner above without either 1) joining the table data and > the update set into a large MR temporary set, followed by a complete rewrite > of the table; or 2) posting all of the updates as Increments via the HBase > API, impacting any other concurrent users of the HBase service, and perhaps > taking 10-100 times longer than if updates could be computed directly into > HFiles like the original import. Both of these alternatives are expensive in > terms of CPU and time; one is also expensive in terms of disk. > I propose adding incremental bulk load support for Increments. Here is a > sketch of a possible implementation: > * Add a KV type for Increment > * Modify HFile main, LoadIncrementalHFiles, and others that work with HFiles > directly to handle the new KV type > * Bulk load API can move the files to be merged into the Stores as before. > * Implement an alternate compaction algorithm or modify the existing. Need to > identify Increments and apply them to an existing most recent version of a > value, or create the value if it does not exist. > ** Use KeyValueHeap as is to merge value-sets by row as before. > ** For each row, use a KV-keyed Map for in memory update of values. > ** If there is an existing value and it is not a serialized long, ignore > the Increment and log at INFO level. > ** Use the persistent HashMapWrapper from Hive's CommonJoinOperator, with > an appropriate memory limit, so work for overlarge rows will spill to disk. > Can be local disk, not HDFS. > * Never return an Increment KV to a client doing a Get or Scan. > ** Before the merge is complete, if we find an Increment KV when searching > Store files for a value, continue searching back in the Store files until we > find a Put KV for the value, adding up Increments as they are encountered, > then applying them to the Put value; or until search ends, in which case the > Increment is treated as a Put. > ** If there is an existing value and it is not a serialized long, ignore > the Increment and log at INFO level. > * As a beneficial side effect, with Increments as just another KV type we can > unify Put and Increment handling. > Because this is a core concern I'd prefer discussing this as a possible > enhancement of core as opposed to a Coprocessor-based extension. However it > could be possible to implement all but the KV changes within the Coprocessor > framework. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira