[ 
https://issues.apache.org/jira/browse/HBASE-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-3936:
---------------------------------

    Fix Version/s:     (was: 0.94.0)
                   0.96.0
    
> Incremental bulk load support for Increments
> --------------------------------------------
>
>                 Key: HBASE-3936
>                 URL: https://issues.apache.org/jira/browse/HBASE-3936
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>             Fix For: 0.96.0
>
>
> From http://hbase.apache.org/bulk-loads.html: "The bulk load feature uses a 
> MapReduce job to output table data in HBase's internal data format, and then 
> directly loads the data files into a running cluster. Using bulk load will 
> use less CPU and network than going via the HBase API."
> I have been working with a specific implementation of, and can envision, a 
> class of applications that reduce data into a large collection of counters, 
> perhaps building projections of the data in many dimensions in the process. 
> One can use Hadoop MapReduce as the engine to accomplish this for a given 
> data set and use LoadIncrementalHFiles to move the result into place for live 
> serving. MR is natural for summation over very large counter sets: emit 
> counter increments for the data set and projections thereof in mappers, use 
> combiners for partial aggregation, use reducers to do final summation into 
> HFiles.
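>
> A minimal sketch (not part of the original issue) of the final-summation 
> reducer in such a job; FAMILY and QUALIFIER are hypothetical placeholders, 
> and the driver is assumed to be set up for HFileOutputFormat (e.g. via 
> configureIncrementalLoad()) with this class as the reducer:
>
>   import java.io.IOException;
>   import org.apache.hadoop.hbase.KeyValue;
>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>   import org.apache.hadoop.hbase.util.Bytes;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.mapreduce.Reducer;
>
>   public class CounterSumReducer
>       extends Reducer<ImmutableBytesWritable, LongWritable,
>                       ImmutableBytesWritable, KeyValue> {
>     // Placeholder column coordinates for the counters.
>     private static final byte[] FAMILY = Bytes.toBytes("d");
>     private static final byte[] QUALIFIER = Bytes.toBytes("count");
>
>     @Override
>     protected void reduce(ImmutableBytesWritable row,
>         Iterable<LongWritable> increments, Context context)
>         throws IOException, InterruptedException {
>       long sum = 0;
>       for (LongWritable delta : increments) {
>         sum += delta.get();  // final summation of partial counts from combiners
>       }
>       // HFileOutputFormat writes the emitted KeyValues into HFiles, which
>       // LoadIncrementalHFiles then moves into the table's Stores.
>       context.write(row,
>           new KeyValue(row.get(), FAMILY, QUALIFIER, Bytes.toBytes(sum)));
>     }
>   }
>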
> However, it is not possible to then merge in a set of updates to an existing 
> table built in the manner above without either 1) joining the table data and 
> the update set into a large MR temporary set, followed by a complete rewrite 
> of the table; or 2) posting all of the updates as Increments via the HBase 
> API, impacting any other concurrent users of the HBase service, and perhaps 
> taking 10-100 times longer than if updates could be computed directly into 
> HFiles like the original import. Both of these alternatives are expensive in 
> terms of CPU and time; one is also expensive in terms of disk.
> I propose adding incremental bulk load support for Increments. Here is a 
> sketch of a possible implementation:
> * Add a KV type for Increment
> * Modify HFile main, LoadIncrementalHFiles, and others that work with HFiles 
> directly to handle the new KV type
> * Bulk load API can move the files to be merged into the Stores as before.
> * Implement an alternate compaction algorithm, or modify the existing one. We 
> need to identify Increments and apply them to the most recent existing version 
> of a value, or create the value if it does not exist.
>   ** Use KeyValueHeap as is to merge value-sets by row as before.
>   ** For each row, use a KV-keyed Map for in memory update of values.
>   ** If there is an existing value and it is not a serialized long, ignore 
> the Increment and log at INFO level.
>   ** Use the persistent HashMapWrapper from Hive's CommonJoinOperator, with 
> an appropriate memory limit, so work for overlarge rows will spill to disk; 
> this can be local disk rather than HDFS.
> * Never return an Increment KV to a client doing a Get or Scan. 
>   ** Before the merge is complete, if we find an Increment KV when searching 
> Store files for a value, continue searching back through the Store files 
> until we find a Put KV for that value, adding up Increments as they are 
> encountered and then applying them to the Put value; or until the search 
> ends, in which case the accumulated Increment is treated as a Put. (A sketch 
> of this merge follows after this list.)
>   ** If there is an existing value and it is not a serialized long, ignore 
> the Increment and log at INFO level.
> * As a beneficial side effect, with Increments as just another KV type we can 
> unify Put and Increment handling.
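>
> To make the merge rule above concrete (it applies both at compaction time 
> and on the read path), here is a minimal, hypothetical sketch that is not 
> part of the original proposal; it assumes a new Increment KV type with an 
> isIncrement() check that does not exist in HBase today, and it takes all 
> versions of a single row/family/qualifier ordered newest first:
>
>   import java.util.List;
>   import org.apache.commons.logging.Log;
>   import org.apache.commons.logging.LogFactory;
>   import org.apache.hadoop.hbase.KeyValue;
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   public class IncrementMerger {
>     private static final Log LOG = LogFactory.getLog(IncrementMerger.class);
>
>     // kvs holds every version of one row/family/qualifier, newest first.
>     static KeyValue merge(List<KeyValue> kvs) {
>       long delta = 0;
>       KeyValue newest = kvs.get(0);
>       for (KeyValue kv : kvs) {
>         if (isIncrement(kv)) {       // hypothetical check on the proposed type
>           delta += Bytes.toLong(kv.getValue());
>           continue;
>         }
>         byte[] base = kv.getValue(); // found the most recent Put
>         if (base.length != Bytes.SIZEOF_LONG) {
>           LOG.info("Existing value is not a serialized long, ignoring"
>               + " Increments: " + kv);
>           return kv;                 // keep the existing value untouched
>         }
>         delta += Bytes.toLong(base); // apply accumulated Increments to the Put
>         break;
>       }
>       // Either the sum was applied to the base Put, or no Put was found and
>       // the accumulated Increments are treated as a Put.
>       return new KeyValue(newest.getRow(), newest.getFamily(),
>           newest.getQualifier(), Bytes.toBytes(delta));
>     }
>
>     // Placeholder: would test the proposed Increment KV type code.
>     private static boolean isIncrement(KeyValue kv) {
>       throw new UnsupportedOperationException("needs the proposed KV type");
>     }
>   }
>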
> Because this is a core concern, I'd prefer discussing this as a possible 
> enhancement to core rather than as a Coprocessor-based extension. However, it 
> may be possible to implement all but the KV changes within the Coprocessor 
> framework.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
