[ 
https://issues.apache.org/jira/browse/HBASE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477137#comment-13477137
 ] 

Karthik Ranganathan commented on HBASE-5783:
--------------------------------------------

No, we track only the last (highest) one per region. Also, in the actual 
implementation, we did it with just timestamps from the RS. After doing all 
the puts, the loader gets the time on the RS (t1). The server tracks the start 
time of the last successfully completed flush (t2). Querying that and making 
sure t2 > t1 is enough. Of course, if the region has moved gracefully, that's 
also considered a success, as an optimization.
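
To make the check concrete, here is a minimal sketch of the t1/t2 comparison. The class, the enum and the `RegionState` holder are hypothetical names made up for illustration and are not HBase API; the only logic taken from the description above is "t2 > t1, or the region moved gracefully".

```java
/** Hypothetical sketch of the timestamp-based check; none of these names are HBase API. */
class FlushCheck {
    enum Outcome { DURABLE, NOT_YET_FLUSHED }

    /** Assumed per-region state that the RS reports back to the loader. */
    static final class RegionState {
        final long lastFlushStartTime;  // t2: start time of the last successfully completed flush
        final boolean movedGracefully;  // region was moved gracefully since the puts

        RegionState(long lastFlushStartTime, boolean movedGracefully) {
            this.lastFlushStartTime = lastFlushStartTime;
            this.movedGracefully = movedGracefully;
        }
    }

    /** t1 is the RS time the loader read after finishing all of its puts. */
    static Outcome check(long t1, RegionState state) {
        // A flush that started strictly after t1 must have covered every put made before t1.
        if (state.movedGracefully || state.lastFlushStartTime > t1) {
            return Outcome.DURABLE;
        }
        return Outcome.NOT_YET_FLUSHED;
    }

    public static void main(String[] args) {
        System.out.println(check(100L, new RegionState(101L, false)));
        System.out.println(check(100L, new RegionState(100L, false)));
    }
}
```

The comparison is strict on purpose: a flush that started at exactly t1 is not treated as proof that the last puts were included.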

We used the term "MR Bulk Loader" simply to say that the load of the data 
should be repeatable in case of failure (as opposed to an online use case).
                
> Faster HBase bulk loader
> ------------------------
>
>                 Key: HBASE-5783
>                 URL: https://issues.apache.org/jira/browse/HBASE-5783
>             Project: HBase
>          Issue Type: New Feature
>          Components: Client, IPC/RPC, Performance, regionserver
>            Reporter: Karthik Ranganathan
>            Assignee: Amitanand Aiyer
>
> We can get a 3x to 4x gain based on a prototype demonstrating this approach 
> in effect (hackily) over the MR bulk loader for very large data sets by doing 
> the following:
> 1. Do direct multi-puts from HBase client using GZIP compressed RPCs
> 2. Turn off WAL (we will ensure no data loss in another way)
> 3. For each bulk load client, we need to:
> 3.1 do a put
> 3.2 get back a tracking cookie (memstoreTs or HLogSequenceId) per put
> 3.3 be able to ask the RS if the tracking cookie has been flushed to disk
> 4. For each client, we can mark it successful if the tracking cookie for the 
> last put it did (for every RS) makes it to disk. Otherwise the map task fails 
> and is retried.
> 5. If the last put did not make it to disk for a timeout (say a second or so) 
> we issue a manual flush.
> Enhancements:
> - Increase the memstore size so that we flush larger files
> - Decrease the compaction ratios (say increase the number of files to compact)
> Quick background:
> The bottlenecks in the multiput approach are that the data is transferred 
> *uncompressed* twice over the top-of-rack: once from the client to the RS (on 
> the multi put call) and again because of WAL (HDFS replication). We reduced 
> the former with RPC compression and eliminated the latter above while still 
> guaranteeing that data won't be lost.
> This is better than the MR bulk loader at a high level because we don't need 
> to merge sort all the files for a given region and then make it an HFile - 
> that's the equivalent of bulk loading AND major compacting in one shot. Also 
> there is much more disk involved in the MR method (sort/spill).
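
Steps 3 to 5 of the quoted description could look roughly like the sketch below on the client side. Everything here is an assumption for illustration: the `RegionServer` interface, its two methods and the polling interval are hypothetical and do not correspond to any existing HBase API. It only shows the control flow of "wait for the last cookie to reach disk, force a flush after a timeout, otherwise fail the task".

```java
/** Hypothetical client-side wait loop; the interface below is not HBase API. */
class CookieWait {
    /** Assumed view of one RS as seen by a bulk load client. */
    interface RegionServer {
        long lastFlushedCookie();  // highest tracking cookie known to be on disk
        void flush();              // request a manual flush
    }

    /**
     * Returns true once lastCookie is on disk. Polls for up to timeoutMs,
     * then issues one manual flush (step 5) and checks a final time.
     * A false result means the map task should fail and be retried (step 4).
     */
    static boolean awaitDurable(RegionServer rs, long lastCookie, long timeoutMs, long pollMs) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMs * 1_000_000L;
        while (System.nanoTime() - start < timeoutNanos) {
            if (rs.lastFlushedCookie() >= lastCookie) {
                return true;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt for the caller
                return false;
            }
        }
        rs.flush();
        return rs.lastFlushedCookie() >= lastCookie;
    }
}
```

A real client would run this once per RS it wrote to and succeed only if every call returns true.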

