[ https://issues.apache.org/jira/browse/HBASE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595531#comment-13595531 ]
Anoop Sam John commented on HBASE-5783:
---------------------------------------

[~amitanand] This issue is now marked as fixed. Which version is it fixed in? 89fb? Where can I see the patch?

> Faster HBase bulk loader
> ------------------------
>
>                 Key: HBASE-5783
>                 URL: https://issues.apache.org/jira/browse/HBASE-5783
>             Project: HBase
>          Issue Type: New Feature
>          Components: Client, IPC/RPC, Performance, regionserver
>            Reporter: Karthik Ranganathan
>            Assignee: Amitanand Aiyer
>
> We can get a 3x to 4x gain based on a prototype demonstrating this approach
> in effect (hackily) over the MR bulk loader for very large data sets by doing
> the following:
> 1. Do direct multi-puts from the HBase client using GZIP-compressed RPCs
> 2. Turn off the WAL (we will ensure no data loss in another way)
> 3. For each bulk load client, we need to:
> 3.1 do a put
> 3.2 get back a tracking cookie (memstoreTs or HLogSequenceId) per put
> 3.3 be able to ask the RS if the tracking cookie has been flushed to disk
> 4. Each client succeeds if the tracking cookie for the last put it did
> (for every RS) makes it to disk. Otherwise the map task fails and is
> retried.
> 5. If the last put has not made it to disk within a timeout (say a second or
> so), we issue a manual flush.
> Enhancements:
> - Increase the memstore size so that we flush larger files
> - Decrease the compaction ratios (say, increase the number of files to compact)
> Quick background:
> The bottlenecks in the multi-put approach are that the data is transferred
> *uncompressed* twice over the top-of-rack: once from the client to the RS (on
> the multi-put call) and again because of the WAL (HDFS replication). We reduced
> the former with RPC compression and eliminated the latter above while still
> guaranteeing that data won't be lost.
> This is better than the MR bulk loader at a high level because we don't need
> to merge sort all the files for a given region and then make it an HFile -
> that's the equivalent of bulk loading AND major-compacting in one shot. Also
> there is much more disk involved in the MR method (sort/spill).
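
The tracking-cookie protocol in steps 3 to 5 of the description can be sketched roughly as below. This is a minimal, self-contained Python simulation and not HBase code: `FakeRegionServer`, `BulkLoadClient`, `multi_put`, `is_flushed`, and the timeout parameters are all hypothetical names chosen for illustration, and the "cookie" is a plain sequence counter standing in for memstoreTs or HLogSequenceId.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FakeRegionServer:
    """Hypothetical in-memory stand-in for a region server (not an HBase API)."""
    name: str
    next_seq_id: int = 1
    flushed_up_to: int = 0  # highest sequence ID persisted to "disk"
    memstore: list = field(default_factory=list)
    disk: list = field(default_factory=list)

    def multi_put(self, rows):
        """Buffer rows with no WAL; return the cookie of the last put (3.1, 3.2)."""
        for row in rows:
            self.memstore.append((self.next_seq_id, row))
            self.next_seq_id += 1
        return self.next_seq_id - 1

    def is_flushed(self, cookie):
        """Report whether everything up to this cookie is on disk (3.3)."""
        return cookie <= self.flushed_up_to

    def flush(self):
        """Move the whole memstore to disk and advance the flushed marker."""
        if self.memstore:
            self.flushed_up_to = self.memstore[-1][0]
            self.disk.extend(row for _, row in self.memstore)
            self.memstore.clear()


class BulkLoadClient:
    """Hypothetical client that remembers the last cookie per region server."""

    def __init__(self, servers, flush_timeout_s=1.0, poll_interval_s=0.05):
        self.servers = servers  # dict: server name -> FakeRegionServer
        self.flush_timeout_s = flush_timeout_s
        self.poll_interval_s = poll_interval_s
        self.last_cookie = {}

    def put(self, server_name, rows):
        cookie = self.servers[server_name].multi_put(rows)
        self.last_cookie[server_name] = cookie
        return cookie

    def commit(self):
        """Return True only if the last put on every server reached disk."""
        for name, cookie in self.last_cookie.items():
            rs = self.servers[name]
            deadline = time.monotonic() + self.flush_timeout_s
            while not rs.is_flushed(cookie) and time.monotonic() < deadline:
                time.sleep(self.poll_interval_s)
            if not rs.is_flushed(cookie):
                rs.flush()  # step 5: manual flush once the timeout expires
            if not rs.is_flushed(cookie):
                return False  # step 4: the map task should fail and be retried
        return True
```

In this sketch a map task would call `put` for each batch and `commit` once at the end, failing the task if `commit` returns `False`. A region server crash, which would discard the memstore and make the cookie unreachable, is not modeled here.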