[ 
https://issues.apache.org/jira/browse/HBASE-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042453#comment-15042453
 ] 

Jerry He commented on HBASE-13153:
----------------------------------

I have a use case for which this feature would be quite useful. 
We have a SQL-on-Hadoop/HBase engine.  When inserting into HBase, we try to be 
smart and sometimes optimize by using bulk load.
For example, when doing 'INSERT INTO my-hbase-table SELECT  col1 from table1', 
we check whether the cardinality is large (say > 20000). If it is, we generate 
HFiles and bulk load them instead of running table puts.
The problem is that replication does not kick in for this new data.  
For cross-cluster bulk load, people would probably use an external tool (e.g. 
DistCp) to move the MR-generated HFiles to the target cluster. 
But in this case, it would be difficult to save and transport the HFiles for 
bulk load to the peer cluster, since they are generated on the fly inside the 
SQL engine.
So this is a good feature to have.
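
As a rough illustration of the write-path decision described above, here is a 
minimal Java sketch. The class, enum, and method names are hypothetical and do 
not come from our SQL engine or from HBase; the only value taken from the 
comment is the example cutoff of 20000.

```java
// Hypothetical sketch: all names here are illustrative assumptions,
// not part of HBase or of any real SQL engine.
final class WritePathChooser {

    enum WritePath { TABLE_PUTS, BULK_LOAD }

    // Example cutoff from the comment ("say > 20000").
    static final long BULK_LOAD_THRESHOLD = 20_000L;

    // Picks bulk load only when the estimated cardinality exceeds the cutoff.
    static WritePath choose(long estimatedCardinality) {
        if (estimatedCardinality < 0) {
            throw new IllegalArgumentException("cardinality must be non-negative");
        }
        return estimatedCardinality > BULK_LOAD_THRESHOLD
                ? WritePath.BULK_LOAD
                : WritePath.TABLE_PUTS;
    }

    public static void main(String[] args) {
        System.out.println(choose(500L));        // small insert
        System.out.println(choose(1_000_000L));  // large insert
    }
}
```

Only the BULK_LOAD branch bypasses the WAL, so only that branch is affected by 
the replication gap discussed in this issue.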

Regarding the network latency and the impact on HBase instances, I think we 
should add notes/best practices/warnings to the release notes. They should 
mention that potentially large files need to be copied over the network by 
HBase handlers, and the potential impact on the source and peer clusters. They 
should also include recommendations, such as increasing the RPC timeout values.
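
To make the timeout concern concrete, here is a back-of-the-envelope Java 
sketch that estimates how long a single file copy would take and compares it 
with a timeout. The class name, the file size, the throughput, and the 60 
second timeout are all assumptions chosen for illustration; they are not 
measurements or HBase settings taken from this issue.

```java
// Hypothetical sketch: the names and the example numbers are assumptions.
final class HFileCopyEstimate {

    // Estimated copy time in milliseconds, rounded up.
    static long copyMillis(long fileBytes, long bytesPerSecond) {
        if (fileBytes < 0 || bytesPerSecond <= 0) {
            throw new IllegalArgumentException("invalid size or throughput");
        }
        return (long) Math.ceil(fileBytes * 1000.0 / bytesPerSecond);
    }

    // True when the estimated copy time is longer than the given timeout.
    static boolean exceedsTimeout(long fileBytes, long bytesPerSecond,
                                  long timeoutMillis) {
        return copyMillis(fileBytes, bytesPerSecond) > timeoutMillis;
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;     // assumed HFile size
        long hundredMiBps = 100L * 1024 * 1024;     // assumed throughput
        long assumedTimeoutMillis = 60_000L;        // assumed timeout
        System.out.println(copyMillis(tenGiB, hundredMiBps) + " ms");
        System.out.println(exceedsTimeout(tenGiB, hundredMiBps,
                                          assumedTimeoutMillis));
    }
}
```

Under these assumed numbers the copy takes about 102 seconds, well past a 60 
second timeout, which is the kind of case the release notes should warn about.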

> Bulk Loaded HFile Replication
> -----------------------------
>
>                 Key: HBASE-13153
>                 URL: https://issues.apache.org/jira/browse/HBASE-13153
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>            Reporter: sunhaitao
>            Assignee: Ashish Singhi
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-13153-branch-1-v18.patch, HBASE-13153-v1.patch, 
> HBASE-13153-v10.patch, HBASE-13153-v11.patch, HBASE-13153-v12.patch, 
> HBASE-13153-v13.patch, HBASE-13153-v14.patch, HBASE-13153-v15.patch, 
> HBASE-13153-v16.patch, HBASE-13153-v17.patch, HBASE-13153-v18.patch, 
> HBASE-13153-v2.patch, HBASE-13153-v3.patch, HBASE-13153-v4.patch, 
> HBASE-13153-v5.patch, HBASE-13153-v6.patch, HBASE-13153-v7.patch, 
> HBASE-13153-v8.patch, HBASE-13153-v9.patch, HBASE-13153.patch, HBase Bulk 
> Load Replication-v1-1.pdf, HBase Bulk Load Replication-v2.pdf, HBase Bulk 
> Load Replication-v3.pdf, HBase Bulk Load Replication.pdf, HDFS_HA_Solution.PNG
>
>
> Currently we plan to use the HBase Replication feature to deal with disaster 
> tolerance scenarios. But we have encountered an issue: we use bulk load very 
> frequently, and because bulk load bypasses the write path and does not 
> generate WAL entries, the data is not replicated to the backup cluster. It's 
> inappropriate to bulk load twice, on both the active cluster and the backup 
> cluster. So I suggest modifying the bulk load feature so that it loads to 
> both the active cluster and the backup cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
