[ https://issues.apache.org/jira/browse/HBASE-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721380#comment-14721380 ]
Lars Hofhansl commented on HBASE-13153:
---------------------------------------

Thanks for thinking about this.

Generally it might be worth considering doing this a layer above HBase. I.e. some code will generate a set of HFiles to be bulk loaded. Before the actual bulk load happens we could ship the HFiles to the slave cluster and do the bulk loading there (just the loading, not the generation of the files)... This just as a general comment.

bq. Replication module will be one of the BulkLoad Actions Listener, so it will get notification about newly added hfiles along with their hdfs paths.

What if that notification is missed? For example, what if the RS dies just then? WAL replication does not have this issue, since it always deals with all existing WALs and so cannot miss anything.

bq. HFileReplicationEndPoint will maintain a queue of hfiles. After every configurable interval or max request size limit, it will send a RPC request to peer cluster RS with all queued entries.

So you'll send the HFiles over RPC? These files can be huge. Can we use HDFS' DistCp here?

bq. HFileReplicationEndPoint will maintain a queue of hfiles. After every configurable interval or max request size limit, it will send a RPC request to peer cluster RS with all queued entries.

Can we simply use the standard bulk load mechanism here? It would split the files as necessary.

bq. The hfile should not get deleted from archive folder until the replication is finished.

You'll need to ensure this somehow.

bq. Cyclic replication: There will not be any data validation for cyclic case.

That can lead to very tricky issues where the same files just go from cluster to cluster in a never-ending cycle. We know at the source that the HFiles came from a bulk load; maybe we can handle that specially.

Lastly, it might generally be a good option to copy HFiles around rather than WALs (at least for some setups). Could we use this to do that?
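
To make the cyclic-replication concern concrete, here is a minimal sketch of one way bulk-loaded files might be "handled specially", assuming each bulk-load event carries the ids of the clusters that have already loaded it. All names here (BulkLoadCycleGuard, BulkLoadEvent, peersToShipTo) are hypothetical and are not HBase APIs; this only illustrates the idea and is not what the design document proposes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these names are HBase APIs. */
final class BulkLoadCycleGuard {

    /** A bulk-load event tagged with the ids of clusters that have already loaded it. */
    static final class BulkLoadEvent {
        final String hfilePath;
        final Set<String> visitedClusterIds;

        BulkLoadEvent(String hfilePath, Set<String> visitedClusterIds) {
            this.hfilePath = hfilePath;
            // Defensive, read-only copy so an event cannot be mutated after creation.
            this.visitedClusterIds =
                Collections.unmodifiableSet(new LinkedHashSet<>(visitedClusterIds));
        }

        /** Returns a copy of this event that also records the given cluster as visited. */
        BulkLoadEvent markVisited(String clusterId) {
            Set<String> ids = new LinkedHashSet<>(visitedClusterIds);
            ids.add(clusterId);
            return new BulkLoadEvent(hfilePath, ids);
        }
    }

    /** Peers that should still receive the event: those not already in its visited set. */
    static List<String> peersToShipTo(BulkLoadEvent event, List<String> peerClusterIds) {
        List<String> targets = new ArrayList<>();
        for (String peer : peerClusterIds) {
            if (!event.visitedClusterIds.contains(peer)) {
                targets.add(peer);
            }
        }
        return targets;
    }

    private BulkLoadCycleGuard() {
        // Static helpers only.
    }
}
```

With two clusters A and B replicating to each other, an event that originates on A is shipped to B; once B records itself as visited, its only peer (A) is already in the set, so the file is not sent back and the cycle ends.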
> enable bulkload to support replication
> --------------------------------------
>
>                 Key: HBASE-13153
>                 URL: https://issues.apache.org/jira/browse/HBASE-13153
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>            Reporter: sunhaitao
>            Assignee: Ashish Singhi
>             Fix For: 2.0.0
>
>         Attachments: HBase Bulk Load Replication.pdf
>
>
> Currently we plan to use the HBase Replication feature to deal with a disaster
> tolerance scenario. But we encounter an issue: we will use bulkload very
> frequently, and because bulkload bypasses the write path and does not generate
> WAL, the data will not be replicated to the backup cluster. It's inappropriate to
> bulkload twice, on both the active cluster and the backup cluster. So I advise
> modifying the bulkload feature so that a bulkload is applied to both the active
> cluster and the backup cluster.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)