[ https://issues.apache.org/jira/browse/HBASE-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang updated HBASE-12770:
------------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Pushed to master and branch-1. Thanks all for reviewing.

> Don't transfer all the queued hlogs of a dead server to the same alive server
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-12770
>                 URL: https://issues.apache.org/jira/browse/HBASE-12770
>             Project: HBase
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 2.0.0, 1.4.0
>            Reporter: Jianwei Cui
>            Assignee: Phil Yang
>            Priority: Minor
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: HBASE-12770-branch-1-v1.patch,
> HBASE-12770-branch-1-v2.patch, HBASE-12770-branch-1-v3.patch,
> HBASE-12770-branch-1-v3.patch, HBASE-12770-branch-1-v3.patch,
> HBASE-12770-branch-1-v3.patch, HBASE-12770-trunk.patch, HBASE-12770-v1.patch,
> HBASE-12770-v2.patch, HBASE-12770-v3.patch, HBASE-12770-v3.patch
>
>
> When a region server goes down (or the cluster restarts), all of its hlog
> queues are transferred to the same alive region server. In a shared cluster,
> we might create several peers replicating data to different peer clusters.
> Many hlogs may be queued for these peers for several reasons: some peers
> might be disabled, errors from a peer cluster might prevent replication, or
> the replication sources might fail to read some hlogs because of an HDFS
> problem. Then, if the server goes down or is restarted, another alive server
> takes over all the replication jobs of the dead server. This can put heavy
> pressure on the resources (network/disk read) of the alive server, and a
> single server is also not fast enough to replicate the queued hlogs.
> And if that alive server goes down, all of its replication jobs, including
> those taken over from other dead servers, will once again be transferred in
> full to another alive server. This can leave one server with a large number
> of queued hlogs (in our shared cluster, we have seen one server with
> thousands of queued hlogs for replication). As an alternative, would it be
> reasonable for an alive server to transfer only one peer's hlogs from the
> dead server at a time? Other alive region servers would then have the
> opportunity to transfer the hlogs of the remaining peers. This may also help
> the queued hlogs be processed faster. Any discussion is welcome.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
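
The proposal in the description (an alive server takes only one peer's queue from the dead server at a time, so other alive servers can pick up the rest) can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code from the attached patches: the class name, the method name, and the server and peer ids are all hypothetical, queues are identified by peer id alone, and the ZooKeeper-based claiming that real region servers perform is replaced by a plain in-memory queue with the servers taking turns.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not HBase code: each replication queue of the dead
// server is represented only by its peer id.
class QueueClaimSketch {

    /**
     * Alive servers take turns. On each turn a server claims at most one
     * peer queue of the dead server, until no unclaimed queue is left.
     */
    static Map<String, List<String>> claimOneQueuePerTurn(
            List<String> deadServerPeerQueues, List<String> aliveServers) {
        Deque<String> unclaimed = new ArrayDeque<>(deadServerPeerQueues);
        Map<String, List<String>> claimed = new LinkedHashMap<>();
        for (String server : aliveServers) {
            claimed.put(server, new ArrayList<>());
        }
        if (aliveServers.isEmpty()) {
            return claimed; // nobody can claim anything
        }
        while (!unclaimed.isEmpty()) {
            for (String server : aliveServers) {
                String peerQueue = unclaimed.poll(); // null once nothing is left
                if (peerQueue == null) {
                    break;
                }
                claimed.get(server).add(peerQueue);
            }
        }
        return claimed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> result = claimOneQueuePerTurn(
                List.of("peer1", "peer2", "peer3", "peer4", "peer5"),
                List.of("rs-a", "rs-b", "rs-c"));
        result.forEach((server, queues) -> System.out.println(server + " -> " + queues));
    }
}
```

With five peer queues and three alive servers, no server in this model ends up with more than two queues, whereas the behaviour described in the issue would hand all five to a single server.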