[ https://issues.apache.org/jira/browse/HBASE-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Liangliang updated HBASE-12865:
----------------------------------
    Attachment: HBASE-12865-V1.diff

> WALs may be deleted before they are replicated to peers
> -------------------------------------------------------
>
>                 Key: HBASE-12865
>                 URL: https://issues.apache.org/jira/browse/HBASE-12865
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Liu Shaohui
>            Assignee: He Liangliang
>            Priority: Critical
>         Attachments: HBASE-12865-V1.diff
>
>
> By design, ReplicationLogCleaner guarantees that WALs still in a
> replication queue cannot be deleted by the HMaster. The
> ReplicationLogCleaner gets the set of WALs from ZooKeeper by scanning the
> replication znodes. However, it may get an incomplete WAL set during
> replication failover, because the scan operation is not atomic.
>
> For example, there are three region servers (rs1, rs2, rs3) and one peer
> with id 10. The layout of the replication znodes is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs2/10/wals
>                      /rs3/10/wals
> {code}
> - t1: the ReplicationLogCleaner finishes scanning the replication queue of
> rs1 and starts to scan the queue of rs2.
> - t2: region server rs3 goes down, and rs1 takes over rs3's replication
> queue. The new layout is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs1/10-rs3/wals
>                      /rs2/10/wals
>                      /rs3
> {code}
> - t3: the ReplicationLogCleaner finishes scanning the queue of rs2 and
> starts to scan the node of rs3, but the queue has already been moved to
> "/hbase/replication/rs/rs1/10-rs3/wals".
>
> So the ReplicationLogCleaner misses the WALs of rs3 for peer 10, and the
> HMaster may delete these WALs before they are replicated to the peer
> clusters.
>
> We encountered this problem in our cluster, and I think it is a serious bug
> for replication.
>
> Suggestions for fixing this bug are welcome. Thanks.
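
The t1/t2/t3 interleaving described in the issue can be reproduced with a small in-memory model. The sketch below is not HBase code and does not talk to ZooKeeper: the class `ReplicationScanRace`, its methods, and the WAL names such as `wal-rs3-1` are all hypothetical, chosen only to illustrate how a server-by-server scan can miss a queue that moves during failover.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the replication znodes; an illustration, not HBase code. */
class ReplicationScanRace {

    /** region server -> (queue id -> WAL names), standing in for /hbase/replication/rs. */
    static Map<String, Map<String, Set<String>>> initialLayout() {
        Map<String, Map<String, Set<String>>> rs = new LinkedHashMap<>();
        for (String server : new String[] {"rs1", "rs2", "rs3"}) {
            Set<String> wals = new TreeSet<>();
            wals.add("wal-" + server + "-1"); // hypothetical WAL names
            wals.add("wal-" + server + "-2");
            Map<String, Set<String>> queues = new LinkedHashMap<>();
            queues.put("10", wals);           // one queue per server for peer 10
            rs.put(server, queues);
        }
        return rs;
    }

    /** Models t2: the dead server's queues move to the survivor as "<queue>-<dead>". */
    static void failover(Map<String, Map<String, Set<String>>> rs, String dead, String survivor) {
        Map<String, Set<String>> deadQueues = rs.get(dead);
        for (Map.Entry<String, Set<String>> e : deadQueues.entrySet()) {
            rs.get(survivor).put(e.getKey() + "-" + dead, e.getValue());
        }
        deadQueues.clear(); // the dead server's node remains, but is now empty
    }

    /**
     * Non-atomic scan: lists the servers once, then reads them one by one.
     * The hook runs right after afterServer has been read, so a concurrent
     * change can be injected at a fixed point in the scan.
     */
    static Set<String> scan(Map<String, Map<String, Set<String>>> rs,
                            String afterServer, Runnable hook) {
        Set<String> seen = new TreeSet<>();
        for (String server : new ArrayList<>(rs.keySet())) {
            for (Set<String> wals : rs.get(server).values()) {
                seen.addAll(wals);
            }
            if (server.equals(afterServer)) {
                hook.run();
            }
        }
        return seen;
    }

    /** Every WAL currently referenced by any queue. */
    static Set<String> allWals(Map<String, Map<String, Set<String>>> rs) {
        Set<String> all = new TreeSet<>();
        for (Map<String, Set<String>> queues : rs.values()) {
            for (Set<String> wals : queues.values()) {
                all.addAll(wals);
            }
        }
        return all;
    }

    /** Runs the scenario: rs3 fails over to rs1 after rs1 has already been scanned. */
    static Set<String> missedWals() {
        Map<String, Map<String, Set<String>>> rs = initialLayout();
        Set<String> seen = scan(rs, "rs1", () -> failover(rs, "rs3", "rs1"));
        Set<String> missed = allWals(rs);
        missed.removeAll(seen);
        return missed;
    }

    public static void main(String[] args) {
        // Prints: WALs the cleaner never saw: [wal-rs3-1, wal-rs3-2]
        System.out.println("WALs the cleaner never saw: " + missedWals());
    }
}
```

In this model the two WALs of rs3 are still referenced by the queue `10-rs3` under rs1, yet the scan never reports them, so a cleaner relying on the scan result alone would treat them as deletable.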