[ https://issues.apache.org/jira/browse/HBASE-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662793#comment-14662793 ]
Hudson commented on HBASE-12865:
--------------------------------

FAILURE: Integrated in HBase-1.2-IT #80 (See [https://builds.apache.org/job/HBase-1.2-IT/80/])
HBASE-12865 WALs may be deleted before they are replicated to peers (He Liangliang) (apurtell: rev 7abb12be26115eda7341b82b9860990a14bc6040)
* hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationQueuesClient.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationStateBasic.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/master/ReplicationLogCleaner.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationQueuesZKImpl.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationQueuesClientZKImpl.java


> WALs may be deleted before they are replicated to peers
> -------------------------------------------------------
>
>                 Key: HBASE-12865
>                 URL: https://issues.apache.org/jira/browse/HBASE-12865
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Liu Shaohui
>            Assignee: He Liangliang
>            Priority: Critical
>             Fix For: 2.0.0, 0.98.14, 1.0.2, 1.2.0, 1.1.2, 1.3.0
>
>         Attachments: HBASE-12865-V1.diff, HBASE-12865-V2.diff
>
>
> By design, the ReplicationLogCleaner guarantees that WALs still in a
> replication queue cannot be deleted by the HMaster. The
> ReplicationLogCleaner gets the WAL set from ZooKeeper by scanning the
> replication znodes. But it may get an incomplete WAL set during replication
> failover, because the scan operation is not atomic.
> For example: there are three region servers, rs1, rs2 and rs3, and peer id 10.
> The layout of the replication ZooKeeper nodes is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs2/10/wals
>                      /rs3/10/wals
> {code}
> - t1: the ReplicationLogCleaner finishes scanning the replication queue of
> rs1 and starts to scan the queue of rs2.
> - t2: region server rs3 goes down, and rs1 takes over rs3's replication queue.
> The new layout is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs1/10-rs3/wals
>                      /rs2/10/wals
>                      /rs3
> {code}
> - t3: the ReplicationLogCleaner finishes scanning the queue of rs2 and starts
> to scan the node of rs3. But the queue has already been moved to
> "/hbase/replication/rs/rs1/10-rs3/wals".
> So the ReplicationLogCleaner misses the WALs of rs3 for peer 10, and the
> HMaster may delete these WALs before they are replicated to peer clusters.
> We encountered this problem in our cluster, and I think it is a serious bug
> for replication.
> Suggestions for fixing this bug are welcome. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
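
For readers who want to reproduce the t1/t2/t3 interleaving from the description, here is a minimal in-memory sketch. It is not HBase code: the class ReplicationScanRace, the WAL names and the "version" change counter are all hypothetical, and a plain map stands in for the /hbase/replication/rs znodes. The scanUntilStable method shows one possible mitigation (rescan whenever the tree changed during the scan); it is an illustration only and is not claimed to be what the committed patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the replication znodes (hypothetical, not HBase code). */
class ReplicationScanRace {
    // region server -> queue id -> WAL names; stands in for /hbase/replication/rs
    final Map<String, Map<String, List<String>>> rs = new LinkedHashMap<>();
    // Hypothetical change counter, bumped whenever a queue is moved.
    int version = 0;

    void addQueue(String server, String queueId, String... wals) {
        rs.computeIfAbsent(server, s -> new LinkedHashMap<>())
          .put(queueId, Arrays.asList(wals));
    }

    /** Failover: survivor claims each queue of dead under the id "<queue>-<dead>". */
    void failover(String dead, String survivor) {
        Map<String, List<String>> queues = rs.get(dead);
        for (Map.Entry<String, List<String>> e : queues.entrySet()) {
            rs.get(survivor).put(e.getKey() + "-" + dead, e.getValue());
        }
        queues.clear(); // the dead server's node stays behind, now empty
        version++;
    }

    /**
     * Non-atomic scan: list the servers once, then read them one by one.
     * The action runs right after afterServer has been read, which lets a
     * caller inject a failover between two reads.
     */
    Set<String> scan(String afterServer, Runnable action) {
        Set<String> wals = new LinkedHashSet<>();
        for (String server : new ArrayList<>(rs.keySet())) {
            for (List<String> queue : rs.get(server).values()) {
                wals.addAll(queue);
            }
            if (server.equals(afterServer)) {
                action.run();
            }
        }
        return wals;
    }

    /** Possible mitigation: repeat the scan until the counter did not move. */
    Set<String> scanUntilStable(String afterServer, Runnable action) {
        while (true) {
            int before = version;
            Set<String> wals = scan(afterServer, action);
            if (before == version) {
                return wals;
            }
            action = () -> { }; // the injected failover happens only once
        }
    }

    /** The three-server, peer-10 layout from the description. */
    static ReplicationScanRace threeServers() {
        ReplicationScanRace tree = new ReplicationScanRace();
        tree.addQueue("rs1", "10", "wal-rs1");
        tree.addQueue("rs2", "10", "wal-rs2");
        tree.addQueue("rs3", "10", "wal-rs3");
        return tree;
    }

    public static void main(String[] args) {
        ReplicationScanRace a = threeServers();
        // rs3 fails over to rs1 after rs1 has already been scanned.
        System.out.println("plain scan:  " + a.scan("rs1", () -> a.failover("rs3", "rs1")));

        ReplicationScanRace b = threeServers();
        System.out.println("stable scan: " + b.scanUntilStable("rs1", () -> b.failover("rs3", "rs1")));
    }
}
```

In this model the plain scan returns the WALs of rs1 and rs2 only, because rs3's queue was moved under rs1 after rs1 had been read; a cleaner working from that set would treat wal-rs3 as deletable. The retrying scan sees the counter change, scans again, and picks up wal-rs3 from the 10-rs3 queue.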