[jira] [Commented] (HBASE-12865) WALs may be deleted before they are replicated to peers

Hudson (JIRA) Fri, 07 Aug 2015 21:12:07 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662793#comment-14662793
 ]


Hudson commented on HBASE-12865:
--------------------------------

FAILURE: Integrated in HBase-1.2-IT #80 (See 
[https://builds.apache.org/job/HBase-1.2-IT/80/])
HBASE-12865 WALs may be deleted before they are replicated to peers (He 
Liangliang) (apurtell: rev 7abb12be26115eda7341b82b9860990a14bc6040)
* hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationQueuesClient.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationStateBasic.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/master/ReplicationLogCleaner.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationQueuesZKImpl.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationQueuesClientZKImpl.java


> WALs may be deleted before they are replicated to peers
> -------------------------------------------------------
>
>                 Key: HBASE-12865
>                 URL: https://issues.apache.org/jira/browse/HBASE-12865
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Liu Shaohui
>            Assignee: He Liangliang
>            Priority: Critical
>             Fix For: 2.0.0, 0.98.14, 1.0.2, 1.2.0, 1.1.2, 1.3.0
>
>         Attachments: HBASE-12865-V1.diff, HBASE-12865-V2.diff
>
>
> By design, ReplicationLogCleaner guarantees that WALs still in a 
> replication queue cannot be deleted by the HMaster. The 
> ReplicationLogCleaner gets the WAL set from ZooKeeper by scanning the 
> replication znode. But it may get an incomplete WAL set during replication 
> failover, because the scan operation is not atomic.
> For example, suppose there are three region servers (rs1, rs2, rs3) and a 
> peer with id 10. The layout of the replication ZooKeeper nodes is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs2/10/wals
>                      /rs3/10/wals
> {code}
> - t1: the ReplicationLogCleaner finishes scanning the replication queue of 
> rs1 and starts to scan the queue of rs2.
> - t2: region server rs3 goes down, and rs1 takes over rs3's replication queue. 
> The new layout is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs1/10-rs3/wals
>                      /rs2/10/wals
>                      /rs3
> {code}
> - t3: the ReplicationLogCleaner finishes scanning the queue of rs2 and starts 
> to scan the node of rs3. But the queue has already been moved to 
> "/hbase/replication/rs/rs1/10-rs3/wals".
> So the ReplicationLogCleaner will miss the WALs of rs3 for peer 10, and the 
> HMaster may delete these WALs before they are replicated to peer clusters.
> We encountered this problem in our cluster, and I think it's a serious bug for 
> replication.
> Suggestions to fix this bug are welcome. thx~
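
The t1/t2/t3 timeline in the description can be reproduced with a small in-memory model. The sketch below is only an illustration under stated assumptions: `ScanRaceSketch`, its methods, and the WAL names such as `rs3-wal` are hypothetical, a nested map stands in for the `/hbase/replication/rs` znode tree, and none of it is HBase or ZooKeeper API. It shows a server-by-server scan missing a queue that moves to an already-visited server in the middle of the scan.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of the race; not HBase code.
final class ScanRaceSketch {

    // Stand-in for /hbase/replication/rs: server -> queue id -> WAL names.
    // Each server starts with one queue for peer "10" holding one made-up WAL.
    static Map<String, Map<String, List<String>>> initialLayout() {
        Map<String, Map<String, List<String>>> rs = new TreeMap<>();
        for (String server : new String[] {"rs1", "rs2", "rs3"}) {
            Map<String, List<String>> queues = new TreeMap<>();
            queues.put("10", new ArrayList<>(Collections.singletonList(server + "-wal")));
            rs.put(server, queues);
        }
        return rs;
    }

    // Models t2: rs1 takes over rs3's queue as "10-rs3"; rs3's node is left empty.
    static void failover(Map<String, Map<String, List<String>>> rs) {
        List<String> moved = rs.get("rs3").remove("10");
        rs.get("rs1").put("10-rs3", moved);
    }

    // Non-atomic scan: visits servers one at a time. The callback runs after the
    // first server has been scanned, which is where t2 falls in the timeline.
    static Set<String> scan(Map<String, Map<String, List<String>>> rs,
                            Runnable afterFirstServer) {
        Set<String> seen = new LinkedHashSet<>();
        boolean first = true;
        // Snapshot the server list up front, like listing children once.
        for (String server : new ArrayList<>(rs.keySet())) {
            for (List<String> wals : rs.get(server).values()) {
                seen.addAll(wals);
            }
            if (first) {
                afterFirstServer.run();
                first = false;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<String>>> rs = initialLayout();
        Set<String> seen = scan(rs, () -> failover(rs));
        System.out.println("WALs seen by the scan: " + seen);
        System.out.println("rs3-wal protected: " + seen.contains("rs3-wal"));
    }
}
```

In this model the scan has already passed rs1 when the queue arrives there as `10-rs3`, and rs3 is empty by the time it is visited, so `rs3-wal` never enters the set of WALs the cleaner would treat as still queued.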



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
