[ https://issues.apache.org/jira/browse/HBASE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897436#comment-13897436 ]
Hudson commented on HBASE-10482:
--------------------------------

FAILURE: Integrated in HBase-0.94 #1281 (See [https://builds.apache.org/job/HBase-0.94/1281/])
HBASE-10482 ReplicationSyncUp doesn't clean up its ZK, needed for tests (jdcryans: rev 1566855)
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSyncUp.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSyncUpTool.java


> ReplicationSyncUp doesn't clean up its ZK, needed for tests
> -----------------------------------------------------------
>
> Key: HBASE-10482
> URL: https://issues.apache.org/jira/browse/HBASE-10482
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 0.96.1, 0.94.16
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Fix For: 0.98.1, 0.99.0, 0.94.17
>
> Attachments: HBASE-10249.patch
>
>
> TestReplicationSyncUpTool failed again:
> https://builds.apache.org/job/HBase-TRUNK/4895/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationSyncUpTool/testSyncUpTool/
> It's not super obvious why only one of the two tables is replicated, the test
> could use some more logging, but I understand it this way:
> The first ReplicationSyncUp gets started and for some reason it cannot
> replicate the data:
> {noformat}
> 2014-02-06 21:32:19,811 INFO [Thread-1372]
> regionserver.ReplicationSourceManager(203): Current list of replicators:
> [1391722339091.SyncUpTool.replication.org,1234,1,
> quirinus.apache.org,37045,1391722237951,
> quirinus.apache.org,33502,1391722238125] other RSs: []
> 2014-02-06 21:32:19,811 INFO [Thread-1372.replicationSource,1]
> regionserver.ReplicationSource(231): Replicating
> db42e7fc-7f29-4038-9292-d85ea8b9994b -> 783c0ab2-4ff9-4dc0-bb38-86bf31d1d817
> 2014-02-06 21:32:19,892 TRACE [Thread-1372.replicationSource,2]
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 1
> 2014-02-06 21:32:19,911 TRACE [Thread-1372.replicationSource,1]
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 1
> 2014-02-06 21:32:20,094 TRACE [Thread-1372.replicationSource,2]
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 2
> ...
> 2014-02-06 21:32:23,414 TRACE [Thread-1372.replicationSource,1]
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 8
> 2014-02-06 21:32:23,673 INFO [ReplicationExecutor-0]
> replication.ReplicationQueuesZKImpl(169): Moving
> quirinus.apache.org,37045,1391722237951's hlogs to my queue
> 2014-02-06 21:32:23,768 DEBUG [ReplicationExecutor-0]
> replication.ReplicationQueuesZKImpl(396): Creating
> quirinus.apache.org%2C37045%2C1391722237951.1391722243779 with data 10803
> 2014-02-06 21:32:23,842 DEBUG [ReplicationExecutor-0]
> replication.ReplicationQueuesZKImpl(396): Creating
> quirinus.apache.org%2C37045%2C1391722237951.1391722243779 with data 10803
> 2014-02-06 21:32:24,297 TRACE [Thread-1372.replicationSource,2]
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 9
> 2014-02-06 21:32:24,314 TRACE [Thread-1372.replicationSource,1]
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 9
> {noformat}
> Finally it gives up:
> {noformat}
> 2014-02-06 21:32:30,873 DEBUG [Thread-1372]
> replication.TestReplicationSyncUpTool(323): SyncUpAfterDelete failed at retry
> = 0, with rowCount_ht1TargetPeer1 =100 and rowCount_ht2TargetAtPeer1 =200
> {noformat}
> The syncUp tool has an ID you can follow, grep for
> syncupReplication1391722338885 or just the timestamp, and you can see it
> doing things after that. The reason is that the tool closes the
> ReplicationSourceManager but not the ZK connection, so events _still_ come in
> and NodeFailoverWorker _still_ tries to recover queues but then there's
> nothing to process them.
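
To make the failure mode described above concrete, here is a minimal, self-contained Java sketch. It does not use HBase or ZooKeeper classes: `FakeZkConnection`, `FakeSourceManager`, `run` and the server name `rs-2` are all hypothetical stand-ins invented for illustration, and the real ReplicationSyncUp code is likely structured differently. The only point it tries to show is that closing the manager while leaving the event source open lets a late event claim a queue that nothing will ever process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SyncUpShutdownSketch {

    /** Hypothetical stand-in for a ZooKeeper connection that delivers watch events. */
    static final class FakeZkConnection implements AutoCloseable {
        private Consumer<String> watcher = event -> { };
        private boolean open = true;

        void register(Consumer<String> watcher) {
            this.watcher = watcher;
        }

        /** Simulates a notification that a region server's node went away. */
        void fireServerDied(String server) {
            if (open) {
                watcher.accept(server);
            }
        }

        @Override
        public void close() {
            open = false;
        }
    }

    /** Hypothetical stand-in for the source manager plus its failover worker. */
    static final class FakeSourceManager implements AutoCloseable {
        final List<String> processed = new ArrayList<>();
        final List<String> orphaned = new ArrayList<>();
        private boolean open = true;

        /** Claims the dead server's queue; only an open manager can replicate it. */
        void onServerDied(String server) {
            if (open) {
                processed.add(server);
            } else {
                orphaned.add(server);
            }
        }

        @Override
        public void close() {
            open = false;
        }
    }

    /** Shuts the tool down, then simulates a late event; returns the orphaned queues. */
    public static List<String> run(boolean closeZk) {
        FakeZkConnection zk = new FakeZkConnection();
        FakeSourceManager manager = new FakeSourceManager();
        zk.register(manager::onServerDied);

        manager.close();
        if (closeZk) {
            zk.close();
        }

        // Event arriving after the tool considers itself finished.
        zk.fireServerDied("rs-2");
        return manager.orphaned;
    }

    public static void main(String[] args) {
        System.out.println("manager closed only: orphaned=" + run(false));
        System.out.println("manager and ZK closed: orphaned=" + run(true));
    }
}
```

In this toy model, closing only the manager leaves `rs-2` in the orphaned list, while closing the connection as well means the late event is never delivered, so the queue stays available for a live region server to pick up.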
> Later in the logs you can see:
> {noformat}
> 2014-02-06 21:32:37,381 INFO [ReplicationExecutor-0]
> replication.ReplicationQueuesZKImpl(169): Moving
> quirinus.apache.org,33502,1391722238125's hlogs to my queue
> 2014-02-06 21:32:37,567 INFO [ReplicationExecutor-0]
> replication.ReplicationQueuesZKImpl(239): Won't transfer the queue, another
> RS took care of it because of: KeeperErrorCode = NoNode for
> /1/replication/rs/quirinus.apache.org,33502,1391722238125/lock
> {noformat}
> There shouldn't be any racing, but now someone already moved
> "quirinus.apache.org,33502,1391722238125" away.
> FWIW I can't even make the test fail on my machine so I'm not 100% sure
> closing the ZK connection fixes the issue, but at least it's the right thing
> to do.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)