[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045400#comment-16045400 ]
Hudson commented on HBASE-18192:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.2-JDK7 #151 (See [https://builds.apache.org/job/HBase-1.2-JDK7/151/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 96e48c3df597fc1450546818e2bd34cfc1fd5c10)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java


> Replication drops recovered queues on region server shutdown
> ------------------------------------------------------------
>
>                 Key: HBASE-18192
>                 URL: https://issues.apache.org/jira/browse/HBASE-18192
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1, 1.2.6
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
>
>         Attachments: HBASE-18192.branch-1.001.patch, HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>
> When a recovered queue has only one active ReplicationWorkerThread, the recovered queue is completely dropped on region server shutdown. This will happen in situations when:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one WAL group for some reason (for example, HBASE-18137).
> 3. All other replication workers have died due to an unhandled exception and only one finishes. This will cause the recovered queue to get deleted without a region server shutdown. It can happen on deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker threads to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {code}
> The conceptual issue is that isWorkerActive() tells whether a worker is currently running, and it is being used as a proxy for whether the worker has finished its work. But, in fact, "Should a worker exit?" and "Has a worker completed its work?" are two different questions.


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
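A minimal standalone sketch of the distinction the last paragraph draws (hypothetical class and method names, not the actual HBase patch): each worker keeps "asked to stop" and "work complete" as separate flags, so the thread that performs the final cleanup can check completion rather than mere liveness. A worker stopped by shutdown exits without ever reporting its queue as done.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: "should the worker exit?" and "has the worker
// completed its work?" are tracked independently, unlike the buggy code
// above, where exiting the run loop was treated as completion.
class Worker {
    private final AtomicBoolean shouldRun = new AtomicBoolean(true);
    private final AtomicBoolean workComplete = new AtomicBoolean(false);

    // Called on region server shutdown: asks the worker to exit early.
    void requestStop() { shouldRun.set(false); }

    // "Should the worker exit?" -- false after a stop request
    // or once the work is done.
    boolean isActive() { return shouldRun.get() && !workComplete.get(); }

    // "Has the worker completed its work?" -- true only when the queue
    // was fully drained, never merely because the worker was stopped.
    boolean isWorkComplete() { return workComplete.get(); }

    void run(int entriesInQueue) {
        int shipped = 0;
        while (isActive() && shipped < entriesInQueue) {
            shipped++; // stand-in for shipping one WAL entry
        }
        if (shipped == entriesInQueue) {
            workComplete.set(true);
        }
        // Cleanup decision for a recovered queue would key off
        // isWorkComplete(), not off the run loop having exited.
    }
}

public class Main {
    public static void main(String[] args) {
        Worker finished = new Worker();
        finished.run(5);                  // drains its whole queue
        Worker interrupted = new Worker();
        interrupted.requestStop();        // shutdown before it starts
        interrupted.run(5);
        System.out.println(finished.isWorkComplete());    // true
        System.out.println(interrupted.isWorkComplete()); // false
    }
}
```

With the two questions separated, a shutdown-interrupted worker leaves the recovered queue intact for another region server to adopt, instead of letting the last thread to exit delete it.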