Ashu Pachauri created HBASE-18192:
-------------------------------------
Summary: Replication drops recovered queues on region server shutdown
Key: HBASE-18192
URL: https://issues.apache.org/jira/browse/HBASE-18192
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 1.2.6, 1.3.1, 2.0.0, 1.4.0
Reporter: Ashu Pachauri
Assignee: Ashu Pachauri
Priority: Blocker
Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
When a recovered queue has only one active ReplicationWorkerThread, the
recovered queue is dropped entirely on a region server shutdown. This
happens in any of the following situations:
1. DefaultWALProvider is used.
2. RegionGroupingProvider is used but replication is stuck on one WAL
group for some reason (for example HBASE-18137).
3. All other replication workers have died due to an unhandled exception, and
the only remaining one finishes. In this case the recovered queue is deleted
even without a regionserver shutdown. This can happen on deployments without
the fix for HBASE-17381.
The problematic piece of code is:
{Code}
while (isWorkerActive()) {
  // The worker thread run loop...
}
if (replicationQueueInfo.isQueueRecovered()) {
  // use synchronize to make sure one last thread will clean the queue
  synchronized (workerThreads) {
    Threads.sleep(100); // wait a short while for other worker thread to fully exit
    boolean allOtherTaskDone = true;
    for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
      if (!worker.equals(this) && worker.isAlive()) {
        allOtherTaskDone = false;
        break;
      }
    }
    if (allOtherTaskDone) {
      manager.closeRecoveredQueue(this.source);
      LOG.info("Finished recovering queue " + peerClusterZnode
          + " with the following stats: " + getStats());
    }
  }
}
{Code}
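The conceptual issue is that isWorkerActive() tells whether a worker is
currently running or not, and it is being used as a proxy for whether a worker
has finished its work. But, in fact, "Should a worker exit?" and "Has a worker
completed its work?" are two different questions. On shutdown the run loop
exits because the worker was told to stop, not because the queue is empty, yet
the cleanup code still closes the recovered queue.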
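
One way to picture the distinction is to track the two questions with separate
flags. The sketch below is not HBase code and not the committed fix: the
Worker class, its fields, and canCloseRecoveredQueue() are hypothetical names
used only for illustration, and the run loop is reduced to a counter so that
it runs deterministically on a single thread.

```java
import java.util.Arrays;
import java.util.List;

final class RecoveredQueueSketch {

    /** Hypothetical stand-in for a replication worker; not the HBase class. */
    static final class Worker {
        private volatile boolean active = true;    // "should the worker keep running?"
        private volatile boolean workDone = false; // "has the worker drained its queue?"
        private int pendingEntries;

        Worker(int pendingEntries) {
            this.pendingEntries = pendingEntries;
        }

        /** Models a region server shutdown asking the worker to exit. */
        void stop() {
            active = false;
        }

        /** Simplified run loop: ship entries until stopped or drained. */
        void run() {
            while (active) {
                if (pendingEntries == 0) {
                    workDone = true; // set only when the queue is really empty
                    return;
                }
                pendingEntries--; // pretend one entry was replicated
            }
        }

        boolean isWorkDone() {
            return workDone;
        }
    }

    /** The recovered queue may be deleted only if every worker drained its share. */
    static boolean canCloseRecoveredQueue(List<Worker> workers) {
        for (Worker worker : workers) {
            if (!worker.isWorkDone()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Worker drained = new Worker(3);
        drained.run();

        Worker interrupted = new Worker(3);
        interrupted.stop(); // shutdown arrives before any entry is shipped
        interrupted.run();

        System.out.println(canCloseRecoveredQueue(Arrays.asList(drained)));
        System.out.println(canCloseRecoveredQueue(Arrays.asList(drained, interrupted)));
    }
}
```

In this sketch a worker that exits because of stop() leaves workDone false, so
a check based on it would keep the recovered queue, whereas a check based only
on whether the thread is still alive would delete it.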