[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ashu Pachauri updated HBASE-18192:
----------------------------------
    Attachment: HBASE-18192.master.001.patch
                HBASE-18192.branch-1.001.patch
                HBASE-18192.branch-1.3.003.patch

Uploading patches for master and branch-1. Also uploading a new patch for branch-1.3 that adds an extra check for proper cleanup of recovered queues. Please use HBASE-18192.branch-1.3.003.patch for branch-1.3.

> Replication drops recovered queues on region server shutdown
> -------------------------------------------------------------
>
>                 Key: HBASE-18192
>                 URL: https://issues.apache.org/jira/browse/HBASE-18192
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1, 1.2.6
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
>         Attachments: HBASE-18192.branch-1.001.patch, HBASE-18192.branch-1.3.001.patch, HBASE-18192.branch-1.3.002.patch, HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>
> When a recovered queue has only one active ReplicationSourceWorkerThread, the recovered queue is completely dropped on a region server shutdown. This happens when:
> 1. DefaultWALProvider is used, or
> 2. RegionGroupingProvider is used but replication is stuck on one WAL group for some reason (for example, HBASE-18137), or
> 3. all other replication workers have died due to unhandled exceptions and only the last one finishes normally; in that case the recovered queue is deleted even without a region server shutdown. This can happen on deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker threads to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {code}
> The conceptual issue is that isWorkerActive() tells whether a worker is currently running, but it is used as a proxy for whether the worker has finished its work. In fact, "Should a worker exit?" and "Has a worker completed its work?" are two different questions: a worker that exits because of a shutdown request has not finished replicating its queue, yet the check above treats it the same as one that has.
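> For illustration only, a minimal self-contained sketch of that distinction in plain Java (the names RecoveredQueueSketch, Worker, shouldRun, and workDone are hypothetical and not from the attached patches). A worker that is asked to stop exits its loop just like a worker that drained its queue, so "no thread is alive" must not be read as "all work is done":
> {code}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> public class RecoveredQueueSketch {
>   static class Worker extends Thread {
>     final AtomicBoolean shouldRun = new AtomicBoolean(true);  // "should this worker keep going?"
>     final AtomicBoolean workDone = new AtomicBoolean(false);  // "has this worker drained its queue?"
>     final int entriesToShip;
>
>     Worker(int entriesToShip) { this.entriesToShip = entriesToShip; }
>
>     @Override
>     public void run() {
>       int shipped = 0;
>       // Stand-in for the worker run loop: ship one entry per iteration.
>       while (shouldRun.get() && shipped < entriesToShip) {
>         shipped++;
>         try { Thread.sleep(1); } catch (InterruptedException e) { return; }
>       }
>       if (shipped == entriesToShip) {
>         workDone.set(true); // set only when the queue is truly drained
>       }
>     }
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     List<Worker> workers = new ArrayList<>();
>     workers.add(new Worker(10));                // finishes its queue quickly
>     workers.add(new Worker(Integer.MAX_VALUE)); // still busy at shutdown
>     for (Worker w : workers) w.start();
>
>     Thread.sleep(200);
>     workers.get(1).shouldRun.set(false); // simulate a region server shutdown request
>     for (Worker w : workers) w.join();
>
>     // Buggy check (what the snippet above effectively does): every thread
>     // has exited, so the recovered queue would be dropped.
>     boolean allExited = workers.stream().noneMatch(Thread::isAlive);
>     // Correct check: completion is tracked explicitly per worker.
>     boolean allDone = workers.stream().allMatch(w -> w.workDone.get());
>     System.out.println("allExited=" + allExited + ", allDone=" + allDone);
>     // Prints allExited=true, allDone=false: the queue must be kept.
>   }
> }
> {code}
> The point of the sketch is that the decision to call closeRecoveredQueue() should be based on explicit per-worker completion state rather than on thread liveness.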