[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
----------------------------------
    Attachment: HBASE-18192.master.001.patch
                HBASE-18192.branch-1.001.patch
                HBASE-18192.branch-1.3.003.patch

Uploading patches for master and branch-1.
Also uploading a new patch for branch-1.3 that has an extra check for proper 
cleanup of recovered queues. Please use HBASE-18192.branch-1.3.003.patch for 
branch-1.3. 

> Replication drops recovered queues on region server shutdown
> ------------------------------------------------------------
>
>                 Key: HBASE-18192
>                 URL: https://issues.apache.org/jira/browse/HBASE-18192
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1, 1.2.6
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
>         Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.001.patch, HBASE-18192.branch-1.3.002.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>
> When a recovered queue has only one active ReplicationWorkerThread, the 
> recovered queue is completely dropped on a region server shutdown. This can 
> happen when:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one WAL group 
> for some reason (for example HBASE-18137).
> 3. All other replication workers have died due to unhandled exceptions and 
> only one finishes. In this case the recovered queue gets deleted even 
> without a region server shutdown. This can happen on deployments without the 
> fix for HBASE-17381.
> The problematic piece of code is:
> {code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker threads to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {code}
> The conceptual issue is that isWorkerActive() tells whether a worker is 
> currently running or not, and it is being used as a proxy for whether the 
> worker has finished its work. But, in fact, "Should a worker exit?" and 
> "Has a worker completed its work?" are two different questions.
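
For illustration only, the distinction above can be sketched with two separate flags. This is a minimal standalone sketch with hypothetical names (ReplicationWorkerSketch, workComplete), not the actual HBase code or the attached patch: isWorkerActive() answers "should this worker keep running?", while a separate flag records "did this worker actually drain its queue?", so a shutdown-triggered exit never looks like completed work.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not HBase code: keep "should exit" and
// "work complete" as two independent pieces of state.
class ReplicationWorkerSketch {
    // True while the worker is allowed to keep running ("should exit?" is false).
    private final AtomicBoolean active = new AtomicBoolean(true);
    // True only once the worker has shipped everything it was assigned.
    private final AtomicBoolean workComplete = new AtomicBoolean(false);

    boolean isWorkerActive() {
        return active.get();
    }

    // Signal the worker to exit, e.g. on region server shutdown.
    void stop() {
        active.set(false);
    }

    // Simulate the worker run loop shipping a fixed number of WAL entries.
    void run(int entriesToShip) {
        int shipped = 0;
        while (isWorkerActive() && shipped < entriesToShip) {
            shipped++; // ship one WAL entry
        }
        // Mark the work complete only if the queue was actually drained.
        // An exit caused by stop() leaves workComplete == false, so the
        // recovered queue must NOT be cleaned up by closeRecoveredQueue().
        if (shipped == entriesToShip) {
            workComplete.set(true);
        }
    }

    boolean isWorkComplete() {
        return workComplete.get();
    }
}
```

In this sketch the cleanup decision would check isWorkComplete() on all workers rather than only checking liveness, so a shutdown mid-queue leaves the recovered queue in place for another region server to adopt.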



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
