[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045400#comment-16045400 ]
Hudson commented on HBASE-18192:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.2-JDK7 #151 (See [https://builds.apache.org/job/HBase-1.2-JDK7/151/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 96e48c3df597fc1450546818e2bd34cfc1fd5c10)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java


> Replication drops recovered queues on region server shutdown
> ------------------------------------------------------------
>
>                 Key: HBASE-18192
>                 URL: https://issues.apache.org/jira/browse/HBASE-18192
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1, 1.2.6
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
>
>         Attachments: HBASE-18192.branch-1.001.patch, HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>
> When a recovered queue has only one active ReplicationWorkerThread, the recovered queue is completely dropped on region server shutdown. This will happen in situations when:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one WAL group for some reason (for example, HBASE-18137).
> 3. All other replication workers have died due to an unhandled exception and only one finishes. This will cause the recovered queue to get deleted without a region server shutdown. It can happen on deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker threads to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {code}
> The conceptual issue is that isWorkerActive() tells whether a worker is currently running, and it is being used as a proxy for whether the worker has finished its work. But, in fact, "Should a worker exit?" and "Has a worker completed its work?" are two different questions.


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
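A minimal standalone sketch of the distinction the last paragraph draws (hypothetical class and method names, not the actual HBase patch): each worker keeps "asked to stop" and "work complete" as separate flags, so the thread that performs the final cleanup can check completion rather than mere liveness. A worker stopped by shutdown exits without ever reporting its queue as done.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: "should the worker exit?" and "has the worker
// completed its work?" are tracked independently, unlike the buggy code
// above, where exiting the run loop was treated as completion.
class Worker {
    private final AtomicBoolean shouldRun = new AtomicBoolean(true);
    private final AtomicBoolean workComplete = new AtomicBoolean(false);

    // Called on region server shutdown: asks the worker to exit early.
    void requestStop() { shouldRun.set(false); }

    // "Should the worker exit?" -- false after a stop request
    // or once the work is done.
    boolean isActive() { return shouldRun.get() && !workComplete.get(); }

    // "Has the worker completed its work?" -- true only when the queue
    // was fully drained, never merely because the worker was stopped.
    boolean isWorkComplete() { return workComplete.get(); }

    void run(int entriesInQueue) {
        int shipped = 0;
        while (isActive() && shipped < entriesInQueue) {
            shipped++; // stand-in for shipping one WAL entry
        }
        if (shipped == entriesInQueue) {
            workComplete.set(true);
        }
        // Cleanup decision for a recovered queue would key off
        // isWorkComplete(), not off the run loop having exited.
    }
}

public class Main {
    public static void main(String[] args) {
        Worker finished = new Worker();
        finished.run(5);                  // drains its whole queue
        Worker interrupted = new Worker();
        interrupted.requestStop();        // shutdown before it starts
        interrupted.run(5);
        System.out.println(finished.isWorkComplete());    // true
        System.out.println(interrupted.isWorkComplete()); // false
    }
}
```

With the two questions separated, a shutdown-interrupted worker leaves the recovered queue intact for another region server to adopt, instead of letting the last thread to exit delete it.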