[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045453#comment-16045453
 ] 

Hudson commented on HBASE-18192:


FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #3168 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/3168/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 
eb2dc5d2a524f816fc5cf707b853117bc6ada01a)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSourceShipperThread.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipperThread.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSource.java


> Replication drops recovered queues on region server shutdown
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.3.1, 1.2.6
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Priority: Blocker
> Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
>
> Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
> When a recovered queue has only one active ReplicationWorkerThread, the 
> recovered queue is completely dropped on a region server shutdown. This 
> happens in any of the following situations:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one 
> WAL group for some reason (for example HBASE-18137).
> 3. All other replication workers have died due to an unhandled exception, 
> and the only remaining one finishes. This causes the recovered queue to be 
> deleted even without a region server shutdown. This can happen on 
> deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {Code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     // wait a short while for other worker thread to fully exit
>     Threads.sleep(100);
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>         + " with the following stats: " + getStats());
>     }
>   }
> }
> {Code}
> The conceptual issue is that isWorkerActive() tells whether a worker is 
> currently running or not, and it is being used as a proxy for whether a 
> worker has finished its work. But, in fact, "Should a worker exit?" and 
> "Has a worker completed its work?" are two different questions.
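
The distinction drawn in the description can be illustrated with a minimal, self-contained Java sketch. It is not the committed patch: every name in it (RecoveredQueueSketch, Worker, WorkerState, canCloseRecoveredQueue) is hypothetical and does not come from the HBase code base. The assumption is that each worker records why it left its run loop, so that cleanup can ask "has every worker finished its work?" instead of "is any other worker still alive?".

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names are taken from HBase. */
public class RecoveredQueueSketch {

    /** Separates "still running", "told to exit", and "work completed". */
    enum WorkerState { RUNNING, STOPPED, FINISHED }

    /** Minimal stand-in for a replication worker that drains one WAL group. */
    static final class Worker {
        private int pendingEntries;
        private WorkerState state = WorkerState.RUNNING;

        Worker(int pendingEntries) {
            this.pendingEntries = pendingEntries;
        }

        /** Ships one entry; becomes FINISHED only when nothing is left. */
        void shipOne() {
            if (state != WorkerState.RUNNING) {
                return;
            }
            if (pendingEntries > 0) {
                pendingEntries--;
            }
            if (pendingEntries == 0) {
                state = WorkerState.FINISHED;
            }
        }

        /** Shutdown path: the worker exits without claiming completion. */
        void stop() {
            if (state == WorkerState.RUNNING) {
                state = WorkerState.STOPPED;
            }
        }

        /** Answers "should the run loop continue?". */
        boolean isActive() {
            return state == WorkerState.RUNNING;
        }

        /** Answers "has this worker completed its work?". */
        boolean isFinished() {
            return state == WorkerState.FINISHED;
        }
    }

    /** The queue may be closed only if every worker drained its entries. */
    static boolean canCloseRecoveredQueue(List<Worker> workers) {
        for (Worker worker : workers) {
            if (!worker.isFinished()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Worker drained = new Worker(1);
        drained.shipOne();

        Worker interrupted = new Worker(5);
        interrupted.shipOne();
        interrupted.stop(); // simulated region server shutdown

        // prints "true": the only worker finished its work
        System.out.println(canCloseRecoveredQueue(Arrays.asList(drained)));
        // prints "false": one worker exited with entries still pending
        System.out.println(canCloseRecoveredQueue(Arrays.asList(drained, interrupted)));
    }
}
```

Under these assumptions, a worker that is stopped by a shutdown while entries are still pending is inactive but never FINISHED, so the recovered queue would be left in place rather than deleted.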



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045414#comment-16045414
 ] 

Hudson commented on HBASE-18192:


SUCCESS: Integrated in Jenkins build HBase-1.3-JDK7 #181 (See 
[https://builds.apache.org/job/HBase-1.3-JDK7/181/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 
6a216c787a6099dfd90f7733d574069ea866a708)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java




[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045413#comment-16045413
 ] 

Hudson commented on HBASE-18192:


SUCCESS: Integrated in Jenkins build HBase-1.3-JDK8 #195 (See 
[https://builds.apache.org/job/HBase-1.3-JDK8/195/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 
6a216c787a6099dfd90f7733d574069ea866a708)
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java




[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045406#comment-16045406
 ] 

Hudson commented on HBASE-18192:


FAILURE: Integrated in Jenkins build HBase-2.0 #19 (See 
[https://builds.apache.org/job/HBase-2.0/19/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 
1aedc07b528876111bfd80cd7de799358144dbb5)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSourceShipperThread.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipperThread.java




[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045405#comment-16045405
 ] 

Hudson commented on HBASE-18192:


SUCCESS: Integrated in Jenkins build HBase-1.4 #770 (See 
[https://builds.apache.org/job/HBase-1.4/770/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 
6e3da5a39a21c75de5d0dff9edbe767232a20310)
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java




[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045403#comment-16045403
 ] 

Hudson commented on HBASE-18192:


SUCCESS: Integrated in Jenkins build HBase-1.2-JDK8 #147 (See 
[https://builds.apache.org/job/HBase-1.2-JDK8/147/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 
96e48c3df597fc1450546818e2bd34cfc1fd5c10)
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java




[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045400#comment-16045400
 ] 

Hudson commented on HBASE-18192:


SUCCESS: Integrated in Jenkins build HBase-1.2-JDK7 #151 (See 
[https://builds.apache.org/job/HBase-1.2-JDK7/151/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 
96e48c3df597fc1450546818e2bd34cfc1fd5c10)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java




[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045354#comment-16045354
 ] 

Hudson commented on HBASE-18192:


SUCCESS: Integrated in Jenkins build HBase-1.2-IT #884 (See 
[https://builds.apache.org/job/HBase-1.2-IT/884/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 
96e48c3df597fc1450546818e2bd34cfc1fd5c10)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java




[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045349#comment-16045349
 ] 

Hudson commented on HBASE-18192:


SUCCESS: Integrated in Jenkins build HBase-1.3-IT #63 (See 
[https://builds.apache.org/job/HBase-1.3-IT/63/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 
6a216c787a6099dfd90f7733d574069ea866a708)
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java




[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045338#comment-16045338
 ] 

Hadoop QA commented on HBASE-18192:
---

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 21s | Docker mode activated. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 1s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 3m 58s | master passed |
| +1 | compile | 0m 42s | master passed |
| +1 | checkstyle | 0m 50s | master passed |
| +1 | mvneclipse | 0m 17s | master passed |
| +1 | findbugs | 1m 55s | master passed |
| +1 | javadoc | 0m 28s | master passed |
| +1 | mvninstall | 0m 44s | the patch passed |
| +1 | compile | 0m 38s | the patch passed |
| +1 | javac | 0m 38s | the patch passed |
| +1 | checkstyle | 0m 48s | the patch passed |
| +1 | mvneclipse | 0m 15s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 29m 1s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. |
| +1 | findbugs | 1m 59s | the patch passed |
| +1 | javadoc | 0m 27s | the patch passed |
| -1 | unit | 112m 52s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
| | | 155m 54s | |

|| Reason || Tests ||
| Timed out junit tests | org.apache.hadoop.hbase.coprocessor.TestCoprocessorMetrics |

|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:757bf37 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12872383/HBASE-18192.master.001.patch |
| JIRA Issue | HBASE-18192 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 26361e580bba 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / e5ea457 |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/7164/artifact/patchprocess/patch-unit-hbase-server.txt |
| unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/7164/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/7164/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/7164/console |
| Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |


This message was automatically generated.



> Replication drops recovered queues on region server shutdown
> ---

[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044537#comment-16044537
 ] 

Ted Yu commented on HBASE-18192:


Mind attaching a patch for the master branch?

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.3.1, 1.2.6
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18192.branch-1.3.001.patch, 
> HBASE-18192.branch-1.3.002.patch
>
>
> When a recovered queue has only one active ReplicationWorkerThread, the 
> recovered queue is completely dropped on a region server shutdown. This 
> happens in the following situations:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one WAL 
> group for some reason (for example HBASE-18137).
> 3. All other replication workers have died due to an unhandled exception, 
> and the only remaining one finishes. This will cause the recovered queue to 
> get deleted without a regionserver shutdown. This can happen on deployments 
> without the fix for HBASE-17381.
> The problematic piece of code is:
> {Code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     // wait a short while for other worker threads to fully exit
>     Threads.sleep(100);
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {Code}
> The conceptual issue is that isWorkerActive() tells whether a worker is 
> currently running, and it is being used as a proxy for whether the worker 
> has finished its work. But, in fact, "Should a worker exit?" and "Has a 
> worker completed its work?" are two different questions.
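
The distinction drawn in the description can be illustrated with a minimal, single-threaded sketch. It is not the code from the attached patches: the class RecoveredQueueSketch, the Worker type, and the methods run, hasFinished and canCloseRecoveredQueue are all hypothetical names chosen for illustration. The idea is that a worker records completion in its own flag, which is set only once every entry has been shipped, so a worker that merely stopped running on shutdown is never mistaken for one that drained the queue.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these names are illustrative and not from the HBase patch.
class RecoveredQueueSketch {

  /** A worker that ships a fixed number of entries from a recovered queue. */
  static final class Worker {
    private final int totalEntries;
    private int shipped = 0;
    private boolean finished = false;

    Worker(int totalEntries) {
      this.totalEntries = totalEntries;
    }

    /**
     * Ships entries one at a time. A shutdown is simulated by asking the
     * worker to exit after stepsBeforeShutdown entries.
     */
    void run(int stepsBeforeShutdown) {
      int steps = 0;
      while (shipped < totalEntries) {
        if (steps >= stepsBeforeShutdown) {
          return; // told to exit: the loop ends, but the work is not done
        }
        shipped++;
        steps++;
      }
      finished = true; // set only when every entry has been shipped
    }

    boolean hasFinished() {
      return finished;
    }
  }

  /** The recovered queue may be closed only if every worker finished its work. */
  static boolean canCloseRecoveredQueue(List<Worker> workers) {
    for (Worker worker : workers) {
      if (!worker.hasFinished()) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Worker interrupted = new Worker(5);
    interrupted.run(2);  // shutdown arrives after 2 of 5 entries
    Worker drained = new Worker(5);
    drained.run(10);     // no shutdown before the queue is drained

    System.out.println(canCloseRecoveredQueue(Arrays.asList(interrupted)));
    System.out.println(canCloseRecoveredQueue(Arrays.asList(drained)));
  }
}
```

In this sketch the interrupted worker leaves the queue in place, while the drained worker allows it to be closed. A check based only on whether the worker is still running would treat both cases the same way.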



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-08 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043821#comment-16043821
 ] 

Ted Yu commented on HBASE-18192:


lgtm






[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043798#comment-16043798
 ] 

Hadoop QA commented on HBASE-18192:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 4s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 19s {color} | {color:green} branch-1.3 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} branch-1.3 passed with JDK v1.8.0_131 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} branch-1.3 passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 59s {color} | {color:green} branch-1.3 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 19s {color} | {color:green} branch-1.3 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 5s {color} | {color:red} hbase-server in branch-1.3 has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s {color} | {color:green} branch-1.3 passed with JDK v1.8.0_131 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s {color} | {color:green} branch-1.3 passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s {color} | {color:green} the patch passed with JDK v1.8.0_131 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s {color} | {color:green} the patch passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 38s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 58s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 17m 51s {color} | {color:green} The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.8.0_131 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s {color} | {color:green} the patch passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 89m 20s {color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 124m 2s {color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:9ba21e3 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12872157/HBASE-18192.branch-1.3.002.patch |
| JIRA Issue | HBASE-18192 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 898c81d294f9 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1.3 / 4227757 |
| Default Java | 1.7.0_131 |
| Multi-JDK ve

[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-07 Thread Ashu Pachauri (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042270#comment-16042270
 ] 

Ashu Pachauri commented on HBASE-18192:
---

Uploading a patch for branch-1.3 for review. I'll upload patches for the other 
branches after review, because the codebase has diverged and the branch-1.3 
patch won't apply cleanly.

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.6
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18192.branch-1.3.001.patch


