[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045453#comment-16045453 ]

Hudson commented on HBASE-18192:

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #3168 (See [https://builds.apache.org/job/HBase-Trunk_matrix/3168/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev eb2dc5d2a524f816fc5cf707b853117bc6ada01a)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSourceShipperThread.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipperThread.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSource.java

> Replication drops recovered queues on region server shutdown
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.3.1, 1.2.6
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Priority: Blocker
> Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
> Attachments: HBASE-18192.branch-1.001.patch, HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
> When a recovered queue has only one active ReplicationWorkerThread, the recovered queue is completely dropped on a region server shutdown. This will happen when:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one WAL group for some reason (for example HBASE-18137).
> 3. All other replication workers have died due to an unhandled exception, and the only remaining one finishes. This will cause the recovered queue to be deleted without a region server shutdown. This can happen on deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker threads to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {code}
> The conceptual issue is that isWorkerActive() tells whether a worker is currently running, and it is being used as a proxy for whether the worker has finished its work. But "Should a worker exit?" and "Has it completed its work?" are two different questions.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
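The conceptual fix described above can be sketched as tracking queue completion separately from worker liveness. The following is a minimal, hypothetical illustration (class and method names such as Worker, workerDone, and safeToDeleteQueue are invented for this sketch and are not the actual HBase classes): a worker marks itself done only after shipping its entire queue, and cleanup checks that flag instead of whether the thread has merely stopped running.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of separating "should this worker exit?"
// from "has this worker completed its work?".
public class RecoveredQueueCleanup {
    static class Worker {
        // "Should this worker exit?" -- cleared on shutdown or error.
        final AtomicBoolean active = new AtomicBoolean(true);
        // "Has this worker finished its queue?" -- set only after the
        // last entry has been shipped, never inferred from liveness.
        final AtomicBoolean workerDone = new AtomicBoolean(false);

        void run(int entriesToShip) {
            int shipped = 0;
            while (active.get() && shipped < entriesToShip) {
                shipped++; // ship one WAL entry (stubbed out here)
            }
            if (shipped == entriesToShip) {
                workerDone.set(true); // genuinely finished the queue
            }
            // If the loop exited because active was cleared (shutdown),
            // workerDone stays false and the queue must survive.
        }
    }

    // The recovered queue may be removed only when every worker has
    // actually completed its work, not merely stopped running.
    static boolean safeToDeleteQueue(Map<String, Worker> workers) {
        return workers.values().stream().allMatch(w -> w.workerDone.get());
    }

    public static void main(String[] args) {
        Map<String, Worker> workers = new ConcurrentHashMap<>();
        Worker w = new Worker();
        workers.put("wal-group-0", w);

        // Simulate a region server shutdown before the worker finishes.
        w.active.set(false);
        w.run(10); // exits immediately; 0 of 10 entries shipped

        System.out.println(safeToDeleteQueue(workers)); // prints "false"
    }
}
```

With the original liveness-based check, the same shutdown scenario would report the dead worker as "done" and delete the queue; here the unset workerDone flag keeps the recovered queue alive for the next claimant.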
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045414#comment-16045414 ]

Hudson commented on HBASE-18192:

SUCCESS: Integrated in Jenkins build HBase-1.3-JDK7 #181 (See [https://builds.apache.org/job/HBase-1.3-JDK7/181/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 6a216c787a6099dfd90f7733d574069ea866a708)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045413#comment-16045413 ]

Hudson commented on HBASE-18192:

SUCCESS: Integrated in Jenkins build HBase-1.3-JDK8 #195 (See [https://builds.apache.org/job/HBase-1.3-JDK8/195/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 6a216c787a6099dfd90f7733d574069ea866a708)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045406#comment-16045406 ]

Hudson commented on HBASE-18192:

FAILURE: Integrated in Jenkins build HBase-2.0 #19 (See [https://builds.apache.org/job/HBase-2.0/19/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 1aedc07b528876111bfd80cd7de799358144dbb5)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSourceShipperThread.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipperThread.java
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045405#comment-16045405 ]

Hudson commented on HBASE-18192:

SUCCESS: Integrated in Jenkins build HBase-1.4 #770 (See [https://builds.apache.org/job/HBase-1.4/770/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 6e3da5a39a21c75de5d0dff9edbe767232a20310)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045403#comment-16045403 ]

Hudson commented on HBASE-18192:

SUCCESS: Integrated in Jenkins build HBase-1.2-JDK8 #147 (See [https://builds.apache.org/job/HBase-1.2-JDK8/147/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 96e48c3df597fc1450546818e2bd34cfc1fd5c10)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045400#comment-16045400 ]

Hudson commented on HBASE-18192:

SUCCESS: Integrated in Jenkins build HBase-1.2-JDK7 #151 (See [https://builds.apache.org/job/HBase-1.2-JDK7/151/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 96e48c3df597fc1450546818e2bd34cfc1fd5c10)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045354#comment-16045354 ]

Hudson commented on HBASE-18192:

SUCCESS: Integrated in Jenkins build HBase-1.2-IT #884 (See [https://builds.apache.org/job/HBase-1.2-IT/884/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 96e48c3df597fc1450546818e2bd34cfc1fd5c10)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045349#comment-16045349 ]

Hudson commented on HBASE-18192:

SUCCESS: Integrated in Jenkins build HBase-1.3-IT #63 (See [https://builds.apache.org/job/HBase-1.3-IT/63/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev 6a216c787a6099dfd90f7733d574069ea866a708)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045338#comment-16045338 ]

Hadoop QA commented on HBASE-18192:

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 21s | Docker mode activated. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 1s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 3m 58s | master passed |
| +1 | compile | 0m 42s | master passed |
| +1 | checkstyle | 0m 50s | master passed |
| +1 | mvneclipse | 0m 17s | master passed |
| +1 | findbugs | 1m 55s | master passed |
| +1 | javadoc | 0m 28s | master passed |
| +1 | mvninstall | 0m 44s | the patch passed |
| +1 | compile | 0m 38s | the patch passed |
| +1 | javac | 0m 38s | the patch passed |
| +1 | checkstyle | 0m 48s | the patch passed |
| +1 | mvneclipse | 0m 15s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 29m 1s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. |
| +1 | findbugs | 1m 59s | the patch passed |
| +1 | javadoc | 0m 27s | the patch passed |
| -1 | unit | 112m 52s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
| | | 155m 54s | |

|| Reason || Tests ||
| Timed out junit tests | org.apache.hadoop.hbase.coprocessor.TestCoprocessorMetrics |

|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:757bf37 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12872383/HBASE-18192.master.001.patch |
| JIRA Issue | HBASE-18192 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 26361e580bba 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / e5ea457 |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/7164/artifact/patchprocess/patch-unit-hbase-server.txt |
| unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/7164/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/7164/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/7164/console |
| Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044537#comment-16044537 ] Ted Yu commented on HBASE-18192: Mind attaching a patch for the master branch?
> Replication drops recovered queues on region server shutdown
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.3.1, 1.2.6
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
> Attachments: HBASE-18192.branch-1.3.001.patch, HBASE-18192.branch-1.3.002.patch
>
> When a recovered queue has only one active ReplicationWorkerThread, the recovered queue is completely dropped on a region server shutdown. This will happen in situations where:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one WAL group for some reason (for example HBASE-18137).
> 3. All other replication workers have died due to unhandled exceptions, and only the last one finishes. This will cause the recovered queue to be deleted without a region server shutdown. This can happen on deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {Code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronize to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker threads to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {Code}
> The conceptual issue is that isWorkerActive() tells whether a worker is currently running, and it is being used as a proxy for whether the worker has finished its work. But, in fact, "Should a worker exit?" and "Has it completed its work?" are two different questions.
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
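The distinction the description draws, "Should a worker exit?" versus "Has it completed its work?", can be sketched with two separate flags. This is a minimal illustrative sketch only, not the actual HBASE-18192 patch; `WorkerSketch` and all of its members are hypothetical names:

```java
// Illustrative sketch: track "asked to stop" and "finished its work" as
// separate state, so recovered-queue cleanup keys off actual completion
// rather than off whether the worker thread happens to still be running.
public class WorkerSketch {
    private volatile boolean stopRequested = false; // "should the worker exit?"
    private volatile boolean workComplete = false;  // "has it completed its work?"
    private int remainingEntries;

    public WorkerSketch(int entries) {
        this.remainingEntries = entries;
    }

    /** Called on region server shutdown: asks the worker to exit. */
    public void requestStop() {
        stopRequested = true;
    }

    /** The worker run loop: ships entries until stopped or drained. */
    public void run() {
        while (!stopRequested && remainingEntries > 0) {
            remainingEntries--; // ship one WAL entry (stand-in for real work)
        }
        // Record completion only if the queue was actually drained,
        // not merely because a stop was requested mid-queue.
        workComplete = (remainingEntries == 0);
    }

    /** Cleanup (e.g. closing a recovered queue) should consult this,
     *  not Thread.isAlive(). */
    public boolean isWorkComplete() {
        return workComplete;
    }
}
```

With this separation, a shutdown that interrupts a worker mid-queue leaves `isWorkComplete()` false, so the recovered queue survives and can be picked up again, which is exactly the behavior the buggy `isWorkerActive()` proxy failed to guarantee.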
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043821#comment-16043821 ] Ted Yu commented on HBASE-18192: lgtm
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043798#comment-16043798 ] Hadoop QA commented on HBASE-18192: ---
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 4s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 19s {color} | {color:green} branch-1.3 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} branch-1.3 passed with JDK v1.8.0_131 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} branch-1.3 passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 59s {color} | {color:green} branch-1.3 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 19s {color} | {color:green} branch-1.3 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 5s {color} | {color:red} hbase-server in branch-1.3 has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s {color} | {color:green} branch-1.3 passed with JDK v1.8.0_131 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s {color} | {color:green} branch-1.3 passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s {color} | {color:green} the patch passed with JDK v1.8.0_131 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s {color} | {color:green} the patch passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 38s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 58s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 17m 51s {color} | {color:green} The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.8.0_131 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s {color} | {color:green} the patch passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 89m 20s {color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 124m 2s {color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:9ba21e3 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12872157/HBASE-18192.branch-1.3.002.patch |
| JIRA Issue | HBASE-18192 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 898c81d294f9 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1.3 / 4227757 |
| Default Java | 1.7.0_131 |
| Multi-JDK ve
[jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
[ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042270#comment-16042270 ] Ashu Pachauri commented on HBASE-18192: --- Uploading a patch for branch-1.3 for review. I'll update the patches for the other branches after review, because the codebase has diverged and the branch-1.3 patch won't apply cleanly.
> Replication drops recovered queues on region server shutdown
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.6
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
> Attachments: HBASE-18192.branch-1.3.001.patch
-- This message was sent by Atlassian JIRA (v6.3.15#6346)