[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2018-03-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-18192:
--
Fix Version/s: (was: 3.0.0)

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2, 2.0.0
>
> Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>
> When a recovered queue has only one active ReplicationSourceWorkerThread, the 
> recovered queue is completely dropped on region server shutdown. This can 
> happen when:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used but replication is stuck on one WAL group 
> for some reason (for example, HBASE-18137).
> 3. All other replication workers have died due to unhandled exceptions and 
> only the last one finishes. In this case the recovered queue gets deleted 
> even without a region server shutdown; this can happen on deployments 
> without the fix for HBASE-17381.
> The problematic piece of code is:
> {code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronized to make sure one last thread will clean the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker threads to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {code}
> The conceptual issue is that isWorkerActive() tells whether a worker is 
> currently running, but it is being used as a proxy for whether the worker 
> has finished its work. In fact, "Should a worker exit?" and "Has the worker 
> completed its work?" are two different questions.
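A minimal, hypothetical Java sketch (not the actual HBASE-18192 patch; all class and method names here are illustrative) of how the two questions can be kept separate: each worker records completion explicitly, so the queue is eligible for cleanup only when every worker has actually drained its WAL group, regardless of why the last thread's run loop exited.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: "should this worker exit?" and "has it finished its
// work?" are tracked as two separate flags, so a graceful shutdown (which
// only flips shouldExit) can no longer be mistaken for work completion.
class RecoveredQueueTracker {
    static final class Worker {
        final AtomicBoolean shouldExit = new AtomicBoolean(false);
        final AtomicBoolean workCompleted = new AtomicBoolean(false);
    }

    private final Map<String, Worker> workers = new ConcurrentHashMap<>();

    Worker register(String walGroup) {
        Worker w = new Worker();
        workers.put(walGroup, w);
        return w;
    }

    // Called when a worker's run loop ends. Returns true only when every
    // registered worker has actually finished its work, i.e. the recovered
    // queue may safely be removed. A worker that was merely told to exit
    // (shouldExit set, workCompleted not set) blocks the cleanup.
    boolean canCloseRecoveredQueue() {
        synchronized (workers) {
            for (Worker w : workers.values()) {
                if (!w.workCompleted.get()) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

With this split, the buggy pattern in the snippet above (checking only whether other threads are still alive) cannot delete a queue whose workers exited early during shutdown.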



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-21 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Release Note: 
If a region server that is processing a recovered queue for another, previously 
dead region server is gracefully shut down, it can drop the recovered queue 
under certain conditions. Running without this fix on a 1.2+ release means the 
possibility of continuing data loss in replication, irrespective of which 
WALProvider is used.
If a single WAL group (or DefaultWALProvider) is used, running without this fix 
will always cause data loss in replication whenever a region server processing 
recovered queues is gracefully shut down.

  was:
If a region server that is processing recovered queue for another previously 
dead region server is gracefully shut down, it can drop the recovered queue 
under certain conditions. Running without this fix on a 1.2+ release means 
possibility of continuing data loss in replication, irrespective of which 
WALProvider is used.
If a single WAL group (or DefaultWALProvider) is used, this will always cause 
dataloss in replication whenever a region server processing recovered queues is 
gracefully shutdown.


> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
>
> Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-21 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Release Note: 
If a region server that is processing recovered queue for another previously 
dead region server is gracefully shut down, it can drop the recovered queue 
under certain conditions. Running without this fix on a 1.2+ release means 
possibility of continuing data loss in replication, irrespective of which 
WALProvider is used.
If a single WAL group (or DefaultWALProvider) is used, this will always cause 
dataloss in replication whenever a region server processing recovered queues is 
gracefully shutdown.

  was:
If region server that is processing recovered queue for another previously dead 
region server is gracefully shut down, it can drop the recovered queue under 
certain conditions. Running without this fix on a 1.2+ release means 
possibility of continuing data loss in replication, irrespective of which 
WALProvider is used.
If a single WAL group (or DefaultWALProvider) is used, this will always cause 
dataloss in replication whenever a region server processing recovered queues is 
gracefully shutdown.


> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
>
> Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-21 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Release Note: 
If region server that is processing recovered queue for another previously dead 
region server is gracefully shut down, it can drop the recovered queue under 
certain conditions. Running without this fix on a 1.2+ release means 
possibility of continuing data loss in replication, irrespective of which 
WALProvider is used.
If a single WAL group (or DefaultWALProvider) is used, this will always cause 
dataloss in replication whenever a region server processing recovered queues is 
gracefully shutdown.

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
>
> Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-18192:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: (was: 2.0.0)
   2.0.0-alpha-2
   3.0.0
   Status: Resolved  (was: Patch Available)

Thanks for the patch, Ashu

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
>
> Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Attachment: (was: HBASE-18192.branch-1.3.001.patch)

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Attachment: (was: HBASE-18192.branch-1.3.002.patch)

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-09 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Attachment: HBASE-18192.master.001.patch
HBASE-18192.branch-1.001.patch
HBASE-18192.branch-1.3.003.patch

Uploading patches for master and branch-1.
Also, uploading a new patch for branch-1.3 which has an extra check for proper 
cleanup of recovered queues. Please use HBASE-18192.branch-1.3.003.patch for 
branch-1.3. 

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18192.branch-1.001.patch, 
> HBASE-18192.branch-1.3.001.patch, HBASE-18192.branch-1.3.002.patch, 
> HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-08 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Status: Patch Available  (was: Open)

Submitting for QA run.

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.6, 1.3.1
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18192.branch-1.3.001.patch, 
> HBASE-18192.branch-1.3.002.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-08 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Attachment: HBASE-18192.branch-1.3.002.patch

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18192.branch-1.3.001.patch, 
> HBASE-18192.branch-1.3.002.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-08 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Affects Version/s: (was: 1.4.0)
   (was: 2.0.0)

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18192.branch-1.3.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown

2017-06-08 Thread Ashu Pachauri (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-18192:
--
Attachment: HBASE-18192.branch-1.3.001.patch

> Replication drops recovered queues on region server shutdown
> 
>
> Key: HBASE-18192
> URL: https://issues.apache.org/jira/browse/HBASE-18192
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.6
>Reporter: Ashu Pachauri
>Assignee: Ashu Pachauri
>Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18192.branch-1.3.001.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)