[jira] [Commented] (GEODE-5155) hang recovering transaction state for crashed server

2018-05-02 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461340#comment-16461340
 ] 

ASF subversion and git services commented on GEODE-5155:


Commit 2c1b8a4edd99c3b5d25697a08b917de3310c31ae in geode's branch 
refs/heads/develop from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=2c1b8a4 ]

GEODE-5155 hang recovering transaction state for crashed server

fixing an NPE caused by the test creating a new cache in between
starting a transaction and installing a test hook.
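
A minimal sketch of how this kind of NPE can arise, assuming the test's cache accessor creates a fresh cache once the previous one has been closed. All names here (FakeCache, TxManager, CacheHolder) are hypothetical stand-ins and not Geode classes. The sketch only shows why a second lookup can return an object that knows nothing about the transaction started on the first one.

```java
public class StaleCacheReference {
  /** Hypothetical transaction manager: tracks at most one in-progress transaction. */
  static final class TxManager {
    private Object currentTx;

    void begin() {
      currentTx = new Object();
    }

    Object getTxState() {
      return currentTx; // null when no transaction was begun on this manager
    }
  }

  /** Hypothetical cache that owns one transaction manager. */
  static final class FakeCache {
    final TxManager txManager = new TxManager();
    volatile boolean closed;
  }

  /** Hypothetical accessor that creates a new cache if the current one is closed. */
  static final class CacheHolder {
    private FakeCache cache = new FakeCache();

    synchronized FakeCache getCache() {
      if (cache.closed) {
        cache = new FakeCache();
      }
      return cache;
    }
  }

  public static void main(String[] args) {
    CacheHolder holder = new CacheHolder();
    FakeCache cache = holder.getCache();
    cache.txManager.begin();

    // Simulates the cache going away between begin() and installing the hook.
    cache.closed = true;

    // A second lookup returns a new cache whose manager has no transaction.
    Object viaFreshLookup = holder.getCache().txManager.getTxState();
    // The reference obtained before begin() still sees the transaction.
    Object viaHeldReference = cache.txManager.getTxState();

    System.out.println("fresh lookup has tx: " + (viaFreshLookup != null));
    System.out.println("held reference has tx: " + (viaHeldReference != null));
  }
}
```

Dereferencing the null state from the fresh lookup is what would throw in this sketch; reusing the reference obtained before begin() avoids it.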


> hang recovering transaction state for crashed server
> 
>
> Key: GEODE-5155
> URL: https://issues.apache.org/jira/browse/GEODE-5155
> Project: Geode
> Issue Type: Bug
> Components: distributed lock service, transactions
> Affects Versions: 1.7.0
> Reporter: Bruce Schuchardt
> Assignee: Bruce Schuchardt
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.7.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>

[jira] [Commented] (GEODE-5155) hang recovering transaction state for crashed server

2018-05-01 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460120#comment-16460120
 ] 

ASF subversion and git services commented on GEODE-5155:


Commit 2a02923a2fe3ead102ff79e76c47935e84f76859 in geode's branch 
refs/heads/develop from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=2a02923 ]

GEODE-5155 hang recovering transaction state for crashed server

The fix is to check to see if the message was removed due to a
memberDeparted event.

When that happens a departureNoticed flag was being set in TXCommitMessage
but the wait loop in the transaction tracker wasn't checking this flag.
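
A minimal sketch of the pattern described above, assuming a tracked message that carries both a "processed" state and a departure flag. TrackedMessage and its methods are hypothetical and are not the actual TXCommitMessage or TXFarSideCMTracker code. The point is only that the wait loop's condition must include the departure flag, otherwise a waiter is never released once the sender has crashed.

```java
public class DepartureAwareWait {
  /** Hypothetical stand-in for a tracked commit message. */
  static final class TrackedMessage {
    private boolean processed;        // set once the commit has been applied
    private boolean departureNoticed; // set when the originating member has left

    synchronized void markProcessed() {
      processed = true;
      notifyAll();
    }

    synchronized void markDeparted() {
      departureNoticed = true;
      notifyAll();
    }

    /**
     * Blocks until the message is processed or its sender is known to be gone.
     * Returns true only if the message was processed.
     */
    synchronized boolean waitToProcess() throws InterruptedException {
      // Checking only `processed` here would hang forever after a departure.
      while (!processed && !departureNoticed) {
        wait();
      }
      return processed;
    }
  }

  public static void main(String[] args) throws Exception {
    TrackedMessage msg = new TrackedMessage();

    // Simulates a memberDeparted notification arriving while a thread waits.
    Thread departure = new Thread(() -> {
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      msg.markDeparted();
    });
    departure.start();

    boolean processed = msg.waitToProcess();
    departure.join();
    System.out.println("processed=" + processed); // prints "processed=false"
  }
}
```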


> hang recovering transaction state for crashed server
> 
>
> Key: GEODE-5155
> URL: https://issues.apache.org/jira/browse/GEODE-5155
> Project: Geode
> Issue Type: Bug
> Components: distributed lock service, transactions
> Affects Versions: 1.7.0
> Reporter: Bruce Schuchardt
> Assignee: Bruce Schuchardt
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>

[jira] [Commented] (GEODE-5155) hang recovering transaction state for crashed server

2018-04-30 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459170#comment-16459170
 ] 

ASF subversion and git services commented on GEODE-5155:


Commit e1688b6ab99fb4abca5be8e260eb0380eba33695 in geode's branch 
refs/heads/feature/GEODE-5155 from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=e1688b6 ]

GEODE-5155 hang recovering transaction state for crashed server

renaming variables to make the code more readable


> hang recovering transaction state for crashed server
> 
>
> Key: GEODE-5155
> URL: https://issues.apache.org/jira/browse/GEODE-5155
> Project: Geode
> Issue Type: Bug
> Components: distributed lock service, transactions
> Affects Versions: 1.7.0
> Reporter: Bruce Schuchardt
> Assignee: Bruce Schuchardt
> Priority: Major
>

[jira] [Commented] (GEODE-5155) hang recovering transaction state for crashed server

2018-04-30 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459166#comment-16459166
 ] 

ASF subversion and git services commented on GEODE-5155:


Commit 1a9ee1f198877d6d715c9563f2d68cfb318ba88b in geode's branch 
refs/heads/feature/GEODE-5155 from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=1a9ee1f ]

GEODE-5155 hang recovering transaction state for crashed server

oops - missed a change


> hang recovering transaction state for crashed server
> 
>
> Key: GEODE-5155
> URL: https://issues.apache.org/jira/browse/GEODE-5155
> Project: Geode
> Issue Type: Bug
> Components: distributed lock service, transactions
> Affects Versions: 1.7.0
> Reporter: Bruce Schuchardt
> Assignee: Bruce Schuchardt
> Priority: Major
>
> A concourse job failed in 
> DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with 
> two threads stuck in this state:
> {noformat}
> [vm2] "Pooled Waiting Message Processor 2" tid=0x71
> [vm2]     java.lang.Thread.State: WAITING
> [vm2]     at java.lang.Object.wait(Native Method)
> [vm2]     -  waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
> [vm2]     at java.lang.Object.wait(Object.java:502)
> [vm2]     at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
> [vm2]     at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
> [vm2]     at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
> [vm2]     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [vm2]     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [vm2]     at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
> [vm2]     at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
> [vm2]     at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
> [vm2]     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I modified the test to tighten up its forceDisconnect and performOps methods 
> so that transaction recovery happens more reliably.
> {code}
>   public void forceDisconnect() throws Exception {
>     Cache existingCache = basicGetCache();
>     synchronized(commitLock) {
>       committing = false;
>       while (!committing) {
>         commitLock.wait();
>       }
>     }
>     if (existingCache != null && !existingCache.isClosed()) {
>       DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
>     }
>   }
>
>   public void performOps() {
>     Cache cache = getCache();
>     Region region = cache.getRegion("TestRegion");
>     DistributedLockService dlockService =
>         DistributedLockService.getServiceNamed("Bulldog");
>     Random random = new Random();
>     while (!cache.isClosed()) {
>       boolean locked = false;
>       try {
>         locked = dlockService.lock("testDLock", 500, 60_000);
>         if (!locked) {
>           // this could happen if we're starved out for 30sec by other VMs
>           continue;
>         }
>         cache.getCacheTransactionManager().begin();
>         region.put("TestKey", "TestValue" + random.nextInt(10));
>         TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager();
>         TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
>         TXState txState = (TXState) txProxy.getRealDeal(null, null);
>         txState.setBeforeSend(() -> {
>           synchronized(commitLock) {
>             committing = true;
>             commitLock.notifyAll();
>           }});
>         try {
>           cache.getCacheTransactionManager().commit();
>         } catch (CommitConflictException e) {
>           throw new RuntimeException("dlock failed to prevent a transaction conflict", e);
>         }
>         int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
>         getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);
>       } catch (CancelException | IllegalStateException e) {
>         // okay to ignore
>       } finally {
>         if (locked) {
>           try {
>             dlockService.unlock("testDLock");
>           } catch (CancelException | IllegalStateException e) {
>             // shutting down
>           }
>         }
>       }
>     }
>   }
> {code}
> The problem is t