[jira] [Commented] (GEODE-5155) hang recovering transaction state for crashed server
[ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461340#comment-16461340 ] ASF subversion and git services commented on GEODE-5155: Commit 2c1b8a4edd99c3b5d25697a08b917de3310c31ae in geode's branch refs/heads/develop from [~bschuchardt] [ https://gitbox.apache.org/repos/asf?p=geode.git;h=2c1b8a4 ] GEODE-5155 hang recovering transaction state for crashed server fixing an NPE caused by the test creating a new cache in between starting a transaction and installing a test hook. > hang recovering transaction state for crashed server > > > Key: GEODE-5155 > URL: https://issues.apache.org/jira/browse/GEODE-5155 > Project: Geode > Issue Type: Bug > Components: distributed lock service, transactions >Affects Versions: 1.7.0 >Reporter: Bruce Schuchardt >Assignee: Bruce Schuchardt >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > Time Spent: 40m > Remaining Estimate: 0h > > A concourse job failed in > DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with > two threads stuck in this state: > {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71 > [vm2] java.lang.Thread.State: WAITING > [vm2] at java.lang.Object.wait(Native Method) > [vm2] - waiting on > org.apache.geode.internal.cache.TXCommitMessage@2105ce6 > [vm2] at java.lang.Object.wait(Object.java:502) > [vm2] at > org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144) > [vm2] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [vm2] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865) > [vm2] at java.lang.Thread.run(Thread.java:748) > {noformat} > I modified the test to tighten up its forcedDisconnect and performOps methods > to get transaction recovery to happen more reliably. > {code} > public void forceDisconnect() throws Exception { > Cache existingCache = basicGetCache(); > synchronized(commitLock) { > committing = false; > while (!committing) { > commitLock.wait(); > } > } > if (existingCache != null && !existingCache.isClosed()) { > > DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem()); > } > } > public void performOps() { > Cache cache = getCache(); > Region region = cache.getRegion("TestRegion"); > DistributedLockService dlockService = > DistributedLockService.getServiceNamed("Bulldog"); > Random random = new Random(); > while (!cache.isClosed()) { > boolean locked = false; > try { > locked = dlockService.lock("testDLock", 500, 60_000); > if (!locked) { > // this could happen if we're starved out for 30sec by other VMs > continue; > } > cache.getCacheTransactionManager().begin(); > region.put("TestKey", "TestValue" + random.nextInt(10)); > TXManagerImpl mgr = (TXManagerImpl) > getCache().getCacheTransactionManager(); > TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState(); > TXState txState = (TXState) txProxy.getRealDeal(null, null); > txState.setBeforeSend(() -> { > synchronized(commitLock) { > committing = true; > commitLock.notifyAll(); > }}); > try { > cache.getCacheTransactionManager().commit(); > } catch (CommitConflictException e) { > throw new RuntimeException("dlock failed to prevent a transaction > conflict", e); > } > int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT); > getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1); > } catch (CancelException | IllegalStateException e) { > // okay to ignore > } finally { > if (locked) { > try { >
[jira] [Commented] (GEODE-5155) hang recovering transaction state for crashed server
[ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460120#comment-16460120 ] ASF subversion and git services commented on GEODE-5155: Commit 2a02923a2fe3ead102ff79e76c47935e84f76859 in geode's branch refs/heads/develop from [~bschuchardt] [ https://gitbox.apache.org/repos/asf?p=geode.git;h=2a02923 ] GEODE-5155 hang recovering transaction state for crashed server The fix is to check to see if the message was removed due to a memberDeparted event. When that happens a departureNoticed flag was being set in TXCommitMessage but the wait loop in the transaction tracker wasn't checking this flag. > hang recovering transaction state for crashed server > > > Key: GEODE-5155 > URL: https://issues.apache.org/jira/browse/GEODE-5155 > Project: Geode > Issue Type: Bug > Components: distributed lock service, transactions >Affects Versions: 1.7.0 >Reporter: Bruce Schuchardt >Assignee: Bruce Schuchardt >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > A concourse job failed in > DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with > two threads stuck in this state: > {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71 > [vm2] java.lang.Thread.State: WAITING > [vm2] at java.lang.Object.wait(Native Method) > [vm2] - waiting on > org.apache.geode.internal.cache.TXCommitMessage@2105ce6 > [vm2] at java.lang.Object.wait(Object.java:502) > [vm2] at > org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144) > [vm2] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [vm2] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865) > [vm2] at java.lang.Thread.run(Thread.java:748) > {noformat} > I modified the test to tighten up its forcedDisconnect and performOps methods > to get transaction recovery to happen more reliably. > {code} > public void forceDisconnect() throws Exception { > Cache existingCache = basicGetCache(); > synchronized(commitLock) { > committing = false; > while (!committing) { > commitLock.wait(); > } > } > if (existingCache != null && !existingCache.isClosed()) { > > DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem()); > } > } > public void performOps() { > Cache cache = getCache(); > Region region = cache.getRegion("TestRegion"); > DistributedLockService dlockService = > DistributedLockService.getServiceNamed("Bulldog"); > Random random = new Random(); > while (!cache.isClosed()) { > boolean locked = false; > try { > locked = dlockService.lock("testDLock", 500, 60_000); > if (!locked) { > // this could happen if we're starved out for 30sec by other VMs > continue; > } > cache.getCacheTransactionManager().begin(); > region.put("TestKey", "TestValue" + random.nextInt(10)); > TXManagerImpl mgr = (TXManagerImpl) > getCache().getCacheTransactionManager(); > TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState(); > TXState txState = (TXState) txProxy.getRealDeal(null, null); > txState.setBeforeSend(() -> { > synchronized(commitLock) { > committing = true; > commitLock.notifyAll(); > }}); > try { > cache.getCacheTransactionManager().commit(); > } catch (CommitConflictException e) { > throw new RuntimeException("dlock failed to prevent a transaction > conflict", e); > } > int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT); > getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1); > } catch (CancelException | IllegalStateException e) { > //
[jira] [Commented] (GEODE-5155) hang recovering transaction state for crashed server
[ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459170#comment-16459170 ] ASF subversion and git services commented on GEODE-5155: Commit e1688b6ab99fb4abca5be8e260eb0380eba33695 in geode's branch refs/heads/feature/GEODE-5155 from [~bschuchardt] [ https://gitbox.apache.org/repos/asf?p=geode.git;h=e1688b6 ] GEODE-5155 hang recovering transaction state for crashed server renaming variables to make the code more readable > hang recovering transaction state for crashed server > > > Key: GEODE-5155 > URL: https://issues.apache.org/jira/browse/GEODE-5155 > Project: Geode > Issue Type: Bug > Components: distributed lock service, transactions >Affects Versions: 1.7.0 >Reporter: Bruce Schuchardt >Assignee: Bruce Schuchardt >Priority: Major > > A concourse job failed in > DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with > two threads stuck in this state: > {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71 > [vm2] java.lang.Thread.State: WAITING > [vm2] at java.lang.Object.wait(Native Method) > [vm2] - waiting on > org.apache.geode.internal.cache.TXCommitMessage@2105ce6 > [vm2] at java.lang.Object.wait(Object.java:502) > [vm2] at > org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144) > [vm2] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [vm2] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865) > [vm2] at java.lang.Thread.run(Thread.java:748) > {noformat} > I modified the test to tighten up its forcedDisconnect and performOps methods > to get transaction recovery to happen more reliably. > {code} > public void forceDisconnect() throws Exception { > Cache existingCache = basicGetCache(); > synchronized(commitLock) { > committing = false; > while (!committing) { > commitLock.wait(); > } > } > if (existingCache != null && !existingCache.isClosed()) { > > DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem()); > } > } > public void performOps() { > Cache cache = getCache(); > Region region = cache.getRegion("TestRegion"); > DistributedLockService dlockService = > DistributedLockService.getServiceNamed("Bulldog"); > Random random = new Random(); > while (!cache.isClosed()) { > boolean locked = false; > try { > locked = dlockService.lock("testDLock", 500, 60_000); > if (!locked) { > // this could happen if we're starved out for 30sec by other VMs > continue; > } > cache.getCacheTransactionManager().begin(); > region.put("TestKey", "TestValue" + random.nextInt(10)); > TXManagerImpl mgr = (TXManagerImpl) > getCache().getCacheTransactionManager(); > TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState(); > TXState txState = (TXState) txProxy.getRealDeal(null, null); > txState.setBeforeSend(() -> { > synchronized(commitLock) { > committing = true; > commitLock.notifyAll(); > }}); > try { > cache.getCacheTransactionManager().commit(); > } catch (CommitConflictException e) { > throw new RuntimeException("dlock failed to prevent a transaction > conflict", e); > } > int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT); > getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1); > } catch (CancelException | IllegalStateException e) { > // okay to ignore > } finally { > if (locked) { > try { > dlockService.unlock("testDLock"); > } catch (CancelException | IllegalStateException e) { > // shutting down > } > } > } > } > }
[jira] [Commented] (GEODE-5155) hang recovering transaction state for crashed server
[ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459166#comment-16459166 ] ASF subversion and git services commented on GEODE-5155: Commit 1a9ee1f198877d6d715c9563f2d68cfb318ba88b in geode's branch refs/heads/feature/GEODE-5155 from [~bschuchardt] [ https://gitbox.apache.org/repos/asf?p=geode.git;h=1a9ee1f ] GEODE-5155 hang recovering transaction state for crashed server oops - missed a change > hang recovering transaction state for crashed server > > > Key: GEODE-5155 > URL: https://issues.apache.org/jira/browse/GEODE-5155 > Project: Geode > Issue Type: Bug > Components: distributed lock service, transactions >Affects Versions: 1.7.0 >Reporter: Bruce Schuchardt >Assignee: Bruce Schuchardt >Priority: Major > > A concourse job failed in > DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with > two threads stuck in this state: > {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71 > [vm2] java.lang.Thread.State: WAITING > [vm2] at java.lang.Object.wait(Native Method) > [vm2] - waiting on > org.apache.geode.internal.cache.TXCommitMessage@2105ce6 > [vm2] at java.lang.Object.wait(Object.java:502) > [vm2] at > org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144) > [vm2] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [vm2] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865) > [vm2] at java.lang.Thread.run(Thread.java:748) > {noformat} > I modified the test to tighten up its forcedDisconnect and performOps methods > to get transaction recovery to happen more reliably. > {code} > public void forceDisconnect() throws Exception { > Cache existingCache = basicGetCache(); > synchronized(commitLock) { > committing = false; > while (!committing) { > commitLock.wait(); > } > } > if (existingCache != null && !existingCache.isClosed()) { > > DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem()); > } > } > public void performOps() { > Cache cache = getCache(); > Region region = cache.getRegion("TestRegion"); > DistributedLockService dlockService = > DistributedLockService.getServiceNamed("Bulldog"); > Random random = new Random(); > while (!cache.isClosed()) { > boolean locked = false; > try { > locked = dlockService.lock("testDLock", 500, 60_000); > if (!locked) { > // this could happen if we're starved out for 30sec by other VMs > continue; > } > cache.getCacheTransactionManager().begin(); > region.put("TestKey", "TestValue" + random.nextInt(10)); > TXManagerImpl mgr = (TXManagerImpl) > getCache().getCacheTransactionManager(); > TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState(); > TXState txState = (TXState) txProxy.getRealDeal(null, null); > txState.setBeforeSend(() -> { > synchronized(commitLock) { > committing = true; > commitLock.notifyAll(); > }}); > try { > cache.getCacheTransactionManager().commit(); > } catch (CommitConflictException e) { > throw new RuntimeException("dlock failed to prevent a transaction > conflict", e); > } > int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT); > getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1); > } catch (CancelException | IllegalStateException e) { > // okay to ignore > } finally { > if (locked) { > try { > dlockService.unlock("testDLock"); > } catch (CancelException | IllegalStateException e) { > // shutting down > } > } > } > } > } > {code} > The problem is t