[jira] [Updated] (GEODE-5155) hang recovering transaction state for crashed server
[ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated GEODE-5155: -- Labels: pull-request-available (was: ) > hang recovering transaction state for crashed server > > > Key: GEODE-5155 > URL: https://issues.apache.org/jira/browse/GEODE-5155 > Project: Geode > Issue Type: Bug > Components: distributed lock service, transactions >Affects Versions: 1.7.0 >Reporter: Bruce Schuchardt >Assignee: Bruce Schuchardt >Priority: Major > Labels: pull-request-available > > A concourse job failed in > DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with > two threads stuck in this state: > {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71 > [vm2] java.lang.Thread.State: WAITING > [vm2] at java.lang.Object.wait(Native Method) > [vm2] - waiting on > org.apache.geode.internal.cache.TXCommitMessage@2105ce6 > [vm2] at java.lang.Object.wait(Object.java:502) > [vm2] at > org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144) > [vm2] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [vm2] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865) > [vm2] at java.lang.Thread.run(Thread.java:748) > {noformat} > I modified the test to tighten up its forcedDisconnect and performOps methods > to get transaction recovery to happen more reliably. > {code} > public void forceDisconnect() throws Exception { > Cache existingCache = basicGetCache(); > synchronized(commitLock) { > committing = false; > while (!committing) { > commitLock.wait(); > } > } > if (existingCache != null && !existingCache.isClosed()) { > > DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem()); > } > } > public void performOps() { > Cache cache = getCache(); > Region region = cache.getRegion("TestRegion"); > DistributedLockService dlockService = > DistributedLockService.getServiceNamed("Bulldog"); > Random random = new Random(); > while (!cache.isClosed()) { > boolean locked = false; > try { > locked = dlockService.lock("testDLock", 500, 60_000); > if (!locked) { > // this could happen if we're starved out for 30sec by other VMs > continue; > } > cache.getCacheTransactionManager().begin(); > region.put("TestKey", "TestValue" + random.nextInt(10)); > TXManagerImpl mgr = (TXManagerImpl) > getCache().getCacheTransactionManager(); > TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState(); > TXState txState = (TXState) txProxy.getRealDeal(null, null); > txState.setBeforeSend(() -> { > synchronized(commitLock) { > committing = true; > commitLock.notifyAll(); > }}); > try { > cache.getCacheTransactionManager().commit(); > } catch (CommitConflictException e) { > throw new RuntimeException("dlock failed to prevent a transaction > conflict", e); > } > int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT); > getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1); > } catch (CancelException | IllegalStateException e) { > // okay to ignore > } finally { > if (locked) { > try { > dlockService.unlock("testDLock"); > } catch (CancelException | IllegalStateException e) { > // shutting down > } > } > } > } > } > {code} > The problem is that the membership listener in TXCommitMessage is removing > itself from the transaction map in TXFarSideCMTracker without setting any > state that the recovery message can check. The recovery method is waiting > like this: > {code} > synchronized
[jira] [Updated] (GEODE-5155) hang recovering transaction state for crashed server
[ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Schuchardt updated GEODE-5155: Affects Version/s: 1.7.0 > hang recovering transaction state for crashed server > > > Key: GEODE-5155 > URL: https://issues.apache.org/jira/browse/GEODE-5155 > Project: Geode > Issue Type: Bug > Components: distributed lock service, transactions >Affects Versions: 1.7.0 >Reporter: Bruce Schuchardt >Assignee: Bruce Schuchardt >Priority: Major > > A concourse job failed in > DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with > two threads stuck in this state: > {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71 > [vm2] java.lang.Thread.State: WAITING > [vm2] at java.lang.Object.wait(Native Method) > [vm2] - waiting on > org.apache.geode.internal.cache.TXCommitMessage@2105ce6 > [vm2] at java.lang.Object.wait(Object.java:502) > [vm2] at > org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144) > [vm2] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [vm2] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865) > [vm2] at java.lang.Thread.run(Thread.java:748) > {noformat} > I modified the test to tighten up its forcedDisconnect and performOps methods > to get transaction recovery to happen more reliably. > {code} > public void forceDisconnect() throws Exception { > Cache existingCache = basicGetCache(); > synchronized(commitLock) { > committing = false; > while (!committing) { > commitLock.wait(); > } > } > if (existingCache != null && !existingCache.isClosed()) { > > DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem()); > } > } > public void performOps() { > Cache cache = getCache(); > Region region = cache.getRegion("TestRegion"); > DistributedLockService dlockService = > DistributedLockService.getServiceNamed("Bulldog"); > Random random = new Random(); > while (!cache.isClosed()) { > boolean locked = false; > try { > locked = dlockService.lock("testDLock", 500, 60_000); > if (!locked) { > // this could happen if we're starved out for 30sec by other VMs > continue; > } > cache.getCacheTransactionManager().begin(); > region.put("TestKey", "TestValue" + random.nextInt(10)); > TXManagerImpl mgr = (TXManagerImpl) > getCache().getCacheTransactionManager(); > TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState(); > TXState txState = (TXState) txProxy.getRealDeal(null, null); > txState.setBeforeSend(() -> { > synchronized(commitLock) { > committing = true; > commitLock.notifyAll(); > }}); > try { > cache.getCacheTransactionManager().commit(); > } catch (CommitConflictException e) { > throw new RuntimeException("dlock failed to prevent a transaction > conflict", e); > } > int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT); > getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1); > } catch (CancelException | IllegalStateException e) { > // okay to ignore > } finally { > if (locked) { > try { > dlockService.unlock("testDLock"); > } catch (CancelException | IllegalStateException e) { > // shutting down > } > } > } > } > } > {code} > The problem is that the membership listener in TXCommitMessage is removing > itself from the transaction map in TXFarSideCMTracker without setting any > state that the recovery message can check. The recovery method is waiting > like this: > {code} > synchronized (this.txInProgress) { > mess = (TXCommitMessage)
[jira] [Updated] (GEODE-5155) hang recovering transaction state for crashed server
[ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Schuchardt updated GEODE-5155: Issue Type: Bug (was: New Feature) > hang recovering transaction state for crashed server > > > Key: GEODE-5155 > URL: https://issues.apache.org/jira/browse/GEODE-5155 > Project: Geode > Issue Type: Bug > Components: distributed lock service, transactions >Affects Versions: 1.7.0 >Reporter: Bruce Schuchardt >Assignee: Bruce Schuchardt >Priority: Major > > A concourse job failed in > DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with > two threads stuck in this state: > {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71 > [vm2] java.lang.Thread.State: WAITING > [vm2] at java.lang.Object.wait(Native Method) > [vm2] - waiting on > org.apache.geode.internal.cache.TXCommitMessage@2105ce6 > [vm2] at java.lang.Object.wait(Object.java:502) > [vm2] at > org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160) > [vm2] at > org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144) > [vm2] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [vm2] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) > [vm2] at > org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865) > [vm2] at java.lang.Thread.run(Thread.java:748) > {noformat} > I modified the test to tighten up its forcedDisconnect and performOps methods > to get transaction recovery to happen more reliably. > {code} > public void forceDisconnect() throws Exception { > Cache existingCache = basicGetCache(); > synchronized(commitLock) { > committing = false; > while (!committing) { > commitLock.wait(); > } > } > if (existingCache != null && !existingCache.isClosed()) { > > DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem()); > } > } > public void performOps() { > Cache cache = getCache(); > Region region = cache.getRegion("TestRegion"); > DistributedLockService dlockService = > DistributedLockService.getServiceNamed("Bulldog"); > Random random = new Random(); > while (!cache.isClosed()) { > boolean locked = false; > try { > locked = dlockService.lock("testDLock", 500, 60_000); > if (!locked) { > // this could happen if we're starved out for 30sec by other VMs > continue; > } > cache.getCacheTransactionManager().begin(); > region.put("TestKey", "TestValue" + random.nextInt(10)); > TXManagerImpl mgr = (TXManagerImpl) > getCache().getCacheTransactionManager(); > TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState(); > TXState txState = (TXState) txProxy.getRealDeal(null, null); > txState.setBeforeSend(() -> { > synchronized(commitLock) { > committing = true; > commitLock.notifyAll(); > }}); > try { > cache.getCacheTransactionManager().commit(); > } catch (CommitConflictException e) { > throw new RuntimeException("dlock failed to prevent a transaction > conflict", e); > } > int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT); > getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1); > } catch (CancelException | IllegalStateException e) { > // okay to ignore > } finally { > if (locked) { > try { > dlockService.unlock("testDLock"); > } catch (CancelException | IllegalStateException e) { > // shutting down > } > } > } > } > } > {code} > The problem is that the membership listener in TXCommitMessage is removing > itself from the transaction map in TXFarSideCMTracker without setting any > state that the recovery message can check. The recovery method is waiting > like this: > {code} > synchronized (this.txInProgress) { > mess = (TXCommitMessage)
[jira] [Updated] (GEODE-5155) hang recovering transaction state for crashed server
[ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Schuchardt updated GEODE-5155: Description: A concourse job failed in DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with two threads stuck in this state: {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71 [vm2] java.lang.Thread.State: WAITING [vm2] at java.lang.Object.wait(Native Method) [vm2] - waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6 [vm2] at java.lang.Object.wait(Object.java:502) [vm2] at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176) [vm2] at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160) [vm2] at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144) [vm2] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [vm2] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [vm2] at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121) [vm2] at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) [vm2] at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865) [vm2] at java.lang.Thread.run(Thread.java:748) {noformat} I modified the test to tighten up its forcedDisconnect and performOps methods to get transaction recovery to happen more reliably. {code} public void forceDisconnect() throws Exception { Cache existingCache = basicGetCache(); synchronized(commitLock) { committing = false; while (!committing) { commitLock.wait(); } } if (existingCache != null && !existingCache.isClosed()) { DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem()); } } public void performOps() { Cache cache = getCache(); Region region = cache.getRegion("TestRegion"); DistributedLockService dlockService = DistributedLockService.getServiceNamed("Bulldog"); Random random = new Random(); while (!cache.isClosed()) { boolean locked = false; try { locked = dlockService.lock("testDLock", 500, 60_000); if (!locked) { // this could happen if we're starved out for 30sec by other VMs continue; } cache.getCacheTransactionManager().begin(); region.put("TestKey", "TestValue" + random.nextInt(10)); TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager(); TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState(); TXState txState = (TXState) txProxy.getRealDeal(null, null); txState.setBeforeSend(() -> { synchronized(commitLock) { committing = true; commitLock.notifyAll(); }}); try { cache.getCacheTransactionManager().commit(); } catch (CommitConflictException e) { throw new RuntimeException("dlock failed to prevent a transaction conflict", e); } int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT); getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1); } catch (CancelException | IllegalStateException e) { // okay to ignore } finally { if (locked) { try { dlockService.unlock("testDLock"); } catch (CancelException | IllegalStateException e) { // shutting down } } } } } {code} The problem is that the membership listener in TXCommitMessage is removing itself from the transaction map in TXFarSideCMTracker without setting any state that the recovery message can check. The recovery method is waiting like this: {code} synchronized (this.txInProgress) { mess = (TXCommitMessage) this.txInProgress.get(lk); } if (mess != null) { synchronized (mess) { // tx in progress, we must wait until its done while (!mess.wasProcessed()) { try { mess.wait(); } catch (InterruptedException ie) { Thread.currentThread().interrupt(); logger.error(LocalizedMessage.create( LocalizedStrings.TxFarSideTracker_WAITING_TO_COMPLETE_ON_MESSAGE_0_CAUGHT_AN_INTERRUPTED_EXCEPTION, mess), ie); break; } } } {code} We could probably change this method to make sure that the message is still in the map instead of only checking wasProcessed(). was: A concourse job failed in