[ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461340#comment-16461340 ]
ASF subversion and git services commented on GEODE-5155:
--------------------------------------------------------

Commit 2c1b8a4edd99c3b5d25697a08b917de3310c31ae in geode's branch refs/heads/develop from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=2c1b8a4 ]

GEODE-5155 hang recovering transaction state for crashed server

fixing an NPE caused by the test creating a new cache in between starting a
transaction and installing a test hook.


> hang recovering transaction state for crashed server
> ----------------------------------------------------
>
>                 Key: GEODE-5155
>                 URL: https://issues.apache.org/jira/browse/GEODE-5155
>             Project: Geode
>          Issue Type: Bug
>          Components: distributed lock service, transactions
>    Affects Versions: 1.7.0
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> A concourse job failed in
> DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with
> two threads stuck in this state:
> {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71
> [vm2]     java.lang.Thread.State: WAITING
> [vm2]   at java.lang.Object.wait(Native Method)
> [vm2]   -  waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
> [vm2]   at java.lang.Object.wait(Object.java:502)
> [vm2]   at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
> [vm2]   at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
> [vm2]   at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
> [vm2]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [vm2]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [vm2]   at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
> [vm2]   at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
> [vm2]   at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
> [vm2]   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I modified the test to tighten up its forceDisconnect and performOps methods
> to get transaction recovery to happen more reliably.
> {code}
>   public void forceDisconnect() throws Exception {
>     Cache existingCache = basicGetCache();
>     synchronized(commitLock) {
>       committing = false;
>       while (!committing) {
>         commitLock.wait();
>       }
>     }
>     if (existingCache != null && !existingCache.isClosed()) {
>       DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
>     }
>   }
>   public void performOps() {
>     Cache cache = getCache();
>     Region region = cache.getRegion("TestRegion");
>     DistributedLockService dlockService =
>         DistributedLockService.getServiceNamed("Bulldog");
>     Random random = new Random();
>     while (!cache.isClosed()) {
>       boolean locked = false;
>       try {
>         locked = dlockService.lock("testDLock", 500, 60_000);
>         if (!locked) {
>           // this could happen if we're starved out for 30sec by other VMs
>           continue;
>         }
>         cache.getCacheTransactionManager().begin();
>         region.put("TestKey", "TestValue" + random.nextInt(100000));
>         TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager();
>         TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
>         TXState txState = (TXState) txProxy.getRealDeal(null, null);
>         txState.setBeforeSend(() -> {
>           synchronized(commitLock) {
>             committing = true;
>             commitLock.notifyAll();
>           }});
>         try {
>           cache.getCacheTransactionManager().commit();
>         } catch (CommitConflictException e) {
>           throw new RuntimeException("dlock failed to prevent a transaction conflict", e);
>         }
>         int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
>         getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);
>       } catch (CancelException | IllegalStateException e) {
>         // okay to ignore
>       } finally {
>         if (locked) {
>           try {
>             dlockService.unlock("testDLock");
>           } catch (CancelException | IllegalStateException e) {
>             // shutting down
>           }
>         }
>       }
>     }
>   }
> {code}
> The problem is that the membership listener in TXCommitMessage is removing
> itself from the transaction map in TXFarSideCMTracker without setting any
> state that the recovery message can check.  The recovery method is waiting
> like this:
> {code}
>     synchronized (this.txInProgress) {
>       mess = (TXCommitMessage) this.txInProgress.get(lk);
>     }
>     if (mess != null) {
>       synchronized (mess) {
>         // tx in progress, we must wait until its done
>         while (!mess.wasProcessed()) {
>           try {
>             mess.wait();
>           } catch (InterruptedException ie) {
>             Thread.currentThread().interrupt();
>             logger.error(LocalizedMessage.create(
>                 LocalizedStrings.TxFarSideTracker_WAITING_TO_COMPLETE_ON_MESSAGE_0_CAUGHT_AN_INTERRUPTED_EXCEPTION,
>                 mess), ie);
>             break;
>           }
>         }
>       }
> {code}
> We could probably change this method to make sure that the message is still
> in the map instead of only checking wasProcessed().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
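
The issue description closes by suggesting that the recovery wait should also
check that the message is still in the map, rather than only checking
wasProcessed(). A minimal, self-contained sketch of that idea follows. It is
not the committed Geode fix: RecoveryWaitSketch, CommitMessage, Tracker,
removeAbandoned and isStillTracked are hypothetical stand-ins for
TXCommitMessage and TXFarSideCMTracker, and the 100 ms bounded wait is an
assumption added as a safety net.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: hypothetical stand-ins, not Geode's actual classes. */
class RecoveryWaitSketch {

  /** Stand-in for TXCommitMessage; only the "processed" flag is modeled. */
  static final class CommitMessage {
    private boolean processed;

    synchronized boolean wasProcessed() {
      return processed;
    }

    synchronized void markProcessed() {
      processed = true;
      notifyAll();
    }
  }

  /** Stand-in for TXFarSideCMTracker. */
  static final class Tracker {
    private final Map<Object, CommitMessage> txInProgress = new HashMap<>();

    void add(Object lockKey, CommitMessage mess) {
      synchronized (txInProgress) {
        txInProgress.put(lockKey, mess);
      }
    }

    /** Models the membership listener dropping a message whose originator crashed. */
    void removeAbandoned(Object lockKey) {
      CommitMessage mess;
      synchronized (txInProgress) {
        mess = txInProgress.remove(lockKey);
      }
      if (mess != null) {
        synchronized (mess) {
          mess.notifyAll(); // wake waiters so they re-check the map
        }
      }
    }

    private boolean isStillTracked(Object lockKey, CommitMessage mess) {
      synchronized (txInProgress) {
        return txInProgress.get(lockKey) == mess;
      }
    }

    /** Waits until the message is processed OR is no longer in the map. */
    void waitToProcess(Object lockKey) throws InterruptedException {
      CommitMessage mess;
      synchronized (txInProgress) {
        mess = txInProgress.get(lockKey);
      }
      if (mess == null) {
        return; // nothing in progress for this key
      }
      synchronized (mess) {
        while (!mess.wasProcessed() && isStillTracked(lockKey, mess)) {
          mess.wait(100); // bounded wait (assumption): re-check periodically
        }
      }
    }
  }

  /** Returns true if a waiter is released once its message leaves the map. */
  static boolean waiterReturnsAfterRemoval() {
    Tracker tracker = new Tracker();
    Object lockKey = "lock-1";
    tracker.add(lockKey, new CommitMessage());
    Thread waiter = new Thread(() -> {
      try {
        tracker.waitToProcess(lockKey);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    waiter.setDaemon(true);
    waiter.start();
    tracker.removeAbandoned(lockKey); // message is never marked processed
    try {
      waiter.join(5_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !waiter.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("waiter returned: " + waiterReturnsAfterRemoval());
  }
}
```

In this sketch the waiter takes the map lock only while already holding the
message monitor, and removeAbandoned never holds both locks at once, so the two
paths cannot deadlock against each other. Whether the same ordering is safe in
the real TXFarSideCMTracker would need to be checked against its other callers.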