[jira] [Commented] (GEODE-5155) hang recovering transaction state for crashed server

ASF subversion and git services (JIRA) Wed, 02 May 2018 10:27:45 -0700

    [ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461340#comment-16461340 ]


ASF subversion and git services commented on GEODE-5155:
--------------------------------------------------------

Commit 2c1b8a4edd99c3b5d25697a08b917de3310c31ae in geode's branch 
refs/heads/develop from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=2c1b8a4 ]

GEODE-5155 hang recovering transaction state for crashed server

fixing an NPE caused by the test creating a new cache in between
starting a transaction and installing a test hook.


> hang recovering transaction state for crashed server
> ----------------------------------------------------
>
>                 Key: GEODE-5155
>                 URL: https://issues.apache.org/jira/browse/GEODE-5155
>             Project: Geode
>          Issue Type: Bug
>          Components: distributed lock service, transactions
>    Affects Versions: 1.7.0
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> A Concourse job failed in
> DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with
> two threads stuck in this state:
> {noformat}
> [vm2] "Pooled Waiting Message Processor 2" tid=0x71
> [vm2]     java.lang.Thread.State: WAITING
> [vm2]         at java.lang.Object.wait(Native Method)
> [vm2]         -  waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
> [vm2]         at java.lang.Object.wait(Object.java:502)
> [vm2]         at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
> [vm2]         at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
> [vm2]         at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
> [vm2]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [vm2]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [vm2]         at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
> [vm2]         at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
> [vm2]         at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
> [vm2]         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I modified the test to tighten up its forceDisconnect and performOps methods
> so that transaction recovery happens more reliably.
> {code}
>   public void forceDisconnect() throws Exception {
>     Cache existingCache = basicGetCache();
>     synchronized(commitLock) {
>       committing = false;
>       while (!committing) {
>         commitLock.wait();
>       }
>     }
>     if (existingCache != null && !existingCache.isClosed()) {
>       DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
>     }
>   }
>   public void performOps() {
>     Cache cache = getCache();
>     Region region = cache.getRegion("TestRegion");
>     DistributedLockService dlockService = DistributedLockService.getServiceNamed("Bulldog");
>     Random random = new Random();
>     while (!cache.isClosed()) {
>       boolean locked = false;
>       try {
>         locked = dlockService.lock("testDLock", 500, 60_000);
>         if (!locked) {
>           // this could happen if we're starved out for 30sec by other VMs
>           continue;
>         }
>         cache.getCacheTransactionManager().begin();
>         region.put("TestKey", "TestValue" + random.nextInt(100000));
>         TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager();
>         TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
>         TXState txState = (TXState) txProxy.getRealDeal(null, null);
>         txState.setBeforeSend(() -> {
>           synchronized(commitLock) {
>             committing = true;
>             commitLock.notifyAll();
>           }});
>         try {
>           cache.getCacheTransactionManager().commit();
>         } catch (CommitConflictException e) {
>           throw new RuntimeException("dlock failed to prevent a transaction conflict", e);
>         }
>         int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
>         getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);
>       } catch (CancelException | IllegalStateException e) {
>         // okay to ignore
>       } finally {
>         if (locked) {
>           try {
>             dlockService.unlock("testDLock");
>           } catch (CancelException | IllegalStateException e) {
>             // shutting down
>           }
>         }
>       }
>     }
>   }
> {code}
> The problem is that the membership listener in TXCommitMessage is removing 
> itself from the transaction map in TXFarSideCMTracker without setting any 
> state that the recovery message can check.  The recovery method is waiting 
> like this:
> {code}
>     synchronized (this.txInProgress) {
>       mess = (TXCommitMessage) this.txInProgress.get(lk);
>     }
>     if (mess != null) {
>       synchronized (mess) {
>         // tx in progress, we must wait until its done
>         while (!mess.wasProcessed()) {
>           try {
>             mess.wait();
>           } catch (InterruptedException ie) {
>             Thread.currentThread().interrupt();
>             logger.error(LocalizedMessage.create(
>                 LocalizedStrings.TxFarSideTracker_WAITING_TO_COMPLETE_ON_MESSAGE_0_CAUGHT_AN_INTERRUPTED_EXCEPTION,
>                 mess), ie);
>             break;
>           }
>         }
>       }
> {code}
> We could probably change this method to make sure that the message is still 
> in the map instead of only checking wasProcessed().
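
The change suggested in the last paragraph of the description could look
roughly like the sketch below. It is a self-contained illustration only, not
Geode code: the Message class, RecoveryWaitSketch, removeOnDeparture and the
100 ms timed wait are all assumptions made for the example, and the real
TXFarSideCMTracker may need a different shape. The idea is that the wait loop
exits when the message has been processed or when it is no longer the entry
tracked in the map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the tracker; not the Geode TXFarSideCMTracker.
class RecoveryWaitSketch {

  // Hypothetical stand-in for TXCommitMessage.
  static final class Message {
    private boolean processed;

    synchronized boolean wasProcessed() {
      return processed;
    }

    synchronized void markProcessed() {
      processed = true;
      notifyAll();
    }
  }

  private final Map<Object, Message> txInProgress = new HashMap<>();

  void add(Object lockId, Message mess) {
    synchronized (txInProgress) {
      txInProgress.put(lockId, mess);
    }
  }

  // Simulates the membership listener dropping the entry when the
  // originator departs. Waking waiters here is an assumption of the sketch.
  void removeOnDeparture(Object lockId) {
    Message mess;
    synchronized (txInProgress) {
      mess = txInProgress.remove(lockId);
    }
    if (mess != null) {
      synchronized (mess) {
        mess.notifyAll();
      }
    }
  }

  private boolean isStillTracked(Object lockId, Message mess) {
    synchronized (txInProgress) {
      return txInProgress.get(lockId) == mess;
    }
  }

  // Returns once the message is processed OR is no longer in the map.
  void waitToProcess(Object lockId) throws InterruptedException {
    Message mess;
    synchronized (txInProgress) {
      mess = txInProgress.get(lockId);
    }
    if (mess == null) {
      return;
    }
    synchronized (mess) {
      while (!mess.wasProcessed() && isStillTracked(lockId, mess)) {
        // Timed wait so a removal that never notifies is still noticed.
        mess.wait(100);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    RecoveryWaitSketch tracker = new RecoveryWaitSketch();
    tracker.add("lock-1", new Message());
    Thread waiter = new Thread(() -> {
      try {
        tracker.waitToProcess("lock-1");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    waiter.start();
    tracker.removeOnDeparture("lock-1"); // originator "crashes"
    waiter.join(5000);
    System.out.println("waiter returned: " + !waiter.isAlive());
  }
}
```

In this sketch the waiter takes the message monitor first and the map monitor
second, while removeOnDeparture releases the map monitor before touching the
message, so the two paths never hold the locks in opposite order.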



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
