[jira] [Updated] (GEODE-5155) hang recovering transaction state for crashed server

2018-04-30 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated GEODE-5155:
--
Labels: pull-request-available  (was: )

> hang recovering transaction state for crashed server
> 
>
> Key: GEODE-5155
> URL: https://issues.apache.org/jira/browse/GEODE-5155
> Project: Geode
>  Issue Type: Bug
>  Components: distributed lock service, transactions
>Affects Versions: 1.7.0
>Reporter: Bruce Schuchardt
>Assignee: Bruce Schuchardt
>Priority: Major
>  Labels: pull-request-available
>

[jira] [Updated] (GEODE-5155) hang recovering transaction state for crashed server

2018-04-30 Thread Bruce Schuchardt (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Schuchardt updated GEODE-5155:

Affects Version/s: 1.7.0


[jira] [Updated] (GEODE-5155) hang recovering transaction state for crashed server

2018-04-30 Thread Bruce Schuchardt (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Schuchardt updated GEODE-5155:

Issue Type: Bug  (was: New Feature)


[jira] [Updated] (GEODE-5155) hang recovering transaction state for crashed server

2018-04-30 Thread Bruce Schuchardt (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Schuchardt updated GEODE-5155:

Description: 
A Concourse CI job failed in DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with two threads stuck in this state:

{noformat}
[vm2] "Pooled Waiting Message Processor 2" tid=0x71
[vm2]   java.lang.Thread.State: WAITING
[vm2]   at java.lang.Object.wait(Native Method)
[vm2]   -  waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
[vm2]   at java.lang.Object.wait(Object.java:502)
[vm2]   at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
[vm2]   at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
[vm2]   at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
[vm2]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[vm2]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[vm2]   at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
[vm2]   at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
[vm2]   at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
[vm2]   at java.lang.Thread.run(Thread.java:748)
{noformat}

I modified the test to tighten up its forceDisconnect and performOps methods so that transaction recovery happens more reliably.

{code}
  public void forceDisconnect() throws Exception {
    Cache existingCache = basicGetCache();
    // wait for performOps() to signal that a commit is in flight before crashing
    synchronized (commitLock) {
      committing = false;
      while (!committing) {
        commitLock.wait();
      }
    }
    if (existingCache != null && !existingCache.isClosed()) {
      DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
    }
  }


  public void performOps() {
    Cache cache = getCache();
    Region region = cache.getRegion("TestRegion");
    DistributedLockService dlockService = DistributedLockService.getServiceNamed("Bulldog");
    Random random = new Random();

    while (!cache.isClosed()) {
      boolean locked = false;
      try {
        // wait up to 500ms for the lock, holding a 60s lease once granted
        locked = dlockService.lock("testDLock", 500, 60_000);
        if (!locked) {
          // this can happen if other VMs hold the lock past our wait timeout
          continue;
        }

        cache.getCacheTransactionManager().begin();

        region.put("TestKey", "TestValue" + random.nextInt(10));

        TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager();
        TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
        TXState txState = (TXState) txProxy.getRealDeal(null, null);
        // signal forceDisconnect() just before the commit message is distributed
        txState.setBeforeSend(() -> {
          synchronized (commitLock) {
            committing = true;
            commitLock.notifyAll();
          }
        });

        try {
          cache.getCacheTransactionManager().commit();
        } catch (CommitConflictException e) {
          throw new RuntimeException("dlock failed to prevent a transaction conflict", e);
        }

        int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
        getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);

      } catch (CancelException | IllegalStateException e) {
        // okay to ignore - the cache is shutting down
      } finally {
        if (locked) {
          try {
            dlockService.unlock("testDLock");
          } catch (CancelException | IllegalStateException e) {
            // shutting down
          }
        }
      }
    }
  }
{code}
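
The commitLock handshake is what makes the failure window reproducible: forceDisconnect() blocks until the transaction's beforeSend callback flips committing to true, so the member is crashed while it holds the dlock and is about to distribute its TXCommitMessage, which is exactly the state the far-side recovery code has to clean up.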

The problem is that the membership listener in TXCommitMessage removes the message from the transaction map in TXFarSideCMTracker without setting any state that the recovery message can check. The recovery method waits like this:

{code}
synchronized (this.txInProgress) {
  mess = (TXCommitMessage) this.txInProgress.get(lk);
}
if (mess != null) {
  synchronized (mess) {
    // tx in progress, we must wait until it's done
    while (!mess.wasProcessed()) {
      try {
        mess.wait();
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        logger.error(LocalizedMessage.create(
            LocalizedStrings.TxFarSideTracker_WAITING_TO_COMPLETE_ON_MESSAGE_0_CAUGHT_AN_INTERRUPTED_EXCEPTION,
            mess), ie);
        break;
      }
    }
  }
}
{code}

We could probably change this method to also verify that the message is still in the txInProgress map instead of relying only on wasProcessed().
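
A minimal sketch of that idea, assuming a hypothetical stillInProgress(lk) helper that re-checks the map under its lock, plus a bounded wait so the removal is eventually noticed (neither exists in the current code):

{code}
synchronized (this.txInProgress) {
  mess = (TXCommitMessage) this.txInProgress.get(lk);
}
if (mess != null) {
  synchronized (mess) {
    // keep waiting only while the message is unprocessed AND still tracked;
    // the membership listener removes it from txInProgress when the
    // originator crashes, which would otherwise leave this wait hanging
    while (!mess.wasProcessed() && stillInProgress(lk)) {
      try {
        mess.wait(1000); // hypothetical bounded wait so the removal is noticed
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
    }
  }
}

// hypothetical helper: re-check membership in the map under its own lock
private boolean stillInProgress(Object lk) {
  synchronized (this.txInProgress) {
    return this.txInProgress.containsKey(lk);
  }
}
{code}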

