Eric Shu created GEODE-5186:
-------------------------------

             Summary: set operation in a client transaction could cause the transaction to hang
                 Key: GEODE-5186
                 URL: https://issues.apache.org/jira/browse/GEODE-5186
             Project: Geode
          Issue Type: Bug
          Components: transactions
            Reporter: Eric Shu

During an entry operation in a client transaction, the server connection could be lost. In this case, the client fails over to another server, tries to resume the transaction and, if the original transaction host node is found, retries the operation. If this operation happens to be a keySet operation (or another set operation) on a partitioned region, the transaction could hang due to a deadlock.

The scenario is as follows. The original tx host node holds its transactional lock while sending a FetchKeysMessage to the other nodes hosting the partitioned region data. The node to which the client transaction failed over holds its own transactional lock while sending a FetchKeysMessage to the transaction host node. Neither FetchKeysMessage can be processed, because processing these tx messages requires acquiring the transactional lock, and both locks are already held by the threads handling the client messages of the transaction.

{noformat}
vm_6_bridge7_latvia_25133:PartitionedRegion Message Processor10 ID=0xe2(226) state=WAITING
  waiting to lock <java.util.concurrent.locks.ReentrantLock$NonfairSync@453d49bb>
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.apache.geode.internal.cache.TXManagerImpl.getLock(TXManagerImpl.java:921)
    at org.apache.geode.internal.cache.TXManagerImpl.masqueradeAs(TXManagerImpl.java:881)
    at org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:332)
    at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:378)
    at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:444)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
    at org.apache.geode.distributed.internal.ClusterDistributionManager$8$1.run(ClusterDistributionManager.java:945)
    at java.lang.Thread.run(Thread.java:745)
  Locked synchronizers:
    java.util.concurrent.ThreadPoolExecutor$Worker@c84d7d4

vm_6_bridge7_latvia_25133:ServerConnection on port 23931 Thread 10 ID=0x128(296) state=TIMED_WAITING
  waiting to lock <java.util.concurrent.CountDownLatch$Sync@226dbb4>
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
    at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:61)
    at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:790)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:766)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:853)
    at org.apache.geode.internal.cache.partitioned.FetchKeysMessage$FetchKeysResponse.waitForKeys(FetchKeysMessage.java:541)
    at org.apache.geode.internal.cache.PartitionedRegion.getBucketKeys(PartitionedRegion.java:4342)
    at org.apache.geode.internal.cache.TXStateStub.getBucketKeys(TXStateStub.java:644)
    at org.apache.geode.internal.cache.TXStateProxyImpl.getBucketKeys(TXStateProxyImpl.java:730)
    at org.apache.geode.internal.cache.PartitionedRegion$KeysSet$KeysSetIterator.getNextBucketIter(PartitionedRegion.java:6066)
    at org.apache.geode.internal.cache.PartitionedRegion$KeysSet$KeysSetIterator.hasNext(PartitionedRegion.java:6024)
    at java.util.Collections$UnmodifiableCollection$1.hasNext(Collections.java:1041)
    at org.apache.geode.internal.cache.tier.sockets.command.KeySet.fillAndSendKeySetResponseChunks(KeySet.java:168)
    at org.apache.geode.internal.cache.tier.sockets.command.KeySet.cmdExecute(KeySet.java:126)
    at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:157)
    at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMsg(ServerConnection.java:869)
    at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:77)
    at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1248)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl$4$1.run(AcceptorImpl.java:644)
    at java.lang.Thread.run(Thread.java:745)
  Locked synchronizers:
    java.util.concurrent.ThreadPoolExecutor$Worker@3ca60534
    java.util.concurrent.locks.ReentrantLock$NonfairSync@453d49bb

vm_0_bridge1_latvia_25064:PartitionedRegion Message Processor4 ID=0x2b8(696) state=WAITING
  waiting to lock <java.util.concurrent.locks.ReentrantLock$NonfairSync@33b1b785>
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.apache.geode.internal.cache.TXManagerImpl.getLock(TXManagerImpl.java:921)
    at org.apache.geode.internal.cache.TXManagerImpl.masqueradeAs(TXManagerImpl.java:881)
    at org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:332)
    at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:378)
    at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:444)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
    at org.apache.geode.distributed.internal.ClusterDistributionManager$8$1.run(ClusterDistributionManager.java:945)
    at java.lang.Thread.run(Thread.java:745)
  Locked synchronizers:
    java.util.concurrent.ThreadPoolExecutor$Worker@71b1b4c5

vm_0_bridge1_latvia_25064:ServerConnection on port 24946 Thread 0 ID=0x29b(667) state=TIMED_WAITING
  waiting to lock <java.util.concurrent.CountDownLatch$Sync@41e6d28f>
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
    at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:61)
    at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:790)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:766)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:853)
    at org.apache.geode.internal.cache.partitioned.FetchKeysMessage$FetchKeysResponse.waitForKeys(FetchKeysMessage.java:541)
    at org.apache.geode.internal.cache.PartitionedRegion.getBucketKeys(PartitionedRegion.java:4342)
    at org.apache.geode.internal.cache.TXState.getBucketKeys(TXState.java:1852)
    at org.apache.geode.internal.cache.TXStateProxyImpl.getBucketKeys(TXStateProxyImpl.java:730)
    at org.apache.geode.internal.cache.PartitionedRegion$KeysSet$KeysSetIterator.getNextBucketIter(PartitionedRegion.java:6066)
    at org.apache.geode.internal.cache.PartitionedRegion$KeysSet$KeysSetIterator.hasNext(PartitionedRegion.java:6024)
    at java.util.Collections$UnmodifiableCollection$1.hasNext(Collections.java:1041)
    at org.apache.geode.internal.cache.tier.sockets.command.KeySet.fillAndSendKeySetResponseChunks(KeySet.java:168)
    at org.apache.geode.internal.cache.tier.sockets.command.KeySet.cmdExecute(KeySet.java:126)
    at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:157)
    at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMsg(ServerConnection.java:869)
    at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:77)
    at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1248)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl$4$1.run(AcceptorImpl.java:644)
    at java.lang.Thread.run(Thread.java:745)
  Locked synchronizers:
    java.util.concurrent.locks.ReentrantLock$NonfairSync@33b1b785
    java.util.concurrent.ThreadPoolExecutor$Worker@51e84752
{noformat}
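
The lock cycle described in this issue can be reproduced in miniature outside of Geode. The following is only a rough sketch under stated assumptions: `Node`, `processRemoteFetchKeys` and `keySetHoldingLock` are hypothetical names, not Geode classes or methods, and a timed `tryLock` stands in for the blocking lock acquisition so that the demo terminates instead of hanging.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class TxLockDeadlockSketch {

  /** Hypothetical stand-in for a server node; NOT a Geode class. */
  static final class Node {
    final ReentrantLock txLock = new ReentrantLock();
    // Plays the role of the "PartitionedRegion Message Processor" thread.
    final ExecutorService messageProcessor = Executors.newSingleThreadExecutor();

    /** Processing a remote tx message requires this node's tx lock first. */
    Future<Boolean> processRemoteFetchKeys(long timeoutMs) {
      return messageProcessor.submit(() -> {
        // Assumption: a timed tryLock replaces the real blocking lock().
        if (!txLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
          return false; // lock is held by the client-request thread
        }
        try {
          return true; // the message would be processed here
        } finally {
          txLock.unlock();
        }
      });
    }
  }

  /** Plays the "ServerConnection" thread: hold own tx lock, wait for the peer's reply. */
  static boolean keySetHoldingLock(Node self, Node peer, CyclicBarrier barrier,
      long timeoutMs) throws Exception {
    self.txLock.lock();
    try {
      barrier.await(); // both nodes now hold their own tx lock
      boolean served = peer.processRemoteFetchKeys(timeoutMs).get();
      barrier.await(); // keep holding the lock until both replies are decided
      return served;
    } finally {
      self.txLock.unlock();
    }
  }

  /** Returns {txHost request served, failover request served}. */
  static boolean[] run(long timeoutMs) throws Exception {
    Node txHost = new Node();
    Node failover = new Node();
    CyclicBarrier barrier = new CyclicBarrier(2);
    ExecutorService clients = Executors.newFixedThreadPool(2);
    try {
      Future<Boolean> a =
          clients.submit(() -> keySetHoldingLock(txHost, failover, barrier, timeoutMs));
      Future<Boolean> b =
          clients.submit(() -> keySetHoldingLock(failover, txHost, barrier, timeoutMs));
      return new boolean[] {a.get(), b.get()};
    } finally {
      clients.shutdown();
      txHost.messageProcessor.shutdown();
      failover.messageProcessor.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    boolean[] served = run(200);
    System.out.println("txHost request served: " + served[0]);
    System.out.println("failover request served: " + served[1]);
  }
}
```

In this model neither request is served: each message processor times out waiting for a lock that the peer's request thread releases only after its own reply arrives. With an untimed `lock()`, as in the thread dump, the same cycle would never resolve.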