[ https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Owen Nichols closed GEODE-10403.
--------------------------------

> Distributed deadlock when stopping gateway sender
> -------------------------------------------------
>
>                 Key: GEODE-10403
>                 URL: https://issues.apache.org/jira/browse/GEODE-10403
>             Project: Geode
>          Issue Type: Bug
>          Components: wan
>    Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>            Reporter: Alberto Gomez
>            Assignee: Alberto Gomez
>            Priority: Major
>              Labels: needsTriage, pull-request-available
>             Fix For: 1.15.1, 1.16.0
>
>
> A distributed deadlock was found while testing a Geode system with WAN
> replication: the gateway sender was stopped while a fair number of
> operations were being sent to the servers.
> The deadlock manifests as the gateway sender stop command hanging forever
> and all normal Geode operations from clients (gets, puts, ...) going
> unanswered.
> The situation is provoked by the gateway sender stop command, which first
> takes the lifecycle lock and then, at a given point, tries to retrieve the
> size of the gateway sender queue. This operation, which requires
> communication with the other peers, never finishes, probably because the
> response from one of the peers is never received.
> Another thread is blocked trying to acquire the lifecycle lock in
> AbstractGatewaySender.distribute().
> Finally, many threads handling Geode operations (get, put, ...) are blocked
> in the DistributedCacheOperation._distribute() call, waiting for a response
> from another peer.
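The blocking pattern described above can be illustrated with a small, self-contained Java sketch. It is not Geode code: the class name, the method name and the timeouts are hypothetical. It assumes that the reply the stop command waits for can only be produced by a thread that must first take the read side of the same lifecycle lock, which is one plausible reading of the report and not something the report confirms. Timeouts are used so that the demonstration terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LifecycleLockSketch {

    /**
     * Simulates a "stop" thread that takes the lifecycle write lock and waits
     * for a reply. Returns true if the reply arrived before the timeout.
     * All names here are illustrative assumptions, not Geode APIs.
     */
    static boolean stopGetsReply(boolean waitWhileHoldingLock, long timeoutMs) {
        ReentrantReadWriteLock lifecycleLock = new ReentrantReadWriteLock();
        CountDownLatch reply = new CountDownLatch(1);
        CountDownLatch stopHasLock = new CountDownLatch(1);

        // Stand-in for the message reader: like distribute(), it needs the
        // read lock before it can make progress and deliver the reply.
        Thread reader = new Thread(() -> {
            try {
                stopHasLock.await();
                // Bounded wait so the sketch cannot hang forever.
                if (lifecycleLock.readLock().tryLock(2 * timeoutMs, TimeUnit.MILLISECONDS)) {
                    try {
                        reply.countDown();
                    } finally {
                        lifecycleLock.readLock().unlock();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.setDaemon(true);
        reader.start();

        try {
            boolean replied = false;
            lifecycleLock.writeLock().lock();
            try {
                stopHasLock.countDown();
                if (waitWhileHoldingLock) {
                    // Problematic ordering: wait for the reply with the lock held.
                    replied = reply.await(timeoutMs, TimeUnit.MILLISECONDS);
                }
            } finally {
                lifecycleLock.writeLock().unlock();
            }
            if (!waitWhileHoldingLock) {
                // Alternative ordering: wait only after the lock is released.
                replied = reply.await(timeoutMs, TimeUnit.MILLISECONDS);
            }
            reader.join();
            return replied;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("wait while holding lock: reply received = "
            + stopGetsReply(true, 200));
        System.out.println("wait after releasing lock: reply received = "
            + stopGetsReply(false, 2000));
    }
}
```

With the lock held during the wait, the reader cannot take the read lock, so the reply never arrives within the timeout. When the wait happens after the lock is released, the reply arrives. The second ordering is shown only for contrast and is not necessarily how the issue was fixed in Geode.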
>
> Thread dump section from the blocked gateway sender stop command, in the
> call to get the size of the queue:
> {{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x00007f92bc1c2000 nid=0x2154 waiting on condition [0x00007f9179cd2000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{    at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{    - parking to wait for <0x000000031ca2be50> (a java.util.concurrent.CountDownLatch$Sync)}}
> {{    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{    at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{    at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{    at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{    at org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
> {{    at org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
> {{    at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
> {{    at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
> {{    at org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
> {{    at org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
> {{    at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
> {{    at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
> {{    at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
> {{    at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
> {{    at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)}}
> {{    at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)}}
> {{    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)}}
> {{    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)}}
> {{    at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
>
> Thread dump section from the blocked call to
> AbstractGatewaySender.distribute(), trying to acquire the lifecycle lock:
> {{"P2P message reader for 192.168.78.164(eric-data-kvdb-ag-server-0:1)<v31>:41000 shared ordered uid=6 local port=60360 remote port=57246" #56 daemon prio=10 os_prio=0 cpu=462104.83ms elapsed=7095.02s tid=0x00007f93a8007800 nid=0x50 waiting on condition [0x00007f93e59d0000]}}
> {{   java.lang.Thread.State: WAITING (parking)}}
> {{    at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{    - parking to wait for <0x00000000ed9cb9f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)}}
> {{    at java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)}}
> {{    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.11/AbstractQueuedSynchronizer.java:885)}}
> {{    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1009)}}
> {{    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1324)}}
> {{    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11.0.11/ReentrantReadWriteLock.java:738)}}
> {{    at org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1104)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6144)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6108)}}
> {{    at org.apache.geode.internal.cache.BucketRegion.notifyGatewaySender(BucketRegion.java:719)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5775)}}
> {{    at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:704)}}
> {{    at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$515/0x00000008006e0440.run(Unknown Source)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)}}
> {{    - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)}}
> {{    - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006e0040.run(Unknown Source)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006ca440.run(Unknown Source)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)}}
> {{    at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)}}
> {{    at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)}}
> {{    at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)}}
> {{    at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)}}
> {{    at org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:194)}}
> {{    at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:307)}}
> {{    at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:278)}}
> {{    at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1208)}}
> {{    at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1110)}}
> {{    at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:376)}}
> {{    at org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)}}
> {{    at org.apache.geode.distributed.internal.ClusterDistributionManager.scheduleIncomingMessage(ClusterDistributionManager.java:2060)}}
> {{    at org.apache.geode.distributed.internal.ClusterDistributionManager.handleIncomingDMsg(ClusterDistributionManager.java:1826)}}
> {{    at org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$178/0x0000000800380440.messageReceived(Unknown Source)}}
> {{    at org.apache.geode.distributed.internal.membership.gms.GMSMembership.dispatchMessage(GMSMembership.java:936)}}
> {{    at org.apache.geode.distributed.internal.membership.gms.GMSMembership.handleOrDeferMessage(GMSMembership.java:867)}}
> {{    at org.apache.geode.distributed.internal.membership.gms.GMSMembership.processMessage(GMSMembership.java:1209)}}
> {{    at org.apache.geode.distributed.internal.DistributionImpl$MyDCReceiver.messageReceived(DistributionImpl.java:828)}}
> {{    at org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:614)}}
> {{    at org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:679)}}
> {{    at org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3261)}}
> {{    at org.apache.geode.internal.tcp.Connection.readMessage(Connection.java:2988)}}
> {{    at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2794)}}
> {{    at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1648)}}
> {{    at org.apache.geode.internal.tcp.Connection.run(Connection.java:1479)}}
> {{    at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
>
> Thread dump section from one of the blocked calls to
> DistributedCacheOperation._distribute(), waiting for a response from a
> remote peer:
> {{"ServerConnection on port 40404 Thread 2" #88 daemon prio=5 os_prio=0 cpu=81268.62ms elapsed=7050.08s tid=0x00007f8160001800 nid=0x73 waiting on condition [0x00007f8196f57000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{    at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{    - parking to wait for <0x000000031befad38> (a java.util.concurrent.CountDownLatch$Sync)}}
> {{    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{    at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{    at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{    at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{    at org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:779)}}
> {{    at org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:676)}}
> {{    at org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:277)}}
> {{    at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:694)}}
> {{    at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006c9c40.run(Unknown Source)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)}}
> {{    - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)}}
> {{    - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006c9840.run(Unknown Source)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$512/0x00000008006ca440.run(Unknown Source)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)}}
> {{    at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)}}
> {{    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)}}
> {{    at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)}}
> {{    at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5578)}}
> {{    at org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1213)}}
> {{    at org.apache.geode.internal.cache.PartitionedRegion.putInBucket(PartitionedRegion.java:3005)}}
> {{    at org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2215)}}
> {{    at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5531)}}
> {{    at org.apache.geode.internal.cache.LocalRegion.basicBridgePut(LocalRegion.java:5210)}}
> {{    at org.apache.geode.internal.cache.tier.sockets.command.Put65.cmdExecute(Put65.java:411)}}
> {{    at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:183)}}
> {{    at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:848)}}
> {{    at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:72)}}
> {{    at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1181)}}
> {{    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)}}
> {{    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)}}
> {{    at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:691)}}
> {{    at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl$$Lambda$495/0x00000008006be440.invoke(Unknown Source)}}
> {{    at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:120)}}
> {{    at org.apache.geode.logging.internal.executors.LoggingThreadFactory$$Lambda$166/0x000000080034c040.run(Unknown Source)}}
> {{    at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)