[ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen Nichols closed GEODE-10403.
--------------------------------

> Distributed deadlock when stopping gateway sender
> -------------------------------------------------
>
>                 Key: GEODE-10403
>                 URL: https://issues.apache.org/jira/browse/GEODE-10403
>             Project: Geode
>          Issue Type: Bug
>          Components: wan
>    Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>            Reporter: Alberto Gomez
>            Assignee: Alberto Gomez
>            Priority: Major
>              Labels: needsTriage, pull-request-available
>             Fix For: 1.15.1, 1.16.0
>
>
> A distributed deadlock has been found during some tests of a Geode system 
> with WAN replication when stopping the gateway sender while sending a fair 
> amount of operations to the servers.
> The distributed deadlock manifests in the gateway sender stop command hanging 
> forever and by all normal Geode operations from clients (gets, puts,...) not 
> being responded.
> The situation is provoked by the Gateway sender stop command that first takes 
> the lifecycle lock and then, at a given point, tries to retrieve the size of 
> the gateway sender. This operation, that requires communication with the 
> other peers never finishes, probably because the response from one of the 
> peers is never received.
> Another thread is blocked when trying to acquire the lifecycle lock in 
> AbstractGatewaySender.distribute().
> Finally many threads handling Geode operations (get, put...) get blocked in 
> the DistributedCacheOperation._distribute() call waiting for a response from 
> another peer.
> Thread dump section from blocked gateway sender stop command in call to get 
> size of queue:
> {{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 
> daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x00007f92bc1c2000 
> nid=0x2154 waiting on condition  [0x00007f9179cd2000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x000000031ca2be50> (a 
> java.util.concurrent.CountDownLatch$Sync)}}
> {{        at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at 
> org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
> {{        at 
> org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)}}
> {{        at 
> java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)}}
> {{        at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)}}
> {{        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)}}
> {{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
>  
> Thread dump section from blocked call to AbstractGatewaySender.distribute() 
> call trying to acquire the lifecycle lock:
> {{"P2P message reader for 
> 192.168.78.164(eric-data-kvdb-ag-server-0:1)<v31>:41000 shared ordered uid=6 
> local port=60360 remote port=57246" #56 daemon prio=10 os_prio=0 
> cpu=462104.83ms elapsed=7095.02s tid=0x00007f93a8007800 nid=0x50 waiting on 
> condition  [0x00007f93e59d0000]}}
> {{   java.lang.Thread.State: WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x00000000ed9cb9f0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)}}
> {{        at 
> java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.11/AbstractQueuedSynchronizer.java:885)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1009)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1324)}}
> {{        at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11.0.11/ReentrantReadWriteLock.java:738)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1104)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6144)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6108)}}
> {{        at 
> org.apache.geode.internal.cache.BucketRegion.notifyGatewaySender(BucketRegion.java:719)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5775)}}
> {{        at 
> org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:704)}}
> {{        at 
> org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$515/0x00000008006e0440.run(Unknown
>  Source)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)}}
> {{        - locked <0x0000000136123330> (a 
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)}}
> {{        - locked <0x0000000136123330> (a 
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006e0040.run(Unknown
>  Source)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006ca440.run(Unknown
>  Source)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)}}
> {{        at 
> org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)}}
> {{        at 
> org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)}}
> {{        at 
> org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)}}
> {{        at 
> org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:194)}}
> {{        at 
> org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:307)}}
> {{        at 
> org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:278)}}
> {{        at 
> org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1208)}}
> {{        at 
> org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1110)}}
> {{        at 
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:376)}}
> {{        at 
> org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)}}
> {{        at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.scheduleIncomingMessage(ClusterDistributionManager.java:2060)}}
> {{        at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.handleIncomingDMsg(ClusterDistributionManager.java:1826)}}
> {{        at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$178/0x0000000800380440.messageReceived(Unknown
>  Source)}}
> {{        at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.dispatchMessage(GMSMembership.java:936)}}
> {{        at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.handleOrDeferMessage(GMSMembership.java:867)}}
> {{        at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.processMessage(GMSMembership.java:1209)}}
> {{        at 
> org.apache.geode.distributed.internal.DistributionImpl$MyDCReceiver.messageReceived(DistributionImpl.java:828)}}
> {{        at 
> org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:614)}}
> {{        at 
> org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:679)}}
> {{        at 
> org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3261)}}
> {{        at 
> org.apache.geode.internal.tcp.Connection.readMessage(Connection.java:2988)}}
> {{        at 
> org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2794)}}
> {{        at 
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1648)}}
> {{        at 
> org.apache.geode.internal.tcp.Connection.run(Connection.java:1479)}}
> {{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
>  
> Thread dump section from blocked calls to 
> DistributedCacheOperation._distribute() waiting a for a response from a 
> remote peer:
> {{"ServerConnection on port 40404 Thread 2" #88 daemon prio=5 os_prio=0 
> cpu=81268.62ms elapsed=7050.08s tid=0x00007f8160001800 nid=0x73 waiting on 
> condition  [0x00007f8196f57000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x000000031befad38> (a 
> java.util.concurrent.CountDownLatch$Sync)}}
> {{        at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at 
> org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:779)}}
> {{        at 
> org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:676)}}
> {{        at 
> org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:277)}}
> {{        at 
> org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:694)}}
> {{        at 
> org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006c9c40.run(Unknown
>  Source)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)}}
> {{        - locked <0x00000001eb353770> (a 
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)}}
> {{        - locked <0x00000001eb353770> (a 
> org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006c9840.run(Unknown
>  Source)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$512/0x00000008006ca440.run(Unknown
>  Source)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)}}
> {{        at 
> org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)}}
> {{        at 
> org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)}}
> {{        at 
> org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)}}
> {{        at 
> org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5578)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1213)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.putInBucket(PartitionedRegion.java:3005)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2215)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5531)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.basicBridgePut(LocalRegion.java:5210)}}
> {{        at 
> org.apache.geode.internal.cache.tier.sockets.command.Put65.cmdExecute(Put65.java:411)}}
> {{        at 
> org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:183)}}
> {{        at 
> org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:848)}}
> {{        at 
> org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:72)}}
> {{        at 
> org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1181)}}
> {{        at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)}}
> {{        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)}}
> {{        at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:691)}}
> {{        at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorImpl$$Lambda$495/0x00000008006be440.invoke(Unknown
>  Source)}}
> {{        at 
> org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:120)}}
> {{        at 
> org.apache.geode.logging.internal.executors.LoggingThreadFactory$$Lambda$166/0x000000080034c040.run(Unknown
>  Source)}}
> {{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to