[ 
https://issues.apache.org/jira/browse/GEODE-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197081#comment-17197081
 ] 

ASF subversion and git services commented on GEODE-8473:
--------------------------------------------------------

Commit c48c0c378f90bb2912e018856a1f6e3a46a610e8 in geode's branch 
refs/heads/develop from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=c48c0c3 ]

GEODE-8473: Hang in ReplyProcessor21 when forced-disconnect does not establish 
a cancellation cause (#5491)

Ensure that the cache is informed of a forced-disconnect in the
DisconnectThread.  This is a follow-on commit to GEODE-8467, which
ensured that the DisconnectThread is launched even when cache XML
generation fails.  This commit adds a try/catch in
GMSMembership.uncleanShutdown() to ensure that the upstream
ClusterDistributionManager is informed of the failure so it can set the
"rootCause" in its CancelCriterion.  ReplyProcessor21 and other objects
that poll for this "rootCause" will then be released from waiting for
responses to messages sent to other members of the cluster.
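
The shape of the fix described in the commit message can be illustrated
roughly as follows.  This is a minimal sketch, not the code from the
commit: every type here (SketchFailureListener, SketchDistributionManager,
SketchMembership) and the localCleanup step are hypothetical stand-ins,
and the real GMSMembership.uncleanShutdown() may be structured
differently.  It only shows the idea that a failure during shutdown work
must not prevent the upward notification that records the "rootCause".

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical upward-notification callback; not Geode's real listener interface. */
interface SketchFailureListener {
  void membershipFailure(String reason, Throwable cause);
}

/** Hypothetical stand-in for the manager that owns the cancellation "rootCause". */
final class SketchDistributionManager implements SketchFailureListener {
  private final AtomicReference<Throwable> rootCause = new AtomicReference<>();

  @Override
  public void membershipFailure(String reason, Throwable cause) {
    // Keep the first cause that is reported; later reports are ignored.
    rootCause.compareAndSet(null, cause);
  }

  Throwable getRootCause() {
    return rootCause.get();
  }
}

/** Hypothetical stand-in for the membership layer. */
final class SketchMembership {
  private final SketchFailureListener listener;
  private final Runnable localCleanup; // assumed shutdown step that may throw

  SketchMembership(SketchFailureListener listener, Runnable localCleanup) {
    this.listener = listener;
    this.localCleanup = localCleanup;
  }

  void uncleanShutdown(String reason, Throwable cause) {
    try {
      localCleanup.run();
    } catch (RuntimeException e) {
      // Report the cleanup failure, but do not let it skip the notification.
      System.err.println("cleanup failed during forced-disconnect: " + e);
    }
    // Reached whether or not cleanup threw a RuntimeException.
    listener.membershipFailure(reason, cause);
  }
}
```

In this sketch the manager still learns the cause when localCleanup
throws, so anything polling its "rootCause" can stop waiting.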

> Hang in ReplyProcessor21 when forced-disconnect does not establish a 
> cancellation cause
> ---------------------------------------------------------------------------------------
>
>                 Key: GEODE-8473
>                 URL: https://issues.apache.org/jira/browse/GEODE-8473
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.13.0
>            Reporter: Bruce J Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>
> I suspect this is due to the recent Membership refactoring.  In a test that 
> exposed GEODE-8467 I saw an application thread from before the 
> forced-disconnect still hanging around waiting for a response.
> {noformat}
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000ea5c43c0> (a java.util.concurrent.CountDownLatch$Sync)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>         at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>         at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
>         at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
>         at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
>         at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
>         at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
>         at org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)
>         at org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6752)
>         at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6703)
>         at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6685)
>         at org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6657)
>         at org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)
>         at org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)
>         at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8288)
>         at util.TestHelper.getRegionStr(TestHelper.java:1669)
>         at util.TestHelper.regionHierarchyToString(TestHelper.java:1654)
>         at util.TestHelper.logRegionHierarchy(TestHelper.java:1639)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at hydra.MethExecutor.execute(MethExecutor.java:173)
>         at hydra.MethExecutor.execute(MethExecutor.java:141)
>         at hydra.TestTask.execute(TestTask.java:197)
>         at hydra.RemoteTestModule$1.run(RemoteTestModule.java:213)
> {noformat}
> ReplyProcessor21 uses a StoppableCountDownLatch to wait for a response.  This 
> latch loops waiting for the countdown but also checks 
> ClusterDistributionManager's CancelCriterion to see if the system is shutting 
> down.  If so, it stops waiting for a response.
> Due to GEODE-8467, the thread that sets the CancelCriterion's shutdown 
> "rootCause" is never started.  Either Membership needs to ensure that this 
> upward notification happens, or ClusterDistributionManager's CancelCriterion 
> needs to check with the Services.Stopper in GMSMembership to see if a 
> "rootCause" has been established there.
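
The waiting behavior described in the issue (a latch that loops on a
timed wait and checks a cancel criterion between attempts) can be
sketched as below.  This is an illustrative sketch only, not Geode's
StoppableCountDownLatch or CancelCriterion: the class names, the
retryMillis parameter and the use of IllegalStateException are all
assumptions made for the example.  It shows why the waiter in the stack
trace stays parked in TIMED_WAITING indefinitely when no "rootCause" is
ever set, and why setting one releases it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a cancel criterion; not Geode's CancelCriterion API. */
final class SketchCancelCriterion {
  private final AtomicReference<Throwable> rootCause = new AtomicReference<>();

  /** Records the first shutdown cause; later calls are ignored. */
  void cancel(Throwable cause) {
    rootCause.compareAndSet(null, cause);
  }

  /** Throws if a shutdown cause has been recorded. */
  void checkCancelInProgress() {
    Throwable cause = rootCause.get();
    if (cause != null) {
      throw new IllegalStateException("system is shutting down", cause);
    }
  }
}

/** Hypothetical latch that polls the criterion between bounded waits. */
final class SketchStoppableLatch {
  private final CountDownLatch latch;
  private final SketchCancelCriterion stopper;
  private final long retryMillis; // assumed polling interval

  SketchStoppableLatch(SketchCancelCriterion stopper, int count, long retryMillis) {
    this.stopper = stopper;
    this.latch = new CountDownLatch(count);
    this.retryMillis = retryMillis;
  }

  void countDown() {
    latch.countDown();
  }

  /** Returns once the count reaches zero; throws if cancellation is seen first. */
  void await() throws InterruptedException {
    for (;;) {
      stopper.checkCancelInProgress();
      if (latch.await(retryMillis, TimeUnit.MILLISECONDS)) {
        return;
      }
      // Timed out without a countdown: loop and re-check the criterion.
    }
  }
}

final class StoppableLatchDemo {
  public static void main(String[] args) throws InterruptedException {
    SketchCancelCriterion stopper = new SketchCancelCriterion();
    SketchStoppableLatch latch = new SketchStoppableLatch(stopper, 1, 50);

    Thread waiter = new Thread(() -> {
      try {
        latch.await(); // no reply ever arrives in this demo
      } catch (IllegalStateException e) {
        System.out.println("waiter released: " + e.getCause().getMessage());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    waiter.start();

    // Without this call the waiter would keep looping forever.
    stopper.cancel(new RuntimeException("forced disconnect"));
    waiter.join();
  }
}
```

If the cancel() call in the demo is removed, the waiter thread never
exits, which mirrors the hang reported here.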
