[
https://issues.apache.org/jira/browse/GEODE-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428359#comment-17428359
]
Alexander Murmann commented on GEODE-9633:
------------------------------------------
Hi [~alberto.bustamante.reyes]! Thanks for digging this up! What version did
you encounter this on? Did this happen on an already released version or on
develop?
> Region and gateway receiver init order may cause a hang
> -------------------------------------------------------
>
> Key: GEODE-9633
> URL: https://issues.apache.org/jira/browse/GEODE-9633
> Project: Geode
> Issue Type: Bug
> Reporter: Alberto Bustamante Reyes
> Priority: Major
>
> This ticket has been created as suggested on [the dev
> list|https://markmail.org/thread/qq32z5hducjoqndz].
> -----
> I have been analyzing an issue that occurs in the following scenario:
> 1) I start two Geode clusters (cluster1 & cluster2) with one locator and two
> servers each.
> Both clusters host a partitioned region called "testregion", which is
> replicated using a parallel gateway sender and a gateway receiver.
> (These are [the gfsh
> files|https://gist.github.com/alb3rtobr/e230623255632937fa68265f31e97f3a] I
> have been using to create the clusters.)
> 2) I run a client connected to cluster2 that performs operations on testregion.
> 3) I stop cluster1, delete all persistent data, and then create cluster1
> again.
> 4) At this point, the command to create "testregion" gets stuck.
> After checking the thread stack and the code, I found the following problem.
> This thread is trapped in an infinite loop, waiting for a bucket primary
> election at "PartitionedRegion.waitForNoStorageOrPrimary":
> {code}
> "Function Execution Processor4" tid=0x55
> java.lang.Thread.State: TIMED_WAITING
> at [email protected]/java.lang.Object.wait(Native Method)
> - waiting on org.apache.geode.internal.cache.BucketAdvisor@28be7ae0
> at app//org.apache.geode.internal.cache.BucketAdvisor.waitForPrimaryMember(BucketAdvisor.java:1433)
> at app//org.apache.geode.internal.cache.BucketAdvisor.waitForNewPrimary(BucketAdvisor.java:825)
> at app//org.apache.geode.internal.cache.BucketAdvisor.getPrimary(BucketAdvisor.java:794)
> at app//org.apache.geode.internal.cache.partitioned.RegionAdvisor.getPrimaryMemberForBucket(RegionAdvisor.java:1032)
> at app//org.apache.geode.internal.cache.PartitionedRegion.getBucketPrimary(PartitionedRegion.java:9081)
> at app//org.apache.geode.internal.cache.PartitionedRegion.waitForNoStorageOrPrimary(PartitionedRegion.java:3249)
> at app//org.apache.geode.internal.cache.PartitionedRegion.getNodeForBucketWrite(PartitionedRegion.java:3234)
> at app//org.apache.geode.internal.cache.PartitionedRegion.shadowPRWaitForBucketRecovery(PartitionedRegion.java:10110)
> at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:564)
> at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:443)
> at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:195)
> at app//org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:183)
> at app//org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1177)
> at app//org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3050)
> at app//org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2910)
> at app//org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2894)
> at app//org.apache.geode.cache.RegionFactory.create(RegionFactory.java:773)
> {code}
> After creating testregion, the sender queue partitioned region is created.
> While the buckets of that region are being recovered, the command is trapped
> in an infinite loop waiting for a primary bucket election at
> PartitionedRegion.waitForNoStorageOrPrimary.
> This seems to be a known issue, because PartitionedRegion.getNodeForBucketWrite
> contains the following comment just before the call to
> waitForNoStorageOrPrimary (and the comment has been there since Geode's
> first commit!):
> {code}
> // Possible race with loss of redundancy at this point.
> // This loop can possibly create a soft hang if no primary is ever selected.
> // This is preferable to returning null since it will prevent obtaining the
> // bucket lock for bucket creation.
> return waitForNoStorageOrPrimary(bucketId, "write");
> {code}
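To illustrate the "soft hang" that the comment above describes, here is a minimal sketch of the same wait pattern with a deadline added. This is not Geode code: `PrimaryTracker`, `electPrimary` and `waitForPrimary` are hypothetical names made up for this example, and the sketch only assumes that the waiter blocks on a monitor until some other thread announces a primary. It shows how a caller could get an empty result instead of blocking forever when no primary is ever elected. Whether that is acceptable in getNodeForBucketWrite is a separate question, since the quoted comment argues against returning null at that point.

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a bucket advisor; NOT Geode's BucketAdvisor. */
class PrimaryTracker {
  // Guarded by "this"; stays null until a primary has been elected.
  private String primary;

  /** Records the elected primary and wakes every waiting thread. */
  synchronized void electPrimary(String member) {
    primary = member;
    notifyAll();
  }

  /**
   * Waits for a primary, but gives up once the deadline has passed
   * instead of looping forever.
   */
  synchronized Optional<String> waitForPrimary(long timeout, TimeUnit unit)
      throws InterruptedException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (primary == null) {
      long remaining = deadline - System.nanoTime();
      if (remaining <= 0) {
        return Optional.empty(); // no primary elected within the timeout
      }
      // Timed wait on this monitor; the loop re-checks the condition after
      // every wake-up, including spurious ones.
      TimeUnit.NANOSECONDS.timedWait(this, remaining);
    }
    return Optional.of(primary);
  }
}
```

With this shape, a call such as `tracker.waitForPrimary(30, TimeUnit.SECONDS)` returns an empty `Optional` if no election happens in time, so the caller can log the situation or fail the operation rather than hang.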
> Any idea why the primary bucket is not elected?
> The failure seems to be related to the fact that "testregion" is receiving
> updates from the receiver before the "create region" command has finished. If
> the test is repeated without traffic on cluster2, or if I create cluster1's
> receiver after creating "testregion", the problem does not occur.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)