Re: July Community Meeting

2021-07-06 Thread Alexander Murmann
We still have no topics for tomorrow. Let's skip this month.

See you all in August!

From: Alexander Murmann 
Sent: Friday, July 2, 2021 16:02
To: geode 
Subject: July Community Meeting

Hi, my favorite caching community!

Just a reminder that our next community meeting is next Wednesday, July 7th.

So far, we have no proposed topics.

Please try to add your topics ahead of time. I suggest that we skip the month 
if we have no proposed topics 12 hours before the meeting (3:00 UTC July 7th).

Hoping to see you all at the meeting!


Re: Questions about conserve-sockets and WAN replication

2021-07-06 Thread Barrett Oglesby
Alberto,

I'd have to see thread dumps on all the members on the site that has the stuck 
thread, but that sounds like you're hitting the limitation with conserve 
sockets and WAN. Are any of the stuck threads shared P2P message readers? If 
so, that is almost definitely a distributed deadlock. Setting 
conserve-sockets=false addresses that deadlock. If you post thread dumps, I'll 
take a look.
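
To illustrate the kind of deadlock described above, here is a minimal Java
sketch. It is an analogy only, not Geode code: a fixed thread pool stands in
for the reader threads, where one thread models a single shared reader and two
threads model the case where another reader is free to deliver the reply. The
names SharedReaderDemo and handlerGetsReply are made up for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SharedReaderDemo {

  /**
   * Returns true if a request handler running on a reader thread receives its
   * reply within 500 ms. The reply itself must also be delivered by a reader
   * thread from the same pool.
   */
  static boolean handlerGetsReply(int readerThreads) {
    ExecutorService readers = Executors.newFixedThreadPool(readerThreads);
    try {
      CompletableFuture<String> reply = new CompletableFuture<>();
      Future<String> handler = readers.submit(() -> {
        // The reply can only be delivered by a reader thread.
        readers.execute(() -> reply.complete("reply"));
        // Block the current reader thread until the reply arrives.
        return reply.get();
      });
      handler.get(500, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      // The only reader is blocked waiting for a reply it would have to read.
      return false;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      readers.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("one shared reader:  " + handlerGetsReply(1));
    System.out.println("two reader threads: " + handlerGetsReply(2));
  }
}
```

With a single reader the handler waits for a reply that only its own thread
could deliver, so the call times out. With a second reader the reply gets
through. The real situation spans several members, which is why thread dumps
from all of them are needed to confirm it.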

Barry

From: Dave Barnes 
Sent: Tuesday, July 6, 2021 4:05 PM
To: dev@geode.apache.org 
Subject: Re: Questions about conserve-sockets and WAN replication

Alberto,
I recently updated some of the descriptions regarding conserve-sockets.
Please check out this PR and see if it addresses any of your concerns.
https://github.com/apache/geode/pull/6516


Re: Questions about conserve-sockets and WAN replication

2021-07-06 Thread Dave Barnes
Alberto,
I recently updated some of the descriptions regarding conserve-sockets.
Please check out this PR and see if it addresses any of your concerns.
https://github.com/apache/geode/pull/6516


Questions about conserve-sockets and WAN replication

2021-07-06 Thread Alberto Gomez
Hi,

The Geode documentation states the following about conserve-sockets and WAN 
deployments in [1]:

"WAN deployments increase the messaging demands on a Geode system. To avoid 
hangs related to WAN messaging, always set `conserve-sockets=false` for Geode 
members that participate in a WAN deployment."

Could anyone please provide some more detailed information about why and where 
these hangs could happen? Is this a hard limitation or something to be 
considered under certain circumstances?

We have run into an unexpected situation, and we wonder whether it is related
to the documentation statement above:

In a system like the following:
 - 2 WAN sites and 3 servers each
 - several partitioned regions with parallel senders
 - several replicated regions with serial senders
 - conserve-sockets set to true

We have sometimes observed, when trying to stop a parallel gateway sender while
puts are being sent to both sites, that the thread stopping the gateway sender
in one of the members gets stuck waiting to receive a reply from the other
members (trying to get the size of the queue, see [2]). We also see other
threads stuck, some trying to get a lock held by the stuck thread and others
waiting in ReplyProcessor21.waitForRepliesUninterruptibly() while trying to put
or get data remotely (see [3] and [4]).
If we set conserve-sockets to false we do not experience any hang.
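
For reference, a minimal sketch of how the setting could be expressed in Java.
It only builds a java.util.Properties object containing the conserve-sockets
property named above and renders it in properties-file form. The class and
method names are made up for this example, and handing the properties to Geode
(for instance through CacheFactory or a gemfire.properties file) is assumed
here and appears only in a comment, so the snippet runs without Geode on the
classpath.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

final class WanMemberConfig {

  /** Properties for a member that participates in a WAN deployment. */
  static Properties wanMemberProperties() {
    Properties props = new Properties();
    // Each thread gets its own sockets instead of sharing them.
    props.setProperty("conserve-sockets", "false");
    return props;
  }

  /** Renders the properties in java.util.Properties file format. */
  static String asPropertiesFile(Properties props) {
    StringWriter out = new StringWriter();
    try {
      props.store(out, null);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Properties props = wanMemberProperties();
    // Assumption: with Geode on the classpath, these properties would be
    // passed on when creating the cache, e.g. new CacheFactory(props).create().
    System.out.print(asPropertiesFile(props));
  }
}
```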

Could these stuck threads be related to what the documentation states about
WAN deployments with conserve-sockets set to true, or should we rather treat
this as an unrelated bug that needs to be solved?

Thanks in advance for your help,

Alberto

[1] 
https://geode.apache.org/docs/guide/113/managing/monitor_tune/sockets_and_gateways.html

[2]
"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 daemon prio=10 os_prio=0 cpu=18.86ms elapsed=1544.80s tid=0x7f92bc1c2000 nid=0x2154 waiting on condition  [0x7f9179cd2000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
    - parking to wait for  <0x00031ca2be50> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)
    at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)
    at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
    at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
    at org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)
    at org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)
    at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)
    at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)
    at org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)
    at org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)
    at org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)
    at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)
    at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)
    at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)
    at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)
    at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)
    at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)
    at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
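
Threads parked in a given method, such as the waitForRepliesUninterruptibly
frames in the trace above, can also be listed from inside a JVM with the
standard java.lang.management API. The following is a small sketch under that
assumption. StuckThreadFinder and threadsIn are names invented for this
example, and the code only inspects the JVM it runs in, so it does not replace
collecting thread dumps from every member.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

final class StuckThreadFinder {

  /**
   * Returns "name [state]" for every live thread in this JVM whose current
   * stack contains a frame for the given class and method.
   */
  static List<String> threadsIn(String className, String methodName) {
    List<String> matches = new ArrayList<>();
    // false, false: skip monitor and synchronizer details, keep stack traces.
    ThreadInfo[] infos =
        ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
    for (ThreadInfo info : infos) {
      for (StackTraceElement frame : info.getStackTrace()) {
        if (frame.getClassName().equals(className)
            && frame.getMethodName().equals(methodName)) {
          matches.add(info.getThreadName() + " [" + info.getThreadState() + "]");
          break; // one match per thread is enough
        }
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    // Class and method names taken from the stack trace above.
    List<String> waiting = threadsIn(
        "org.apache.geode.distributed.internal.ReplyProcessor21",
        "waitForRepliesUninterruptibly");
    waiting.forEach(System.out::println);
  }
}
```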
   

NullPointerException while create region during server restart

2021-07-06 Thread Mario Kevo
Hi Geode devs,

I opened a new ticket, https://issues.apache.org/jira/browse/GEODE-9409,
regarding a NullPointerException when creating a region while one of the
servers is restarting.
If we run the "create region" command through gfsh while the server is
starting, it succeeds, but if the server is being restarted, it fails. The
difference is that for a restart we kill the server and start it again. Since
it already has a server directory, it takes longer to come up, as expected.
In that case, if we run the "create region" command, the cache may not be
fully created yet when the command tries to use it. That can lead to the
NullPointerException: region creation fetches the pdxRegistry from the cache
in findDiskStore, but sometimes the registry has not been initialized in the
cache yet, so any method invoked on it throws a NullPointerException.
This is the part of the code where the exception is thrown:

DiskStoreImpl findDiskStore(RegionAttributes regionAttributes,
    InternalRegionArguments internalRegionArgs) {
  // validate that persistent type registry is persistent
  if (getAttributes().getDataPolicy().withPersistence()) {
    getCache().getPdxRegistry().creatingPersistentRegion();
  }
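
To make the failure mode concrete, here is a small self-contained sketch. It
does not use Geode classes: Cache and Registry below are hypothetical
stand-ins, and the explicit null check is only an illustration of how the
uninitialized state could be reported with a descriptive error instead of a
bare NullPointerException. It does not remove the underlying race.

```java
final class PdxRegistryGuardSketch {

  /** Hypothetical stand-in for the PDX type registry. */
  static final class Registry {
    boolean persistentRegionSeen;

    void creatingPersistentRegion() {
      persistentRegionSeen = true;
    }
  }

  /** Hypothetical stand-in for the cache; the registry is null until initialized. */
  static final class Cache {
    private volatile Registry registry;

    void initializePdxRegistry() {
      registry = new Registry();
    }

    Registry getPdxRegistry() {
      return registry;
    }
  }

  /** Fails with a descriptive message if the registry is not there yet. */
  static void notifyPersistentRegion(Cache cache) {
    Registry registry = cache.getPdxRegistry();
    if (registry == null) {
      throw new IllegalStateException(
          "PDX registry is not initialized yet; cache startup is still in progress");
    }
    registry.creatingPersistentRegion();
  }

  public static void main(String[] args) {
    Cache cache = new Cache();
    try {
      notifyPersistentRegion(cache); // registry not initialized yet
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
    cache.initializePdxRegistry();
    notifyPersistentRegion(cache); // now succeeds
  }
}
```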

As I already mentioned, getPdxRegistry() (called from LocalRegion.java) returns
null if the registry has not yet been initialized in create()
(CacheCreation.java):

DiskStoreAttributesCreation pdxRegDSC = initializePdxDiskStore(cache);

cache.initializePdxRegistry();

createDiskStores(cache, pdxRegDSC);

I tried a few fixes, but without success.
The command succeeds if we add a retry with a sleep, but that is not
acceptable.

Does anyone have an idea how to wait until the pdxRegistry is initialized, or
another way to avoid this problem?
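
As a starting point for discussion, one generic way to wait for initialization
without a retry-and-sleep loop is a latch that is released once the value is
set. The sketch below uses only java.util.concurrent. InitGate is a
hypothetical helper, not an existing Geode class, and whether blocking the
region-creation path with a timeout is acceptable is an open design question.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: lets callers block until a value has been initialized. */
final class InitGate<T> {
  private final CountDownLatch ready = new CountDownLatch(1);
  private volatile T value;

  /** Called once by the initializing thread. */
  void set(T initialized) {
    value = initialized;
    ready.countDown(); // wakes up every waiting caller
  }

  /** Waits up to the given timeout for the value to become available. */
  T await(long timeout, TimeUnit unit)
      throws InterruptedException, TimeoutException {
    if (!ready.await(timeout, unit)) {
      throw new TimeoutException("not initialized within " + timeout + " " + unit);
    }
    return value;
  }

  public static void main(String[] args) throws Exception {
    InitGate<String> gate = new InitGate<>();
    // Simulates startup finishing on another thread.
    new Thread(() -> gate.set("pdxRegistry")).start();
    System.out.println("got: " + gate.await(5, TimeUnit.SECONDS));
  }
}
```

A caller that arrives before initialization blocks until set() is invoked or
the timeout expires, rather than polling.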

BR,
Mario