Re: Apache Geode 1.15.1 patch version

2022-09-15 Thread Jakov

Hi all,

I propose these PRs as well:

GEODE-10412: Destroy region command doesn't clear the region-related
expired tombstones <https://issues.apache.org/jira/browse/GEODE-10412>


GEODE-10281: Internal conflict resolution in replicated regions causes data
inconsistency between two sites
<https://issues.apache.org/jira/browse/GEODE-10281>


GEODE-10401: Oplog recovery takes too long due to fault in fastutil 
library <https://issues.apache.org/jira/browse/GEODE-10401>


BR,

Jakov

On 15. 09. 2022. 11:33, Alberto Gomez wrote:

Hi community,

I propose to add the following PRs to this patch release:

GEODE-10417 [Bug]: Fix NullPointerException when getting events from the gw
sender queue to complete transactions
<https://issues.apache.org/jira/browse/GEODE-10417>

GEODE-10403 [Bug]: Distributed deadlock when stopping gateway sender
<https://issues.apache.org/jira/browse/GEODE-10403>

GEODE-10371 [Improvement]: C++ Native client: Improve dispersion on
connections expiration
<https://issues.apache.org/jira/browse/GEODE-10371>

GEODE-10352 [Bug]: Update Dockerfile to use Ruby >= 2.6 in the tool to
preview Geode documentation
<https://issues.apache.org/jira/browse/GEODE-10352>

GEODE-10348 [Bug]: Correct documentation about conflation
<https://issues.apache.org/jira/browse/GEODE-10348>

GEODE-10346 [Bug]: Correct batch-time-interval description in documentation
<https://issues.apache.org/jira/browse/GEODE-10346>

GEODE-10323 [Bug]: OffHeapStorageJUnitTest testCreateOffHeapStorage fails
with AssertionError: expected:<100> but was:<1048576>
<https://issues.apache.org/jira/browse/GEODE-10323>

GEODE-10155 [Bug]: ServerConnection thread hangs when client function
execution times out
<https://issues.apache.org/jira/browse/GEODE-10155>

GEODE-10076 [Improvement]: Fix string codepoint detection
<https://issues.apache.org/jira/browse/GEODE-10076>

BR,

Alberto

From: Anthony Baker
Sent: Friday, September 9, 2022 8:14 PM
To: dev@geode.apache.org
Cc: Weijie Xu M
Subject: Re: Apache Geode 1.15.1 patch version

Thanks Mario. I removed some entries from the list that didn’t seem relevant to
a small patch release. I think previously Xu Weijie volunteered to look
at https://issues.apache.org/jira/browse/GEODE-10415.

Anthony


On Sep 8, 2022, at 11:20 PM, Mario Kevo <mario.k...@est.tech> wrote:


Hi all,

I'm going to build a new patch version of Geode.
There is a list of tasks that are declared to be fixed in 1.15.1. As they are already
assigned, can the assignees please provide fixes so we can move on?
https://issues.apache.org/jira/projects/GEODE/versions/12351801

Also, there is one blocker that would be good to include in this release, if that
is okay with all of you:
https://issues.apache.org/jira/browse/GEODE-10415

Please suggest any other tickets that are critical and should be
backported to this release, so we can get the community's opinion on them
before releasing the new version.

Thanks and BR,
Mario






Question regarding --J=-Dgeode.LOG_LEVEL_UPDATE_OCCURS system property

2022-05-18 Thread Jakov Varenina

Hi devs,

When the --J=-Dgeode.LOG_LEVEL_UPDATE_OCCURS=ALWAYS system property is
set, the server/locator ignores the log level specified in the custom
Log4J configuration file at startup. Is this expected behavior?


I could only find one note related to the above property, in the "change
loglevel" command documentation:
https://geode.apache.org/docs/guide/114/tools_modules/gfsh/command-pages/change.html


"When using a custom Log4J configuration, this command takes effect only 
if the member whose logging level you want to change was started using 
the --J=-Dgeode.LOG_LEVEL_UPDATE_OCCURS=ALWAYS system property."


From this note, it is not clear what other effects this system property has
besides allowing this command to execute. I observed in tests that the Geode
log-level parameter (e.g., start server --log-level=debug) and a custom Log4J
log-level parameter (e.g., <Logger name="org.apache.geode" level="DEBUG">) are
mutually exclusive. Users can choose between these two options with the
LOG_LEVEL_UPDATE_OCCURS system property. Is this a correct understanding?
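
For context, a minimal sketch of the kind of custom Log4J configuration
referenced above; the appender and layout are placeholder assumptions, and
only the org.apache.geode logger level is taken from the note:

<Configuration>
  <Appenders>
    <Console name="STDOUT">
      <PatternLayout pattern="%d %-5p %c: %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- the logger level that competes with the --log-level option -->
    <Logger name="org.apache.geode" level="DEBUG"/>
    <Root level="INFO">
      <AppenderRef ref="STDOUT"/>
    </Root>
  </Loggers>
</Configuration>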


Best Regards,
Jakov


Re: Peer-to-peer connections use the --server-port causing the server to hang during startup

2022-05-03 Thread Jakov Varenina

Thank you Dan. You helped a lot!

Best Regards,
Jakov

On 02. 05. 2022. 19:06, Dan Smith wrote:

I think the membership-port-range determines the port that the server side of 
the TCP socket is listening on.

What I see in your log statement is the port number of the client side of the
socket (where it says localport=37392). The port for the client side of a
socket comes from the ephemeral port range configured on your OS - usually
32768–60999 on Linux.

https://en.wikipedia.org/wiki/Ephemeral_port
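
For illustration, a minimal Java sketch of this point, with a placeholder host
and port: the local port of an outbound socket is assigned by the OS from its
ephemeral range, not from Geode's membership-port-range.

import java.net.Socket;

public class EphemeralPortDemo {
  public static void main(String[] args) throws Exception {
    // connect to any listening server; the address is a placeholder
    try (Socket socket = new Socket("192.168.1.36", 49913)) {
      // the OS picks the local (client-side) port from its ephemeral range,
      // e.g. 32768-60999 on a typical Linux system
      System.out.println("local port = " + socket.getLocalPort());
    }
  }
}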

-Dan

From: Jakov Varenina 
Sent: Monday, May 2, 2022 5:49 AM
To: dev@geode.apache.org 
Subject: Peer-to-peer connections use the --server-port causing the server to 
hang during startup


Hi devs,


We have noticed that some peer-to-peer connections use ports outside the
range defined by the membership-port-range (41000-61000) parameter:


[vm1] membership-port-range=41000-61000

[vm1] [debug 2022/05/02 11:15:57.968 CEST server-1 <Connection(1)-192.168.1.36> tid=0x1a] starting peer-to-peer handshake on
socket Socket[addr=/192.168.1.36,port=49913,*localport=37392*]


Is it expected that peer-to-peer connections use ports outside the above
range?


Also, due to the above behavior, the following bug occurs:
https://issues.apache.org/jira/browse/GEODE-10268


Best Regards,

Jakov





Peer-to-peer connections use the --server-port causing the server to hang during startup

2022-05-02 Thread Jakov Varenina

Hi devs,


We have noticed that some peer-to-peer connections use ports outside the 
range defined by the membership-port-range (41000-61000) parameter:



[vm1] membership-port-range=41000-61000

[vm1] [debug 2022/05/02 11:15:57.968 CEST server-1 <Connection(1)-192.168.1.36> tid=0x1a] starting peer-to-peer handshake on
socket Socket[addr=/192.168.1.36,port=49913,*localport=37392*]



Is it expected that peer-to-peer connections use ports outside the above 
range?



Also, due to the above behavior, the following bug occurs:
https://issues.apache.org/jira/browse/GEODE-10268



Best Regards,

Jakov


Re: WAN replication not working after re-creating the partitioned region

2022-04-26 Thread Jakov Varenina

Hi devs,

This is just a kind reminder for the question in the mail below, plus new
info about the issue.


You can find the new PR on this link: 
https://github.com/apache/geode/pull/7623


Since this PR was just a documentation update, reviewers were not 
automatically assigned. Could somebody please check it?


Best Regards,
Jakov

On 06. 04. 2022. 15:18, Jakov Varenina wrote:


Hi devs,


We have found one scenario where WAN replication is not working:


 1. Create a parallel gateway sender and the region
 2. Run some traffic so that all buckets are created
 3. Alter region to remove the gateway sender (alter region
--name=/example-region --gateway-sender-id="")
 4. Destroy the region
 5. Recreate the region with the same gateway-sender created in step 1
 6. Run some traffic to see that WAN replication is not working correctly


Is this a valid scenario?


You can find more information in this ticket: 
https://issues.apache.org/jira/browse/GEODE-10215


and PR (contains the test case that reproduces the issue): 
https://github.com/apache/geode/pull/7549



Best Regards,

Jakov


WAN replication not working after re-creating the partitioned region

2022-04-06 Thread Jakov Varenina

Hi devs,


We have found one scenario where WAN replication is not working:


1. Create a parallel gateway sender and the region
2. Run some traffic so that all buckets are created
3. Alter region to remove the gateway sender (alter region
   --name=/example-region --gateway-sender-id="")
4. Destroy the region
5. Recreate the region with the same gateway-sender created in step 1
6. Run some traffic to see that WAN replication is not working correctly
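
For reference, a hedged gfsh sketch of the scenario above; the sender id,
region name, type, and remote system id are placeholders:

gfsh> create gateway-sender --id=sender --parallel=true --remote-distributed-system-id=2
gfsh> create region --name=example-region --type=PARTITION --gateway-sender-id=sender
(run traffic so that all buckets are created)
gfsh> alter region --name=/example-region --gateway-sender-id=""
gfsh> destroy region --name=/example-region
gfsh> create region --name=example-region --type=PARTITION --gateway-sender-id=sender
(run traffic again; events are no longer replicated over the WAN correctly)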


Is this a valid scenario?


You can find more information in this ticket: 
https://issues.apache.org/jira/browse/GEODE-10215


and PR (contains the test case that reproduces the issue): 
https://github.com/apache/geode/pull/7549



Best Regards,

Jakov


Re: Question related to gateway-receivers connection load balancing

2022-03-14 Thread Jakov Varenina

Hi Barry,


Thank you for the reply and detailed analysis!


You are correct that I misunderstood the part about the coordinator locator
behavior; sorry about that. The client receives a list of locators in
the RemoteLocatorJoinResponse message. The client then loops (from first to
last) through that locator list until it successfully sends the
ClientConnectionRequest message. So the client will actually
send connection requests to the first available locator, which doesn't
have to be the coordinator.



Still, under normal conditions, all connection requests will be handled by
the same locator (the first one in the list).



Regarding the PR, I tried to align gateway-receiver connection load
handling with client connection load handling on the locator. But I
have encountered one race condition that I don't know how to solve,
which I explained in this comment:
https://github.com/apache/geode/pull/7378#issuecomment-1048513322



Thanks,

Jakov


On 12. 03. 2022. 02:33, Barry Oglesby wrote:

Jakov,

I'm not sure about the coordinator / non-coordinator behavior you're 
describing, but I see the other behavior. It doesn't seem quite right.

Here is a detailed explanation of what I see.

LocatorLoadSnapshot.getServerForConnection increments the LoadHolder's 
connections which updates its load. In the receiver's case, 
getServerForConnection is called by the remote sender to get a server to 
connect to (as you said using a ClientConnectionRequest).

Normally, the LoadHolder's load is updated when load is received in 
LocatorLoadSnapshot.updateMap (also as you said using a 
CacheServerLoadMessage). This doesn't happen in the receiver's case.

Locator behavior:

When the receiver connects, it sends its profile to the locator, which causes
its LoadHolder to be added to the __recv__group map. It is not added to the
null group map. That's the map for normal servers that have no group. Based on
the current implementation, it can't be added to this map or it would be used for
normal local (to the receiver) client connections.

LocatorLoadSnapshot.addServer location=192.168.1.5:5409; groups=[__recv__group]
LocatorLoadSnapshot.addGroups group=__recv__group; location=192.168.1.5:5409
LocatorLoadSnapshot.addGroups not adding to the null map group=__recv__group; 
location=192.168.1.5:5409

When the load is received for the receiver, it is ignored. updateMap gets the 
LoadHolder from the null group map. Since that map was not updated for the 
receiver, there is no entry for it (holder=null below).

LocatorLoadSnapshot.updateLoad about to update connectionLoadMap 
location=192.168.1.5:5409
LocatorLoadSnapshot.updateMap location=192.168.1.5:5409; load=0.0; 
loadPerConnection=0.00125
LocatorLoadSnapshot.updateMap location=192.168.1.5:5409; load=0.0; 
loadPerConnection=0.00125; holder=null
LocatorLoadSnapshot.updateLoad ignoring load location=192.168.1.5:5409
LocatorLoadSnapshot.updateLoad done update connectionLoadMap 
location=192.168.1.5:5409

So, load is not updated in this way for a receiver.

When a request for a remote receiver is received, it uses the __recv__group 
load to provide that server. It also increments the load to that server (in 
LoadHolder.incConnections). This is how load is updated for a receiver.

LocatorLoadSnapshot.getServerForConnection group=__recv__group
LocatorLoadSnapshot.getServerForConnection group=__recv__group; 
potentialServers={192.168.1.5:5409@192.168.1.5(ln-1:81083):41002=LoadHolder[0.0,
 192.168.1.5:5409, loadPollInterval=5000, 0.00125]}
LoadHolder.incConnections location=192.168.1.5:5409; load=0.00125
LocatorLoadSnapshot.getServerForConnection group=__recv__group; 
usingServer=192.168.1.5:5409

Receiver server behavior:

When a receiver gets a new connection, the ServerConnection.processHandShake 
updates the LoadMonitor. The LoadMonitor explicitly does not update the 
connection count because isClientOperations=false, so the load is never changed 
on the server. This is interesting behavior. I'm not sure if it wasn't updated 
because it was unnecessary given the behavior of the locator above.

ServerConnection.processHandShake about to update the LoadMonitor 
communicationMode=gateway
LoadMonitor.connectionOpened isClientOperations=false; 
type=GatewayReceiverStatistics
LoadMonitor.connectionOpened did not increment connectionCount=0; 
type=GatewayReceiverStatistics
ServerConnection.processHandShake done update the LoadMonitor

The LoadMonitor on the server does send the load periodically to the locator but 
only because skippedLoadUpdates>forceUpdateFrequency (which is 10 times through 
the polling loop by default):

PollingThread.run got load type=GatewayReceiverStatistics; load=Load(0.0, 
0.00125, 0.0, 1.0)
PollingThread.run forceUpdateFrequency=true
PollingThread.run about to send CacheServerLoadMessage 
type=Gatew

Question related to gateway-receivers connection load balancing

2022-03-10 Thread Jakov Varenina

Hi devs,

We have observed some weird behavior related to load balancing of
gateway-receiver connections in the Geode cluster. Currently, the
gateway-receiver connection load is only updated on the coordinator locator
when it provides a server location to a remote gateway-sender in the
ClientConnectionRequest{group=__recv__group...}/ClientConnectionResponse
message exchange. Other locators never update the gateway-receiver
connection load, since they do not handle these messages.
Additionally, locators (including the coordinator) ignore
CacheServerLoadMessage messages that carry the receiver's
connection load. This means that locators will not adjust the load when
a connection on some receiver is shut down.
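
For illustration, a simplified, hypothetical Java model of the accounting
behavior described above; the real logic lives in LocatorLoadSnapshot and
LoadHolder, and the class below is purely illustrative:

import java.util.HashMap;
import java.util.Map;

public class ReceiverLoadSketch {
  private final Map<String, Float> loadPerReceiver = new HashMap<>();

  ReceiverLoadSketch(String... receivers) {
    for (String r : receivers) loadPerReceiver.put(r, 0.0f);
  }

  // a remote sender asks for a receiver: pick the least loaded and bump its load
  String serverForConnection() {
    String least = loadPerReceiver.entrySet().stream()
        .min(Map.Entry.comparingByValue()).get().getKey();
    loadPerReceiver.merge(least, 0.00125f, Float::sum); // LoadHolder.incConnections
    return least;
  }

  // a CacheServerLoadMessage arrives: a no-op for receivers, so a closed
  // connection never lowers the recorded load
  void updateLoad(String receiver, float reportedLoad) {}

  public static void main(String[] args) {
    ReceiverLoadSketch locator = new ReceiverLoadSketch("recv-1", "recv-2");
    for (int i = 0; i < 4; i++) {
      System.out.println(locator.serverForConnection());
    }
    locator.updateLoad("recv-1", 0.0f); // ignored: recorded load stays inflated
  }
}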


Is this expected behavior, or is this a bug?

You can find more information in this PR:

https://github.com/apache/geode/pull/7378#issuecomment-1048513322

and ticket:

https://issues.apache.org/jira/browse/GEODE-10056

Thanks,

Jakov



Re: Question regarding geode thread priorities

2022-02-01 Thread Jakov Varenina

Hi Kirk,

No problem and thank you for the answer!

Best Regards :),

Jakov

On 27. 01. 2022. 00:42, Kirk Lund wrote:

PS: Sorry if I didn't realize what BRs is :D

On Wed, Jan 26, 2022 at 3:39 PM Kirk Lund  wrote:


Hi BRs/Jakov,

I'm familiar with most of these threads, and these ones I know of do not
spawn more than one thread total. Most of these are quite old, possibly
predating Executors in Java. I doubt using max priority is important for
these threads, but you should probably do some perf testing if you want to
remove setMaxPriority. I recommend using
https://github.com/apache/geode-benchmarks as well as writing targeted
JMH micro-benchmarks.

Cheers,
Kirk

On Mon, Jan 24, 2022 at 3:50 AM Jakov Varenina 
wrote:


Hi community,

We have come across some code in Geode that prioritizes some of the
threads using

https://cr.openjdk.java.net/~iris/se/11/latestSpec/api/java.base/java/lang/Thread.html#setPriority(int).

Below you can find links to code.


https://github.com/apache/geode/blob/41eb49989f25607acfcbf9ac5afe3d4c0721bb35/geode-core/src/main/java/org/apache/geode/internal/statistics/HostStatSampler.java#L304


https://github.com/apache/geode/blob/41eb49989f25607acfcbf9ac5afe3d4c0721bb35/geode-core/src/main/java/org/apache/geode/internal/cache/control/OffHeapMemoryMonitor.java#L92


https://github.com/apache/geode/blob/d79a3c78eab96a9e760db07fa42580e61586b9c5/geode-core/src/main/java/org/apache/geode/internal/cache/control/InternalResourceManager.java#L147


https://github.com/apache/geode/blob/a5bd36f9fa787d3a71c6e6efafed5a7b0fe52d2b/geode-core/src/main/java/org/apache/geode/internal/tcp/TCPConduit.java#L343

Just to add that every new thread inherits its parent thread's priority, so
that means there will be more threads with max priority in addition
to the above threads. Does somebody know why this is set for these
particular threads?

Additionally, multiple online resources indicate that these
priorities are not taken into account by the Linux scheduler unless
additional JVM parameters are set (UseThreadPriorities and
ThreadPriorityPolicy); please check these links for more information:

https://github.com/openjdk/jdk/blob/jdk8-b120/hotspot/src/share/vm/runtime/globals.hpp#L3369,L3392
and

https://github.com/openjdk/jdk/blob/jdk8-b120/hotspot/src/os/linux/vm/os_linux.cpp#L3961,L3966

Are the priorities that are set in the Apache Geode code crucial and
should they be enabled for better performance, or should they not be used?
Also, did I maybe miss something, and are these priorities somehow used
even without setting the mentioned JVM parameters?

Any help on this topic is welcome and sorry for bothering!

BRs/Jakov




Question regarding geode thread priorities

2022-01-24 Thread Jakov Varenina

Hi community,

We have come across some code in Geode that prioritizes some of the
threads using
https://cr.openjdk.java.net/~iris/se/11/latestSpec/api/java.base/java/lang/Thread.html#setPriority(int).
Below you can find links to the code.


https://github.com/apache/geode/blob/41eb49989f25607acfcbf9ac5afe3d4c0721bb35/geode-core/src/main/java/org/apache/geode/internal/statistics/HostStatSampler.java#L304

https://github.com/apache/geode/blob/41eb49989f25607acfcbf9ac5afe3d4c0721bb35/geode-core/src/main/java/org/apache/geode/internal/cache/control/OffHeapMemoryMonitor.java#L92

https://github.com/apache/geode/blob/d79a3c78eab96a9e760db07fa42580e61586b9c5/geode-core/src/main/java/org/apache/geode/internal/cache/control/InternalResourceManager.java#L147

https://github.com/apache/geode/blob/a5bd36f9fa787d3a71c6e6efafed5a7b0fe52d2b/geode-core/src/main/java/org/apache/geode/internal/tcp/TCPConduit.java#L343

Just to add that every new thread inherits its parent thread's priority, so
that means there will be more threads with max priority in addition
to the above threads. Does somebody know why this is set for these
particular threads?
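
For illustration, a minimal Java sketch of that inheritance behavior: a thread
created by a max-priority thread starts at max priority as well, so raising one
thread's priority can fan out to everything it spawns.

public class PriorityInheritanceDemo {
  public static void main(String[] args) throws Exception {
    Thread parent = new Thread(() -> {
      Thread.currentThread().setPriority(Thread.MAX_PRIORITY);
      // the child is constructed with its creator's priority, i.e. 10
      Thread child = new Thread(() -> System.out.println(
          "child priority = " + Thread.currentThread().getPriority()));
      child.start();
    });
    parent.start();
    parent.join();
  }
}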


Additionally, multiple online resources indicate that these
priorities are not taken into account by the Linux scheduler unless
additional JVM parameters are set (UseThreadPriorities and
ThreadPriorityPolicy); please check these links for more information:
https://github.com/openjdk/jdk/blob/jdk8-b120/hotspot/src/share/vm/runtime/globals.hpp#L3369,L3392 
and 
https://github.com/openjdk/jdk/blob/jdk8-b120/hotspot/src/os/linux/vm/os_linux.cpp#L3961,L3966


Are the priorities that are set in the Apache Geode code crucial and
should they be enabled for better performance, or should they not be used?
Also, did I maybe miss something, and are these priorities somehow used
even without setting the mentioned JVM parameters?


Any help on this topic is welcome and sorry for bothering!

BRs/Jakov



Re: Question related to orphaned .drf files in disk-store

2021-12-03 Thread Jakov Varenina

Hi Anthony,

I'm not sure about normal conditions, but at the moment we were investigating
the issue there were 21 .crf files in the disk-store (on one server) with the
default max-oplog-size (1 GB) and compaction-threshold (50%).

BRs/Jakov

On 02. 12. 2021. 17:06, Anthony Baker wrote:

Related but different question:  how many active oplogs do you normally see at 
one time?  You may want to adjust the max-oplog-size if the default of 1 GB is 
too small.
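
For reference, a hedged example of raising that limit when creating a disk
store; the store name and directory are placeholders, and --max-oplog-size is
specified in megabytes:

gfsh> create disk-store --name=store1 --dir=/var/geode/store1 --max-oplog-size=4096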

On Dec 2, 2021, at 1:11 AM, Jakov Varenina <jakov.varen...@est.tech> wrote:

Hi Dan,

We forgot to mention that we actually configure off-heap for the regions, so
cache entry values are stored outside the heap memory. Only Oplog objects that
are not compacted and that have a .crf file reference the live entries from
the cache. These Oplog objects are not stored in the onlyDrfOplogs hashmap. The
onlyDrfOplogs map holds only Oplog objects that represent orphaned .drf
files (the ones without an accompanying .crf and .krf file). These objects have
been compacted and don't contain a reference to any live entry from the
cache. All of this 18 GB is actually occupied by empty pendingKrfTags hashmaps.

In this case there are 7680 Oplog objects stored in onlyDrfOplogs. Every Oplog
object has its own regionMap hashmap. Every regionMap can contain hundreds of
empty pendingKrfTags hashmaps. When you bring all that together, you get more
than 18 GB of unnecessary heap memory.
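
For illustration, a minimal Java sketch of the retention effect described
above; it inspects HashMap internals via reflection, which on newer JDKs may
require --add-opens java.base/java.util=ALL-UNNAMED:

import java.lang.reflect.Field;
import java.util.HashMap;

public class RetainedCapacityDemo {
  public static void main(String[] args) throws Exception {
    HashMap<Integer, Object> pendingKrfTags = new HashMap<>();
    for (int i = 0; i < 5000; i++) {
      pendingKrfTags.put(i, new Object());
    }
    pendingKrfTags.clear(); // size() == 0, but the bucket array is retained

    Field tableField = HashMap.class.getDeclaredField("table");
    tableField.setAccessible(true);
    Object[] buckets = (Object[]) tableField.get(pendingKrfTags);
    // typically prints "size = 0, retained capacity = 8192" on OpenJDK,
    // since clear() never shrinks the internal table
    System.out.println("size = " + pendingKrfTags.size()
        + ", retained capacity = " + buckets.length);
  }
}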

Thank you all for the quick review of the PR and the fast response to our questions!

BRs/Jakov


On 02. 12. 2021. 00:25, Dan Smith wrote:
Interesting. It does look like that pendingKrfTags structure is wasting memory.

I think that retained heap of 20 gigabytes might include your entire cache, 
because those objects have references back to the Cache object. However with 6K 
oplogs each having an empty map with 4K elements that does add up.

-Dan
----
*From:* Jakov Varenina <jakov.varen...@est.tech>
*Sent:* Tuesday, November 30, 2021 5:53 AM
*To:* dev@geode.apache.org
*Subject:* Re: Question related to orphaned .drf files in disk-store

Hi Dan and all,


Just to provide an additional picture that better represents the severity
of the problem with pendingKrfTags. After you check the second picture
in the mail below, please come back and check this one as well. Here you can
see that pendingKrfTags is empty and has a capacity of 9,192 allocated in memory.



Sorry for any inconvenience.

BRs/Jakov


On 30. 11. 2021. 09:32, Jakov Varenina wrote:

Hi Dan,


Thank you for your answer!


We have identified a memory leak in the Oplog objects that represent orphaned
.drf files in heap memory. In the screenshot below you can see that 7680
onlyDrfOplogs consume more than 18 GB of heap, which doesn't seem correct.



In the picture below you can see that the drfOnlyOplog.Oplog.regionMap.pendingKrfTags
structure is responsible for more than 95% of the drfOnlyOplogs heap memory.




The pendingKrfTags structure is actually empty (although it consumes memory
because it was used previously and the size of the HashMap was not reduced) and
not used by the onlyDrfOplogs objects. Additionally, the regionMap.liveEntries
linked list has just one element (a fake disk entry, OplogDiskEntry, indicating
that it is empty) and it is also not used. You can find more details on why the
pendingKrfTags structure remained in memory for Oplog objects representing
orphaned .drf files, and a possible solution, in the following ticket and PR:


https://issues.apache.org/jira/browse/GEODE-9854

https://github.com/apache/geode/pull/7145


BRs/Jakov

Re: Question related to orphaned .drf files in disk-store

2021-12-02 Thread Jakov Varenina

Hi Dan,

We forgot to mention that we actually configure off-heap for the
regions, so cache entry values are stored outside the heap memory. Only
Oplog objects that are not compacted and that have a .crf file
reference the live entries from the cache. These Oplog objects are not
stored in the onlyDrfOplogs hashmap. The onlyDrfOplogs map holds only Oplog
objects that represent orphaned .drf files (the ones without an
accompanying .crf and .krf file). These objects have been compacted and
don't contain a reference to any live entry from the cache. All of
this 18 GB is actually occupied by empty pendingKrfTags hashmaps.


In this case there are 7680 Oplog objects stored in onlyDrfOplogs. Every
Oplog object has its own regionMap hashmap. Every regionMap can contain
hundreds of empty pendingKrfTags hashmaps. When you bring all that
together, you get more than 18 GB of unnecessary heap memory.


Thank you all for the quick review of the PR and the fast response to our questions!

BRs/Jakov


On 02. 12. 2021. 00:25, Dan Smith wrote:
Interesting. It does look like that pendingKrfTags structure is 
wasting memory.


I think that retained heap of 20 gigabytes might include your entire 
cache, because those objects have references back to the Cache object. 
However with 6K oplogs each having an empty map with 4K elements that 
does add up.


-Dan
----
*From:* Jakov Varenina 
*Sent:* Tuesday, November 30, 2021 5:53 AM
*To:* dev@geode.apache.org 
*Subject:* Re: Question related to orphaned .drf files in disk-store

Hi Dan and all,


Just to provide an additional picture that better represents the
severity of the problem with pendingKrfTags. After you check
the second picture in the mail below, please come back and check this
one as well. Here you can see that pendingKrfTags is empty and has a
capacity of 9,192 allocated in memory.




Sorry for any inconvenience.

BRs/Jakov


On 30. 11. 2021. 09:32, Jakov Varenina wrote:


Hi Dan,


Thank you for your answer!


We have identified a memory leak in the Oplog objects that represent
orphaned .drf files in heap memory. In the screenshot below you can see
that 7680 onlyDrfOplogs consume more than 18 GB of heap, which doesn't
seem correct.




In the picture below you can see that the
drfOnlyOplog.Oplog.regionMap.pendingKrfTags structure is responsible
for more than 95% of the drfOnlyOplogs heap memory.





The pendingKrfTags structure is actually empty (although it consumes
memory because it was used previously and the size of the HashMap was
not reduced) and not used by the onlyDrfOplogs objects. Additionally,
the regionMap.liveEntries linked list has just one element (a fake disk
entry, OplogDiskEntry, indicating that it is empty) and it is also not
used. You can find more details on why the pendingKrfTags structure
remained in memory for Oplog objects representing orphaned .drf files,
and a possible solution, in the following ticket and PR:



https://issues.apache.org/jira/browse/GEODE-9854

https://github.com/apache/geode/pull/7145



BRs/Jakov



On 24. 11. 2021. 23:12, Dan Smith wrote:
The .drf file contains destroy records for entries in any older 
oplog. So even if the corresponding .crf file has been deleted, the 
.drf file with the same number still needs to be retained until the 
older .crf files are all deleted.


7680 does seem like a lot of oplogs. That data structure is just
references to the files themselves; I don't think we are keeping the
contents of the .drf files in memory, except during recovery.


-Dan
----
*From:* Jakov Varenina <jakov.varen...@est.tech>

*Sent:* Wednesday, November 24, 2021 11:13 AM
*To:* dev@geode.apache.org

*Subject:* Question related to orphaned .drf files in disk-store
Hi devs,

We have noticed that disk-store folder can contain orphaned .drf 
files (only .drf file without accompanying .crf and .krf file with 
the same

Re: Exception when trying to use Continuous Queries in geode-wan module distributed test

2021-05-11 Thread Jakov Varenina
You can ignore this question; I have been able to resolve it by
adding a geode-cq dependency to geode-wan/build.gradle.
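
For reference, a hedged sketch of that change in geode-wan/build.gradle; the
exact dependency configuration name is an assumption and may differ between
Geode versions:

dependencies {
  // hypothetical configuration name for the distributed-test source set
  distributedTestImplementation(project(':geode-cq'))
}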


Sorry to bother you all like this.

On 11. 05. 2021. 17:57, Jakov Varenina wrote:

Hi all,

I'm trying to create a distributed test in the geode-wan module that
combines WAN and CQ functionality. When I execute the test case, I
always get the following error:


 java.lang.IllegalStateException: CqService is not available.
    at 
org.apache.geode.cache.query.internal.cq.MissingCqService.start(MissingCqService.java:158)
    at 
org.apache.geode.cache.query.internal.DefaultQueryService.getCqService(DefaultQueryService.java:829)
    at 
org.apache.geode.cache.query.internal.DefaultQueryService.newCq(DefaultQueryService.java:569)
    at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderOperationClusterConfigDUnitTest.lambda$createDurableCQs$96c20508$1(ParallelGatewaySenderOperationClusterConfigDUnitTest.java:126)


It seems that the CqService is statically initialized using the Java
service provider interface:


- service provider class: 
https://github.com/apache/geode/blob/a5bd36f9fa787d3a71c6e6efafed5a7b0fe52d2b/geode-core/src/main/java/org/apache/geode/cache/query/internal/cq/CqServiceProvider.java


- meta-inf.services file: 
https://github.com/apache/geode/blob/a5bd36f9fa787d3a71c6e6efafed5a7b0fe52d2b/geode-cq/src/main/resources/META-INF/services/org.apache.geode.cache.query.internal.cq.spi.CqServiceFactory


So if I understood the above exception correctly, the CqService cannot be
used in a distributed test in the geode-wan module, because it won't be
statically initialized in that case. Is this correct?


Do you maybe know how to overcome this issue?

Please feel free to ask for any additional information, or correct me if I
misunderstood this exception completely.


BRs/Jakov







Exception when trying to use Continuous Queries in geode-wan module distributed test

2021-05-11 Thread Jakov Varenina

Hi all,

I'm trying to create a distributed test in the geode-wan module that combines
WAN and CQ functionality. When I execute the test case, I always get the
following error:


 java.lang.IllegalStateException: CqService is not available.
    at 
org.apache.geode.cache.query.internal.cq.MissingCqService.start(MissingCqService.java:158)
    at 
org.apache.geode.cache.query.internal.DefaultQueryService.getCqService(DefaultQueryService.java:829)
    at 
org.apache.geode.cache.query.internal.DefaultQueryService.newCq(DefaultQueryService.java:569)
    at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderOperationClusterConfigDUnitTest.lambda$createDurableCQs$96c20508$1(ParallelGatewaySenderOperationClusterConfigDUnitTest.java:126)


It seems that the CqService is statically initialized using the Java service
provider interface:


- service provider class: 
https://github.com/apache/geode/blob/a5bd36f9fa787d3a71c6e6efafed5a7b0fe52d2b/geode-core/src/main/java/org/apache/geode/cache/query/internal/cq/CqServiceProvider.java


- meta-inf.services file: 
https://github.com/apache/geode/blob/a5bd36f9fa787d3a71c6e6efafed5a7b0fe52d2b/geode-cq/src/main/resources/META-INF/services/org.apache.geode.cache.query.internal.cq.spi.CqServiceFactory


So if I understood the above exception correctly, the CqService cannot be
used in a distributed test in the geode-wan module, because it won't be
statically initialized in that case. Is this correct?


Do you maybe know how to overcome this issue?

Please feel free to ask for any additional information, or correct me if I
misunderstood this exception completely.


BRs/Jakov







Question regarding VersionRequest/VersionResponse messages

2021-03-09 Thread Jakov Varenina

Hi community,

I have one question regarding VersionRequest/VersionResponse messages.

Before a member sends an actual message, it first has to determine the remote
member's version. This is done by exchanging
VersionRequest/VersionResponse messages using the function
getServerVersion() from the class TcpClient.java. There is a part of the code
in getServerVersion() for which I'm unsure which case it actually
covers:


https://github.com/apache/geode/blob/854456c81ca7b9545eba252b6fa075318bb33af8/geode-tcp-server/src/main/java/org/apache/geode/distributed/internal/tcpserver/TcpClient.java#L289 



try {
  final Object readObject = objectDeserializer.readObject(versionedIn);
  if (!(readObject instanceof VersionResponse)) {
    throw new IllegalThreadStateException(
        "Server version response invalid: This could be the result of trying to "
            + "connect a non-SSL-enabled client to an SSL-enabled locator.");
  }

  final VersionResponse response = (VersionResponse) readObject;
  serverVersion = response.getVersionOrdinal();
  serverVersions.put(address, serverVersion);
  return serverVersion;
} catch (EOFException ex) {
  // old locators will not recognize the version request and will close the connection
}

...
return KnownVersion.OLDEST.ordinal();

The case is when readObject() tries to read the VersionResponse and
throws an EOFException. As you can see, there is a comment in the catch
statement explaining the case, but I'm not sure that I understand it
correctly. What I assume is that a member running an old version of Geode
(less than or equal to KnownVersion.OLDEST) does not support the
VersionRequest message, and will close the connection if it receives such a
message. This will result in an EOFException on the remote member that sent
the VersionRequest. Is this a correct understanding? Is this case still valid
with the oldest non-deprecated version set to GFE_81?


The reason why I'm asking this is because in some cases in a Kubernetes
cluster it is possible to get an EOFException when the remote member is not
yet accessible. In this case the member will still try to send a message (e.g.
RemoteLocatorJoinRequest), thinking that there is an old member on the other
side, and that will then result in a ClassCastException when reading the
response (e.g. RemoteLocatorJoinResponse).


BRs,

Jakov



Re: Request for rights to write comments on Apache Geode Confluence

2021-01-11 Thread Jakov Varenina

Thank you, Dan!

BRs,

Jakov

On 11. 01. 2021. 18:40, Dan Smith wrote:

Done. You should have access now.

Thanks!
-Dan

From: Jakov Varenina 
Sent: Monday, January 11, 2021 2:09 AM
To: dev@geode.apache.org 
Subject: Request for rights to write comments on Apache Geode Confluence

Hi devs,

Could you please give me rights to write comments on the Apache Geode
Confluence page?

username: jakov.varenina

BRs,

Jakov




Request for rights to write comments on Apache Geode Confluence

2021-01-11 Thread Jakov Varenina

Hi devs,

Could you please give me rights to write comments on the Apache Geode
Confluence page?


username: jakov.varenina

BRs,

Jakov



Feature proposal: Persist gateway-sender state within Cluster Configuration

2021-01-07 Thread Jakov Varenina

Hi all,

We would like to propose a new feature:

https://cwiki.apache.org/confluence/display/GEODE/Persist+gateway-sender+state+within+Cluster+Configuration

Could you please check it and comment?

BRs,

Jakov



Re: Member that is shutting down initiate removal of other members from the cluster

2020-10-20 Thread Jakov Varenina

Hi Ernie,

Thank you very much for trying to help.

Unfortunately we haven't been able to reproduce this issue without k8s.

I would like to share some additional info about this issue. We know
that this is not the "correct" way to shut down a member gracefully, since
its complete TCP communication towards the other members is forcefully
terminated even before it receives the shutdown indication. The issue with
removals from the previous mail can easily be avoided by enforcing a policy
that the TCP connections towards other members aren't terminated until the
member terminates them by itself during graceful shutdown.


Another thing is that the removals occur only when we shut down the
complete k8s node as described in the previous mail. When we reproduce the
same situation without shutting down the k8s node, then the removals aren't
initiated:


"In parallel terminate member TCP communication and initiate gracefully 
shut down the member. What happens then is that availability check over 
UDP using Heartbeat messages towards other members pass, and removals 
aren't initiated towards other members"


Given the above additional test result when using k8s with the 1.12 release,
and due to the fact that we were unable to reproduce the issue on Geode
alone, we are not sure that the problem is actually in Geode. Please feel
free to contact me by Slack or mail for any additional questions or
things you would like to discuss.


BRs,

Jakov

On 19. 10. 2020. 20:26, Ernie Burghardt wrote:

Hi Jakov,

I'm looking into your question(s)... curious if you've run into this in a 
non-k8s cluster?
Might help focus the investigation...

Thanks,
EB

On 10/13/20, 7:51 AM, "Jakov Varenina"  wrote:

 Hi all,

 sorry for bothering, but we have noticed some differences in behavior
 between 1.11 and 1.12 releases and need your help in understanding them.

 First I would like to mention that we are running geode in Kubernetes.
 We perform a shutdown of the worker node that is hosting one member (e.g.
 the coordinator locator). The shutdown procedure affects the member in the following way:

 1. TCP shared unordered connections towards other members are terminated

 2. Member receives graceful shut-down indication and starts with the
 shut-down procedure

 Usually the connections start to be terminated first and the shut-down
 indication comes shortly after (e.g. ~10 milliseconds apart). Step 1
 triggers an availability check towards the other members for which the
 TCP connection has been previously lost. At this point in time the
 coordinator is unaware of the ongoing shut-down and assumes that all other
 members are actually having issues due to connection loss. Even after the
 coordinator receives the graceful shut-down indication, this process of
 availability checking is not stopped. What happens later on is that the
 availability check fails for all members and the coordinator initiates their
 removal with RemoveMemberMessage. This message is successfully received
 on the other members, forcing them to shut down.

 In Geode 1.11 everything is the same except that the availability check
 passes and therefore removals aren't initiated.

 In the logs it can be seen that for both releases the TCP availability check
 fails, but the HeartbeatRequestMessage/HeartbeatMessage check fails only on
 1.12 and passes on 1.11. In the 1.12 release it can be seen that heartbeat
 request and heartbeat messages are sent but do not reach their
 destination members. The RemoveMemberMessages which are sent later on reach
 their destination successfully. Does anybody know what was changed in
 1.12 that could lead to such a difference in behavior?

 Additionally, the availability check is not stopped when a graceful shutdown
 is initiated. Do you think that this could be improved, so that a member
 stops an ongoing availability check when it detects graceful shutdown? Just
 to add that the shutdown procedure is also delayed due to unsuccessful
 attempts to establish TCP connections towards the other members.

 BRs,
 Jakov




Member that is shutting down initiate removal of other members from the cluster

2020-10-13 Thread Jakov Varenina

Hi all,

sorry for bothering, but we have noticed some differences in behavior 
between 1.11 and 1.12 releases and need your help in understanding them.


First I would like to mention that we are running geode in Kubernetes.
We perform a shutdown of the worker node that is hosting one member (e.g.
the coordinator locator). The shutdown procedure affects the member in the following way:


1. TCP shared unordered connections towards other members are terminated

2. Member receives graceful shut-down indication and starts with the 
shut-down procedure


Usually the connections start to be terminated first and the shut-down
indication comes shortly after (e.g. ~10 milliseconds apart). Step 1
triggers an availability check towards the other members for which the
TCP connection has been previously lost. At this point in time the
coordinator is unaware of the ongoing shut-down and assumes that all other
members are actually having issues due to connection loss. Even after the
coordinator receives the graceful shut-down indication, this process of
availability checking is not stopped. What happens later on is that the
availability check fails for all members and the coordinator initiates their
removal with RemoveMemberMessage. This message is successfully received
on the other members, forcing them to shut down.


In Geode 1.11 everything is the same except that the availability check
passes and therefore removals aren't initiated.


In the logs it can be seen that for both releases the TCP availability check
fails, but the HeartbeatRequestMessage/HeartbeatMessage check fails only on
1.12 and passes on 1.11. In the 1.12 release it can be seen that heartbeat
request and heartbeat messages are sent but do not reach their
destination members. The RemoveMemberMessages which are sent later on reach
their destination successfully. Does anybody know what was changed in
1.12 that could lead to such a difference in behavior?


Additionally, the availability check is not stopped when a graceful shutdown
is initiated. Do you think that this could be improved, so that a member
stops an ongoing availability check when it detects graceful shutdown? Just
to add that the shutdown procedure is also delayed due to unsuccessful
attempts to establish TCP connections towards the other members.


BRs,
Jakov



Re: Non-persistent parallel gateway sender on non-persistent region (collocated with persistent region)

2020-07-17 Thread Jakov Varenina

Hi Barrett,

We suspected that this could be a bug, and it is great that you
confirmed it and created a fix this quickly.


Thank you very much for your effort!

BRs,

Jakov

On 16. 07. 2020. 21:39, Barrett Oglesby wrote:

I think you've found a bug in this scenario. The 
ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR method currently 
compares the data policy of the input region's leader region with the sender's 
persistence policy. It assumes the input region and the leader region have the 
same data policy. In this scenario, that is not the case. The input region is 
'part_a' which is not persistent, and the leader region is '_part_hidden' which 
is persistent. The sender is 'sender' which is not persistent. So, instead of 
comparing the data policy of 'part_a' to the sender which would succeed since 
they are both not persistent, it compares the data policy of '_part_hidden' to 
the sender which fails since one is persistent and one is not.
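
For illustration, a self-contained, simplified model of the check Barrett
describes; the real logic lives in
ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR, and the Region
class below is purely hypothetical:

public class SenderPersistenceCheck {
  static class Region {
    final String name;
    final boolean persistent;
    final Region colocationLeader; // null when the region leads its own chain

    Region(String name, boolean persistent, Region colocationLeader) {
      this.name = name;
      this.persistent = persistent;
      this.colocationLeader = colocationLeader;
    }

    Region leader() {
      return colocationLeader == null ? this : colocationLeader;
    }
  }

  public static void main(String[] args) {
    Region hidden = new Region("_part_hidden", true, null);
    Region partA = new Region("part_a", false, hidden);
    boolean senderPersistent = false;

    // buggy check: compares the colocation *leader* region's persistence,
    // so non-persistent part_a is rejected because _part_hidden is persistent
    System.out.println("buggy check rejects part_a: "
        + (partA.leader().persistent != senderPersistent)); // true
    // fixed check: compares the region the sender is actually attached to
    System.out.println("fixed check rejects part_a: "
        + (partA.persistent != senderPersistent)); // false
  }
}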

I made a small change to 
ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR to address the 
issue. I'll file a JIRA and run CI on it to see if it is a valid change. I'll 
also add a test for this scenario.


From: Jakov Varenina 
Sent: Friday, July 10, 2020 3:34 AM
To: dev@geode.apache.org 
Subject: Re: Non-persistent parallel gateway sender on non-persistent region 
(collocated with persistent region)

Hi devs,

just a kind reminder. We would be really grateful if you could take a look
at the question in the mail below.

BRs,

Jakov

On 06. 07. 2020. 15:50, Jakov Varenina wrote:

Hi all,


We are trying to set up a non-persistent parallel gateway sender
(‘sender’) on a non-persistent partitioned region (‘part_a’). This
works OK.
But when this same region ‘part_a’ is colocated with another
persistent region (‘_part_hidden’), Geode throws an exception:

Exception in thread "main"
org.apache.geode.internal.cache.wan.GatewaySenderException: Non
persistent gateway sender sender can not be attached to persistent
region /_part_hidden
 at
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:461)
 at
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:451)
 at
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:191)
 at
org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:177)
 at
org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1174)
 at
org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3010)
 at
org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2869)
 at
org.apache.geode.internal.cache.xmlcache.RegionCreation.createRoot(RegionCreation.java:237)
 at
org.apache.geode.internal.cache.xmlcache.CacheCreation.initializeRegions(CacheCreation.java:658)
 at
org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:592)
 at
org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:338)
 at
org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4081)
 at
org.apache.geode.internal.cache.GemFireCacheImpl.initializeDeclarativeCache(GemFireCacheImpl.java:1535)
 at
org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1374)
 at
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
 at
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:158)
 at
org.apache.geode.cache.CacheFactory.create(CacheFactory.java:142)
 at
org.apache.geode.distributed.internal.DefaultServerLauncherCacheProvider.createCache(DefaultServerLauncherCacheProvider.java:52)
 at
org.apache.geode.distributed.ServerLauncher.createCache(ServerLauncher.java:894)
 at
org.apache.geode.distributed.ServerLauncher.start(ServerLauncher.java:809)
 at
org.apache.geode.distributed.ServerLauncher.run(ServerLauncher.java:739)
 at
org.apache.geode.distributed.ServerLauncher.main(ServerLauncher.java:256)


This is the cache.xml used:

<cache xmlns="http://geode.apache.org/schema/cache"
...

Re: Non-persistent parallel gateway sender on non-persistent region (collocated with persistent region)

2020-07-10 Thread Jakov Varenina

Hi devs,

just a kind reminder. We would be really grateful if you could take a look
at the question in the mail below.


BRs,

Jakov

On 06. 07. 2020. 15:50, Jakov Varenina wrote:

Hi all,


We are trying to set up a non-persistent parallel gateway sender
(‘sender’) on a non-persistent partitioned region (‘part_a’). This
works OK.
But when this same region ‘part_a’ is colocated with another
persistent region (‘_part_hidden’), Geode throws an exception:


Exception in thread "main" 
org.apache.geode.internal.cache.wan.GatewaySenderException: Non 
persistent gateway sender sender can not be attached to persistent 
region /_part_hidden
    at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:461)
    at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:451)
    at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:191)
    at 
org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:177)
    at 
org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1174)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3010)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2869)
    at 
org.apache.geode.internal.cache.xmlcache.RegionCreation.createRoot(RegionCreation.java:237)
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.initializeRegions(CacheCreation.java:658)
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:592)
    at 
org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:338)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4081)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.initializeDeclarativeCache(GemFireCacheImpl.java:1535)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1374)
    at 
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
    at 
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:158)
    at 
org.apache.geode.cache.CacheFactory.create(CacheFactory.java:142)
    at 
org.apache.geode.distributed.internal.DefaultServerLauncherCacheProvider.createCache(DefaultServerLauncherCacheProvider.java:52)
    at 
org.apache.geode.distributed.ServerLauncher.createCache(ServerLauncher.java:894)
    at 
org.apache.geode.distributed.ServerLauncher.start(ServerLauncher.java:809)
    at 
org.apache.geode.distributed.ServerLauncher.run(ServerLauncher.java:739)
    at 
org.apache.geode.distributed.ServerLauncher.main(ServerLauncher.java:256)



This is the cache.xml used:

<cache xmlns="http://geode.apache.org/schema/cache"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://geode.apache.org/schema/cache
        http://geode.apache.org/schema/cache/cache-1.0.xsd"
    version="1.0"
    copy-on-read="true">
  <!-- reconstructed from the archived fragments; the archive stripped most
       tags, so element placement is a best-effort reading -->
  <gateway-sender id="sender" parallel="true"/>
  <region name="_part_hidden">
    <region-attributes data-policy="persistent-partition">
      <partition-attributes redundant-copies="1">
        <partition-resolver>
          <class-name>org.apache.geode.cache.util.StringPrefixPartitionResolver</class-name>
        </partition-resolver>
      </partition-attributes>
    </region-attributes>
  </region>
  <region name="part_a">
    <region-attributes data-policy="partition" gateway-sender-ids="sender">
      <partition-attributes redundant-copies="1" colocated-with="_part_hidden">
        <partition-resolver>
          <class-name>org.apache.geode.cache.util.StringPrefixPartitionResolver</class-name>
        </partition-resolver>
      </partition-attributes>
    </region-attributes>
  </region>
  <!-- ... -->
</cache>



There is nothing explicitly said about this in the documentation, and it is
not clear why this is not allowed.
The non-persistent parallel gateway sender is attached only to the
non-persistent region ‘part_a’ (and not to the persistent region
‘_part_hidden’).


Why is this not allowed by Geode? Is there any way around this issue?

Geode version: 1.12, 1.11

BRs,

Jakov



Non-persistent parallel gateway sender on non-persistent region (collocated with persistent region)

2020-07-06 Thread Jakov Varenina

Hi all,


We are trying to set up a non-persistent parallel gateway sender (‘sender’)
on a non-persistent partitioned region (‘part_a’). This works OK.
But when this same region ‘part_a’ is colocated with another persistent
region (‘_part_hidden’), Geode throws an exception:


Exception in thread "main" 
org.apache.geode.internal.cache.wan.GatewaySenderException: Non 
persistent gateway sender sender can not be attached to persistent 
region /_part_hidden
    at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:461)
    at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:451)
    at 
org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:191)
    at 
org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:177)
    at 
org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1174)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3010)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2869)
    at 
org.apache.geode.internal.cache.xmlcache.RegionCreation.createRoot(RegionCreation.java:237)
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.initializeRegions(CacheCreation.java:658)
    at 
org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:592)
    at 
org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:338)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4081)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.initializeDeclarativeCache(GemFireCacheImpl.java:1535)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1374)
    at 
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
    at 
org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:158)
    at 
org.apache.geode.cache.CacheFactory.create(CacheFactory.java:142)
    at 
org.apache.geode.distributed.internal.DefaultServerLauncherCacheProvider.createCache(DefaultServerLauncherCacheProvider.java:52)
    at 
org.apache.geode.distributed.ServerLauncher.createCache(ServerLauncher.java:894)
    at 
org.apache.geode.distributed.ServerLauncher.start(ServerLauncher.java:809)
    at 
org.apache.geode.distributed.ServerLauncher.run(ServerLauncher.java:739)
    at 
org.apache.geode.distributed.ServerLauncher.main(ServerLauncher.java:256)



This is the cache.xml used:

xmlns="http://geode.apache.org/schema/cache";

   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
   xsi:schemaLocation="http://geode.apache.org/schema/cache 
http://geode.apache.org/schema/cache/cache-1.0.xsd";

   version="1.0"
   copy-on-read="true">
   
   
   
   
   
  
 
   
org.apache.geode.cache.util.StringPrefixPartitionResolver
 
   
 
  
   
   
  
 redundant-copies="1">

   
org.apache.geode.cache.util.StringPrefixPartitionResolver
 
   
 
  
   
   
  
 redundant-copies="1">

   
org.apache.geode.cache.util.StringPrefixPartitionResolver
 
   
 
  
   



The documentation says nothing explicit about this, and it is not clear 
why it is not allowed.
The non-persistent parallel gateway sender is attached only to the 
non-persistent region ‘part_a’ (and not to the persistent region 
‘_part_hidden’).
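
Since the archive stripped the tags, a hedged reconstruction of what this 
cache.xml plausibly looked like follows; the element layout, data policies, 
and values such as remote-distributed-system-id are assumptions inferred 
from the prose, not the original file:

<?xml version="1.0" encoding="UTF-8"?>
<cache xmlns="http://geode.apache.org/schema/cache"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://geode.apache.org/schema/cache
                           http://geode.apache.org/schema/cache/cache-1.0.xsd"
       version="1.0" copy-on-read="true">

   <!-- non-persistent parallel gateway sender -->
   <gateway-sender id="sender" remote-distributed-system-id="2"
                   parallel="true" enable-persistence="false"/>

   <!-- persistent partitioned region, no gateway sender attached -->
   <region name="_part_hidden">
      <region-attributes data-policy="persistent-partition">
         <partition-attributes redundant-copies="1">
            <partition-resolver>
               <class-name>org.apache.geode.cache.util.StringPrefixPartitionResolver</class-name>
            </partition-resolver>
         </partition-attributes>
      </region-attributes>
   </region>

   <!-- non-persistent region carrying the sender, colocated with the persistent one -->
   <region name="part_a">
      <region-attributes data-policy="partition" gateway-sender-ids="sender">
         <partition-attributes redundant-copies="1" colocated-with="_part_hidden">
            <partition-resolver>
               <class-name>org.apache.geode.cache.util.StringPrefixPartitionResolver</class-name>
            </partition-resolver>
         </partition-attributes>
      </region-attributes>
   </region>
</cache>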


Why is this not allowed by Geode? Is there any way around this issue?
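
One reading of the stack trace above (an inference, not a confirmed 
explanation): addShadowPartitionedRegionForUserPR is invoked for 
‘_part_hidden’ as well, which suggests the parallel sender's shadow queue 
region is created for every region in the colocation chain, so the 
non-persistent queue would end up backing the persistent region too. If 
that is the cause, a possible workaround is to declare the sender itself 
persistent, e.g. (remote-distributed-system-id is a placeholder; without 
disk-store-name the default disk store is used):

<gateway-sender id="sender" remote-distributed-system-id="2"
                parallel="true" enable-persistence="true"/>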

Geode version: 1.12, 1.11

BRs,

Jakov



Re: Odg: Certificate Based Authorization

2020-06-23 Thread Jakov Varenina

Hi Jake and all,

Great findings and analysis, Jake! Thank you very much for your effort!

We haven't gone far with the implementation of the solution described in 
the research paper. So it is great that you have found an alternative and 
better solution, but it seems that the attachment with the patch is missing 
from your mail.


Also, we agree with your point that it is better to read the certificate on 
the server side and forward it as a whole to the SecurityManager 
authenticate callback, since that gives users more benefit and flexibility 
than just having the CN and SAN.
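
A minimal sketch of what such a SecurityManager might look like, assuming 
the server forwards the certificate in the credential Properties; the 
property key "security-client-certificate" is invented for illustration, 
while SecurityManager and its authenticate callback are existing Geode API:

import java.security.cert.X509Certificate;
import java.util.Properties;
import org.apache.geode.security.AuthenticationFailedException;
import org.apache.geode.security.SecurityManager;

public class CertificateSecurityManager implements SecurityManager {

  @Override
  public Object authenticate(Properties credentials) throws AuthenticationFailedException {
    // Hypothetical key under which the server would forward the client certificate.
    Object cert = credentials.get("security-client-certificate");
    if (!(cert instanceof X509Certificate)) {
      throw new AuthenticationFailedException("No client certificate was forwarded");
    }
    // The user decides what becomes the principal: the full subject DN here,
    // but the CN alone or a SAN entry would work just as well.
    return ((X509Certificate) cert).getSubjectX500Principal().getName();
  }
}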


Could you please clarify your opinion on the following a bit more?

   > On Jun 19, 2020, at 2:53 PM, Jacob Barrett  wrote:

   > ... Personally I would be inclined to leave RMI out of the solution
   initially. Second I would use this private variable to complete the
   support in OpenJDK.

If I understood correctly, leaving RMI out of the solution would mean one 
of the following scenarios:


1) Geode would have to use the existing username/password authentication and 
authorization feature just for RMI connections, and the new kind of 
certificate auth for all other interfaces. This way the user would still 
have to handle usernames/passwords, which we want to get rid of, and it 
would also complicate the implementation of the SecurityManager 
interface a bit more (the user would have to deal with both certificates 
and usernames/passwords).


2) If Geode doesn't use the username/password feature and the 
certificate-based auth is enabled, then it will have to reject all RMI 
connections, since the clients initiating those RMI connections cannot be 
properly authenticated and authorized on the server side.


BRs/Jakov

On 22. 06. 2020. 23:11, Jacob Barrett wrote:
I went on a little journey to see if it was possible and it looks 
promising. I was able to get access to the SSLSocket and thus the 
SSLSession.


Proof of concept patch attached.



> On Jun 19, 2020, at 2:53 PM, Jacob Barrett  wrote:
>
> So I can see why this research paper was so bleak about the options 
for getting the SSL certificate for the current connection being 
serviced. As they discovered, the accept loop in OpenJDK (and older 
Oracle implementations) immediately fires the RMI operation to a 
thread pool after connecting. This is after the SSLSocket would 
have done the handshake and been passed to any of our validation 
callbacks, so stashing anything in thread-local storage is dead.

>
> Good news is deep in sun.rmi.transport.tcp.TCPTransport there is 
a private ThreadLocal that holds the socket used to establish 
the connection, and this thread local is set before each invocation of 
an RMI operation. The bad news is that it’s private on an internal 
class. I think this is where the age of the research is in our favor. 
Back when it was written we didn’t have OpenJDK. We had 
Oracle, IBM, and a few others. Now with everything pretty much 
converging on OpenJDK I don’t believe it is as nasty to go poke at 
this internal using reflection. I think it is less dirty than their 
nasty trick of utilizing the IPv6 address as a unique identifier in a 
custom Socket.

>
> Once we have the SSLSocket for this connection then we are golden. 
From there you have public API access to the SSLSession.

>
> Looking at the OpenJDK source this class has largely been unchanged 
since its initial import into the repo in 2007. Most importantly the 
private member in question has been, and is still, available in all 
versions of OpenJDK. Sure, this limits us to OpenJDK support for 
certificate-based authentication by SSL handshake via RMI, but in Geode 
that’s really only gfsh. This is a really small surface area. With the 
focus being on converting gfsh activities into REST APIs this surface 
area is shrinking. Personally I would be inclined to leave RMI out of 
the solution initially. Second I would use this private variable to 
complete the support in OpenJDK.

>
> -Jake
>
>
>> On Jun 19, 2020, at 11:14 AM, Jacob Barrett  
wrote:

>>
>>
>>
>>>
>>> On Jun 18, 2020, at 4:24 AM, Jakov Varenina 
mailto:jakov.varen...@est.tech>> wrote:

>>>
>>> In order to completely remove the need for username/password, it 
is required that we implement this new kind of authorization on *all* 
geode interfaces/components (cluster, gateway, web, jmx, locator, 
server). The reason why we didn't have any progress is that we 
faced a major obstacle during development when we tried to retrieve the 
client's certificate from RMI connections (e.g. jmx connections). It 
seems there is no easy/nice way to retrieve it, and what we have come up 
with so far is the following:

>>>
>>> 1) We have found a possible "hack solution" that could be 
implemented; it is described in the following paper 
(http://citese
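
For reference, a minimal proof-of-concept sketch of the reflection approach 
Jacob describes above. The field names ("threadConnectionHandler" on 
sun.rmi.transport.tcp.TCPTransport and "socket" on its ConnectionHandler) 
are assumptions about OpenJDK internals, and on JDK 9+ this would 
additionally require --add-opens java.rmi/sun.rmi.transport.tcp=ALL-UNNAMED:

import java.lang.reflect.Field;
import java.net.Socket;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;

public final class RmiSslSessionAccess {

  /** Returns the SSLSession of the RMI connection serviced by the calling thread, or null. */
  public static SSLSession currentRmiSslSession() throws ReflectiveOperationException {
    Class<?> transport = Class.forName("sun.rmi.transport.tcp.TCPTransport");
    // Assumed private thread-local, set by TCPTransport before each RMI dispatch.
    Field handlerField = transport.getDeclaredField("threadConnectionHandler");
    handlerField.setAccessible(true);
    ThreadLocal<?> handlers = (ThreadLocal<?>) handlerField.get(null);
    Object handler = handlers.get();
    if (handler == null) {
      return null; // calling thread is not servicing an RMI connection
    }
    // Assumed private field holding the connection's socket.
    Field socketField = handler.getClass().getDeclaredField("socket");
    socketField.setAccessible(true);
    Socket socket = (Socket) socketField.get(handler);
    return socket instanceof SSLSocket ? ((SSLSocket) socket).getSession() : null;
  }
}

From the returned SSLSession, getPeerCertificates() then gives the client 
certificate chain for the authentication check.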

Re: Odg: Certificate Based Authorization

2020-06-18 Thread Jakov Varenina

Hi Anthony and all,

I have been working with Mario on this feature. Let me first answer the 
questions:


1) Multi-user authentication will not be supported when using this new 
kind of SecurityManager implementation.


2) The idea was to use only the CN for the principal, and ignore the SAN 
(this would be documented). But we could, as you suggested, forward both, 
or even the whole certificate, and let the user decide which one to use. 
According to RFC 6125 
<https://tools.ietf.org/html/rfc6125#section-6.4.4>, the SAN is not a 
replacement for the CN; they complement each other.
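
As an illustration of forwarding both, a small sketch using plain JDK APIs 
(nothing Geode-specific is assumed) that extracts the CN and the dNSName 
SAN entries from a certificate, leaving the choice of principal to the user:

import java.security.cert.CertificateParsingException;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

final class CertificatePrincipals {

  /** Extracts the CN attribute from the certificate's subject DN, or null if absent. */
  static String commonName(X509Certificate cert) throws InvalidNameException {
    for (Rdn rdn : new LdapName(cert.getSubjectX500Principal().getName()).getRdns()) {
      if ("CN".equalsIgnoreCase(rdn.getType())) {
        return rdn.getValue().toString();
      }
    }
    return null;
  }

  /** Extracts the dNSName (type 2) SubjectAlternativeName entries. */
  static List<String> dnsSubjectAltNames(X509Certificate cert)
      throws CertificateParsingException {
    List<String> names = new ArrayList<>();
    Collection<List<?>> sans = cert.getSubjectAlternativeNames();
    if (sans != null) {
      for (List<?> san : sans) {
        if ((Integer) san.get(0) == 2) {
          names.add((String) san.get(1));
        }
      }
    }
    return names;
  }
}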


In order to completely remove the need for username/password, it is 
required that we implement this new kind of authorization on *all* geode 
interfaces/components (cluster, gateway, web, jmx, locator, server). The 
reason why we didn't have any progress is that we faced a major 
obstacle during development when we tried to retrieve the client's 
certificate from RMI connections (e.g. jmx connections). It seems there 
is no easy/nice way to retrieve it, and what we have come up with so far is 
the following:


1) We have found a possible "hack solution" that could be implemented; it 
is described in the following paper 
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.224.2915&rep=rep1&type=pdf). 
We have started working on a prototype that implements this solution.


2) The second idea is that the client reads the CN (principal) from the 
certificate and sends it towards the server in the same way as the 
username/password is sent now over RMI connections. The downside of this 
solution is that the server doesn't verify the originator's identity, that 
is, it will not compare the received principal with the one in the client 
certificate. With this solution, a client that uses a certificate signed by 
a Geode-trusted Certificate Authority but that has lower authorization 
privileges can "hack" the client implementation and send a principal with 
higher authorization privileges (than those given to it with the 
certificate) over the RMI connection, since that principal will not be 
compared with the one received in the certificate.


We would be really grateful if the community could give us feedback on 
whether any of this is acceptable. Any new solution proposal or hint would 
be really appreciated. We have already sent a question to dev@geode 
regarding this issue: https://www.mail-archive.com/dev@geode.apache.org/msg24083.html


BRs,

Jakov


On 16. 06. 2020. 19:36, Anthony Baker wrote:

Hi Mario, just curious if you’ve made any progress on this as of yet.  I have a 
few questions:

1) What is the implication for multi-user auth? Would this just become a no-op 
for this kind of SecurityManager implementation?  See [1][2].

2) I’m not sure that the CN is sufficiently general.  What if I want to use the 
SAN for the Principal?  Can we forward the entire certificate to the 
authenticate [3] callback?


Anthony

[1] 
https://geode.apache.org/docs/guide/19/basic_config/the_cache/managing_a_multiuser_cache.html
[2] 
https://geode.apache.org/releases/latest/javadoc/org/apache/geode/cache/client/ClientCache.html#createAuthenticatedView-java.util.Properties-
[3] 
https://geode.apache.org/releases/latest/javadoc/org/apache/geode/security/SecurityManager.html#authenticate-java.util.Properties-


On Dec 6, 2019, at 9:06 AM, Jens Deppe (Pivotal) 
mailto:jde...@pivotal.io>> wrote:

Thanks for the write-up. I think it does require a bit of clarification
around how the functionality is enabled.

You've stated:

For client connections, we could presume that certificate based
authorization should be used if both features are enabled, but the client
cache properties don’t provide credentials
(security-username/security-password).


Currently, the presence of any '*auth-init' parameters does not
necessarily require setting *security-username/password* (although almost
all implementations of AuthInitialize probably do use them). So this
condition will not be sufficient to enable this new functionality.

Although we already have so many parameters, I think that having an explicit
parameter to enable this feature will avoid any possible confusion.

I'm wondering whether, for an initial deliverable, we should require
*ssl-enabled-components=all*. This would not allow a mix of different forms
of authentication for different endpoints. Perhaps this might simplify the
implementation but would not preclude us from adding that capability in the
future.

--Jens
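
To make the above concrete, a hypothetical gemfire.properties sketch; the 
ssl-* and security-manager entries are existing Geode properties, while the 
final toggle name is invented purely to illustrate the "explicit parameter" 
idea:

# existing Geode SSL / security properties
ssl-enabled-components=all
ssl-keystore=/etc/geode/keystore.jks
ssl-keystore-password=changeit
ssl-truststore=/etc/geode/truststore.jks
ssl-truststore-password=changeit
ssl-require-authentication=true
security-manager=com.example.CertificateSecurityManager

# hypothetical explicit switch for the proposed feature
security-certificate-authorization=true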

On Fri, Dec 6, 2019 at 1:13 AM Mario Kevo 
mailto:mario.k...@est.tech>> wrote:

Hi all,

I wrote up a proposal for Certificate Based Authorization.
Please review and comment on the below proposal.


https://cwiki.apache.org/confluence/display/GEODE/Certificate+Based+Authorization

BR,
Mario

From: Udo Kohlmeyer 
Sent: 2 December 2019 20:10
To: dev@geode.apache.org 
Subject: Re: Certificate Based Authorization

+1

On 12/2/19 1:29 AM, M

Re: Unable to get behavior described in documentation when using durable native client

2020-04-23 Thread Jakov Varenina

Hi Jacob,

The native durable client reconnects to the servers hosting the queue in 
the following way:


1. It always sends a QueueConnectionRequest (redundant=-1,
   findDurable=false, ClientProxyMembershipID="not set, except uniqueID,
   which is hard-coded to 1") requesting all available servers,
   regardless of the value used in "setSubscriptionRedundancy".
2. The native client then sends its ClientProxyMembershipID in the
   handshake towards the received servers. Based on the
   ClientProxyMembershipID, each server will indicate within the handshake
   whether it is a primary, redundant or non-redundant server for the event queue.
 * if it is the first time the durable client connects, then all servers
   will be non-redundant (none of them hosting the subscription region
   queue). The native client algorithm will then select the primary
   server and the number of redundant servers based on
   "setSubscriptionRedundancy" and perform subscriptions to them
   accordingly.
 * if it is a reconnect, then the native client will get an
   indication in the handshake of which servers are primary and
   redundant, and it will reconnect to them accordingly.
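
For comparison, this is roughly how the same knobs look on the Java client; 
a minimal sketch using the standard Geode Java client API (region name, 
durable id and locator address are placeholders):

import org.apache.geode.cache.InterestResultPolicy;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;

public class DurableClientExample {
  public static void main(String[] args) {
    ClientCache cache = new ClientCacheFactory()
        .set("durable-client-id", "durable-client-1") // identifies the queue on reconnect
        .set("durable-client-timeout", "300")         // seconds the servers keep the queue
        .addPoolLocator("localhost", 10334)
        .setPoolSubscriptionEnabled(true)
        .setPoolSubscriptionRedundancy(1)             // number of redundant queue copies
        .create();

    Region<String, String> region = cache
        .<String, String>createClientRegionFactory(ClientRegionShortcut.CACHING_PROXY)
        .create("example-region");

    // durable interest: events are queued for this client while it is offline
    region.registerInterestRegex(".*", InterestResultPolicy.NONE, true);

    // tell the servers to start delivering queued events
    cache.readyForEvents();
  }
}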

So it seems that the native client documentation below incorrectly 
describes the locator behavior when the native client is used. Maybe it 
would be good to update it to reflect the correct behavior?


https://geode.apache.org/docs/geode-native/cpp/112/connection-pools/subscription-properties.html

    ...

   /When a client registers interest for a region, if the connection
   pool does not already have a subscription channel, the connection
   pool sends a message to the server locator, and the server locator
   chooses servers to host the queue and return those server names to
   the client. The client then contacts the chosen servers and asks
   them to create the queue./

   /For durable subscriptions, the server locator must be able to
   locate the servers that host the queues for the durable client. When
   a durable client sends a request, the server locator queries all the
   available servers to see if they are hosting the subscription region
   queue for the durable client. If the server is located, the client
   is connected to the server hosting the subscription region queue./

BRs,

Jakov

On 20. 04. 2020. 08:10, Jakov Varenina wrote:

Yes I can. IOException is not thrown and the client works in that case.

BRs,

Jakov

On 17. 04. 2020. 16:24, Jacob Barrett wrote:
Can you confirm that when the log level is less than debug, the 
IOException goes away and the client appears to function?


-Jake


On Apr 17, 2020, at 1:12 AM, Jakov Varenina 
 wrote:


Hi Jacob,

Thanks for your response!

Regarding GEODE-7944, "Unable to deserialize *membership id* 
java.io.EOFException" is not logged but thrown, and it breaks the 
processing of the QueueConnectionRequest in the locator. This shows up in 
the native client as "No locators found" even though locators are 
available. It happens only when a native client with subscription enabled 
is used and the locator is started with --log-level=debug.


I haven't had time to test and analyze the native durable client in 
detail yet. So far I could only confirm that when the native durable 
client is used, the locator behaves differently from how it is described 
in the documentation (see previous mail) and from how the java client 
works:


It seems that the native durable client always requests all available 
servers from the locator (redundant=-1, findDurable=false) with a 
QueueConnectionRequest. The locator returns them in a 
QueueConnectionResponse ordered by load (best...worst). For the java 
durable client, by contrast, the locator uses the *membership id* from the 
QueueConnectionRequest to locate the servers that host the client queue 
and sends them back in the QueueConnectionResponse, as described in the 
previous mail. I expect that the native durable client somehow also 
handles reconnection to the same servers' queues, but this has yet to be 
investigated. Any hints or comments related to this would be really 
appreciated.


BRs,

Jakov

On 15. 04. 2020. 10:07, Jacob Barrett wrote:
Looking back at history the native library has always only ever set 
that findDurable flag to false. I traced it back to its initial 
commit. Aside from the annoying log message, does client durable 
connection work correctly?


On Apr 14, 2020, at 10:56 PM, Jakov Varenina 
 wrote:


Hi all,

Could you please help me understand the behavior of the native client 
when configured as durable?


I have been working on bug GEODE-7944 
<https://issues.apache.org/jira/browse/GEODE-7944>, which results in the 
exception "Unable to deserialize membership id java.io.EOFException" on 
the locator, only when debug is enabled. This happens because the native 
client, only when subscription is enabled, sends towards the locator a 
QueueConnectionRequest that doesn't encapsulate the ClientProxyMembershipID 
(it is not properly serialized), and therefore the exception occurs when 
the locator tries to deserialize the membership id to log it at debug level.

Re: Unable to get behavior described in documentation when using durable native client

2020-04-19 Thread Jakov Varenina

Yes I can. IOException is not thrown and the client works in that case.

BRs,

Jakov

On 17. 04. 2020. 16:24, Jacob Barrett wrote:

Can you confirm that when the log level is less than debug, the IOException goes 
away and the client appears to function?

-Jake



On Apr 17, 2020, at 1:12 AM, Jakov Varenina  wrote:

Hi Jacob,

Thanks for your response!

Regarding GEODE-7944, "Unable to deserialize *membership id* java.io.EOFException" is not 
logged but thrown, and it breaks the processing of the QueueConnectionRequest in the locator. 
This shows up in the native client as "No locators found" even though locators are available. 
It happens only when a native client with subscription enabled is used and the locator is 
started with --log-level=debug.

I haven't had time to test and analyze the native durable client in detail yet. So 
far I could only confirm that when the native durable client is used, the locator 
behaves differently from how it is described in the documentation (see previous 
mail) and from how the java client works:

It seems that the native durable client always requests all available servers from 
the locator (redundant=-1, findDurable=false) with a QueueConnectionRequest. The 
locator returns them in a QueueConnectionResponse ordered by load (best...worst). 
For the java durable client, by contrast, the locator uses the *membership id* from 
the QueueConnectionRequest to locate the servers that host the client queue and 
sends them back in the QueueConnectionResponse, as described in the previous mail. 
I expect that the native durable client somehow also handles reconnection to the 
same servers' queues, but this has yet to be investigated. Any hints or comments 
related to this would be really appreciated.

BRs,

Jakov

On 15. 04. 2020. 10:07, Jacob Barrett wrote:

Looking back at history the native library has always only ever set that 
findDurable flag to false. I traced it back to its initial commit. Aside from 
the annoying log message, does client durable connection work correctly?


On Apr 14, 2020, at 10:56 PM, Jakov Varenina  wrote:

Hi all,

Could you please help me understand the behavior of the native client when 
configured as durable?

I have been working on bug GEODE-7944 
<https://issues.apache.org/jira/browse/GEODE-7944>, which results in the exception 
"Unable to deserialize membership id java.io.EOFException" on the locator, only when 
debug is enabled. This happens because the native client, only when subscription is 
enabled, sends towards the locator a QueueConnectionRequest that doesn't encapsulate 
the ClientProxyMembershipID (it is not properly serialized), and therefore the 
exception occurs when the locator tries to deserialize the membership id to log it 
at debug level.

I was trying to figure out why the locator would need the ClientProxyMembershipID from 
the native client and found the following paragraph in the documentation (copied from 
https://geode.apache.org/docs/geode-native/cpp/112/connection-pools/subscription-properties.html):

   /For durable subscriptions, the server locator must be able to
   locate the servers that host the queues for the durable client. When
   a durable client sends a request, the server locator queries all the
   available servers to see if they are hosting the subscription region
   queue for the durable client. If the server is located, the client
   is connected to the server hosting the subscription region queue./

The locator behaves as described in the above paragraph only when it receives a 
QueueConnectionRequest with the findDurable flag set to "true" and with a valid 
membership id. I noticed that unlike the java client, the native client always sets 
findDurable to "false", and therefore the locator will never behave as described in 
the above paragraph when the native client is used.

Does anybody know why the native client always sets findDurable=false?

BRs,

Jakov


Re: Unable to get behavior described in documentation when using durable native client

2020-04-17 Thread Jakov Varenina

Hi Jacob,

Thanks for your response!

Regarding GEODE-7944, "Unable to deserialize *membership id* 
java.io.EOFException" is not logged but thrown, and it breaks the 
processing of the QueueConnectionRequest in the locator. This shows up in 
the native client as "No locators found" even though locators are 
available. It happens only when a native client with subscription enabled 
is used and the locator is started with --log-level=debug.


I haven't had time to test and analyze the native durable client in detail 
yet. So far I could only confirm that when the native durable client is 
used, the locator behaves differently from how it is described in the 
documentation (see previous mail) and from how the java client works:


It seems that the native durable client always requests all available 
servers from the locator (redundant=-1, findDurable=false) with a 
QueueConnectionRequest. The locator returns them in a 
QueueConnectionResponse ordered by load (best...worst). For the java 
durable client, by contrast, the locator uses the *membership id* from the 
QueueConnectionRequest to locate the servers that host the client queue 
and sends them back in the QueueConnectionResponse, as described in the 
previous mail. I expect that the native durable client somehow also 
handles reconnection to the same servers' queues, but this has yet to be 
investigated. Any hints or comments related to this would be really 
appreciated.


BRs,

Jakov

On 15. 04. 2020. 10:07, Jacob Barrett wrote:

Looking back at history the native library has always only ever set that 
findDurable flag to false. I traced it back to its initial commit. Aside from 
the annoying log message, does client durable connection work correctly?


On Apr 14, 2020, at 10:56 PM, Jakov Varenina  wrote:

Hi all,

Could you please help me understand the behavior of the native client when 
configured as durable?

I have been working on bug GEODE-7944 
<https://issues.apache.org/jira/browse/GEODE-7944>, which results in the exception 
"Unable to deserialize membership id java.io.EOFException" on the locator, only when 
debug is enabled. This happens because the native client, only when subscription is 
enabled, sends towards the locator a QueueConnectionRequest that doesn't encapsulate 
the ClientProxyMembershipID (it is not properly serialized), and therefore the 
exception occurs when the locator tries to deserialize the membership id to log it 
at debug level.

I was trying to figure out why the locator would need the ClientProxyMembershipID from 
the native client and found the following paragraph in the documentation (copied from 
https://geode.apache.org/docs/geode-native/cpp/112/connection-pools/subscription-properties.html):

   /For durable subscriptions, the server locator must be able to
   locate the servers that host the queues for the durable client. When
   a durable client sends a request, the server locator queries all the
   available servers to see if they are hosting the subscription region
   queue for the durable client. If the server is located, the client
   is connected to the server hosting the subscription region queue./

The locator behaves as described in the above paragraph only when it receives a 
QueueConnectionRequest with the findDurable flag set to "true" and with a valid 
membership id. I noticed that unlike the java client, the native client always sets 
findDurable to "false", and therefore the locator will never behave as described in 
the above paragraph when the native client is used.

Does anybody know why the native client always sets findDurable=false?

BRs,

Jakov


Unable to get behavior described in documentation when using durable native client

2020-04-14 Thread Jakov Varenina

Hi all,

Could you please help me understand the behavior of the native client when 
configured as durable?


I have been working on bug GEODE-7944 
<https://issues.apache.org/jira/browse/GEODE-7944>, which results in the 
exception "Unable to deserialize membership id java.io.EOFException" on 
the locator, only when debug is enabled. This happens because the native 
client, only when subscription is enabled, sends towards the locator a 
QueueConnectionRequest that doesn't encapsulate the ClientProxyMembershipID 
(it is not properly serialized), and therefore the exception occurs when 
the locator tries to deserialize the membership id to log it at debug level.


I was trying to figure out why the locator would need the 
ClientProxyMembershipID from the native client and found the following 
paragraph in the documentation (copied from 
https://geode.apache.org/docs/geode-native/cpp/112/connection-pools/subscription-properties.html):



   /For durable subscriptions, the server locator must be able to
   locate the servers that host the queues for the durable client. When
   a durable client sends a request, the server locator queries all the
   available servers to see if they are hosting the subscription region
   queue for the durable client. If the server is located, the client
   is connected to the server hosting the subscription region queue./

The locator behaves as described in the above paragraph only when it 
receives a QueueConnectionRequest with the findDurable flag set to "true" 
and with a valid membership id. I noticed that unlike the java client, the 
native client always sets findDurable to "false", and therefore the 
locator will never behave as described in the above paragraph when the 
native client is used.


Does anybody know why the native client always sets findDurable=false?

BRs,

Jakov


Re: Request access to JIRA

2020-03-13 Thread Jakov

Thank you Anthony!

BRs/Jakov

On 13. 03. 2020. 13:35, Anthony Baker wrote:

Welcome Jakov!  You’re all set.  Please let us know if you have questions, 
happy to help out!

Anthony



On Mar 13, 2020, at 4:50 AM, Jakov  wrote:

Hi Geode Dev,

Could you please give me access to JIRA, so I can assign myself to tickets?
My JIRA username is "jvarenina".

BRs,

Jakov



Request access to JIRA

2020-03-13 Thread Jakov

Hi Geode Dev,

Could you please give me access to JIRA, so I can assign myself to tickets?
My JIRA username is "jvarenina".

BRs,

Jakov