Re: Colocated regions missing some buckets after restart

2020-09-28 Thread Mario Kevo
Hi Donal,

Sometimes you need to restart two or three times, but usually it reproduces on 
the first restart.
I start the locators and servers with:
start locator --name=locator1 --port=10334
start locator --name=locator2 --port=10335 --locators=localhost[10334]
start server --name=server1 --locators=127.0.0.1[10334],127.0.0.1[10335] --server-port=40404
start server --name=server2 --locators=127.0.0.1[10334],127.0.0.1[10335] --server-port=40405
I'm putting 1 entries, but you can use a lower value.

You need to be really quick with the commands. Here is an example from my 
locator log:
[info 2020/09/29 07:41:52.060 CEST  tid=0x1d] Received a join request from 192.168.0.145(server4:22852):41002
[info 2020/09/29 07:41:52.406 CEST  tid=0x1d] Received a join request from 192.168.0.145(server3:22879):41003
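The two join requests in this excerpt are 346 ms apart. A minimal sketch for measuring that gap from any two log lines follows; it assumes only the timestamp layout visible in the excerpt, and the function name is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp in a log prefix such as "[info 2020/09/29 07:41:52.060 CEST ...".
# Assumption: every line of interest starts with "[<level> <date> <time>".
_TIMESTAMP = re.compile(r"\[\w+ (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def join_request_gap_ms(first_line, second_line):
    """Return the time between two log lines in milliseconds."""
    stamps = []
    for line in (first_line, second_line):
        match = _TIMESTAMP.search(line)
        if match is None:
            raise ValueError(f"no timestamp found in: {line!r}")
        stamps.append(datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f"))
    return (stamps[1] - stamps[0]) / timedelta(milliseconds=1)


# The two lines from the locator log excerpt, each joined onto a single line.
log_lines = [
    "[info 2020/09/29 07:41:52.060 CEST  tid=0x1d] Received a join request "
    "from 192.168.0.145(server4:22852):41002",
    "[info 2020/09/29 07:41:52.406 CEST  tid=0x1d] Received a join request "
    "from 192.168.0.145(server3:22879):41003",
]
gap_ms = join_request_gap_ms(log_lines[0], log_lines[1])
print(gap_ms)  # 346.0
```

If your own locator log shows a much larger gap between the two join requests, the servers were probably not started close enough together.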

I prepare the start server commands in two terminals, so I can start the 
servers at almost the same time.
Sorry, I forgot to mention that you need to see which server stopped first 
and start that one first (the issue was first reproduced on Kubernetes, and 
that is how the pods restart the servers).
Also, if you are not able to reproduce the issue, try creating 10 or more 
colocated regions.
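Instead of switching between two terminals, the two start commands can be launched from one script. The sketch below is only an illustration: it assumes gfsh is on the PATH and accepts a command through its -e option, and the helper name is made up. The start server commands are the ones listed earlier in this message.

```python
import subprocess
import sys


def launch_together(commands):
    """Start every command before waiting on any, then return exit codes in order."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


# Assumption: gfsh is on PATH and runs a single command passed with -e.
LOCATORS = "127.0.0.1[10334],127.0.0.1[10335]"
SERVER_COMMANDS = [
    ["gfsh", "-e",
     f"start server --name=server1 --locators={LOCATORS} --server-port=40404"],
    ["gfsh", "-e",
     f"start server --name=server2 --locators={LOCATORS} --server-port=40405"],
]

# Stand-in commands so the sketch runs without a Geode installation.
# For a real run, pass SERVER_COMMANDS instead, ordered so that the server
# that stopped first is also started first.
exit_codes = launch_together([[sys.executable, "-c", "pass"]] * 2)
print(exit_codes)  # [0, 0]
```

Because both processes are created before either is waited on, the gap between the two starts is limited to process creation time rather than how fast you can change windows.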

BR,
Mario


Re: Colocated regions missing some buckets after restart

2020-09-28 Thread Donal Evans
Hi Mario,

I tried to reproduce the issue using the steps you describe, but I wasn't able 
to. After restarting the servers, all regions have the expected 113 buckets, 
and the server startup process is not noticeably slower. I have a few questions 
that might help understand why I'm unable to reproduce this:

  *   Do you see this behaviour 100% of the time with these steps, or is it 
still only on some restarts that it shows up?
  *   Could you describe in more detail how exactly you're starting the 
locators/servers? I'm just using the gfsh "start locator" and "start server" 
commands, only specifying ports, with no other settings, so if you're doing 
anything different that may be a factor.
  *   How many entries are you putting into the region, and does the issue 
still reproduce if you use fewer entries? I'm using 1 entries as described 
in your earlier email.
  *   How quick do you have to be when restarting the servers in the two 
terminals at the same time? I'm currently just manually clicking between them 
and executing the two start server commands within a second of each other, but 
if that's not fast enough then I should probably be using a script or something.

Hopefully if we can understand what's different between what I'm doing and what 
you're doing then it will help us understand exactly what's going wrong.

- Donal


Re: [DISCUSS] One more 1.13 change

2020-09-28 Thread Joris Melchior
+1

On 2020-09-28, 3:21 PM, "Dan Smith"  wrote:

Hi,

I'd like to backport this change to support/1.13 as well

GEODE-8522: Switching exception log back to debug - 
https://github.com/apache/geode/pull/5566

This cleans up some noise in our logs that customers might see.

GEODE-8522: Switching exception log back to debug (merge to 1.13) by 
upthewaterspout · Pull Request #5566 · 
apache/geode
This log message happens during the course of normal startup of multiple 
locators. We should not be logging a full stack trace during normal startup. 
(cherry picked from commit 3df057c) Thank you f...
github.com




Re: [DISCUSS] One more 1.13 change

2020-09-28 Thread Mario Kevo
+1



Re: [DISCUSS] One more 1.13 change

2020-09-28 Thread Patrick Johnson
+1



Re: [DISCUSS] One more 1.13 change

2020-09-28 Thread John Blum
+1




[DISCUSS] One more 1.13 change

2020-09-28 Thread Dan Smith
Hi,

I'd like to backport this change to support/1.13 as well

GEODE-8522: Switching exception log back to debug - 
https://github.com/apache/geode/pull/5566

This cleans up some noise in our logs that customers might see.
GEODE-8522: Switching exception log back to debug (merge to 1.13) by 
upthewaterspout · Pull Request #5566 · 
apache/geode
This log message happens during the course of normal startup of multiple 
locators. We should not be logging a full stack trace during normal startup. 
(cherry picked from commit 3df057c) Thank you f...
github.com



Re: [PROPOSAL] Backport usability improvements to support 1.13 branch

2020-09-28 Thread Dan Smith
+1 - I had a question on the PR itself about how we're merging these.

-Dan

From: Joris Melchior 
Sent: Thursday, September 24, 2020 8:13 AM
To: dev@geode.apache.org 
Subject: Re: [PROPOSAL] Backport usability improvements to support 1.13 branch

+1

On 2020-09-23, 7:23 PM, "Jason Huynh"  wrote:

Hello,

I’d like to merge the pull request 
https://github.com/apache/geode/pull/5524
into a support 1.13 branch.  The commits are focused on a few usability 
improvements for Geode that were thought to have made it into 1.13 but actually 
did not make it.

What this pull request back ports:

  *   GEODE-8203: Logging to std out along with to the regular log file
  *   GEODE-8283: Rest API for disk store creation
  *   GEODE-8200: Fix for Rebalance API stuck “IN_PROGRESS” state forever 
and GEODE-8200: Enhance GfshRule
  *   GEODE-8241: Locator observers locator-wait-time
  *   GEODE-8078: Log and report error at the correct place


The PR pipeline is failing due to Redis tests (that I don’t think are on 
1.13).  Everything else appears to be passing.

Thanks,
-Jason




Re: Colocated regions missing some buckets after restart

2020-09-28 Thread Mario Kevo
Hi all,

After more investigation I found that, for some buckets, there is a problem 
determining which server is primary.
In getPrimary, if the existing primary is null, it waits for a new primary and 
after some time returns null.

From what I found, during setHosting 
(grabBucket[PartitionedRegionDataStore.java]->grabFreeBucket[PartitionedRegionDataStore.java]->setHosting[ProxyBucketRegion.java]->setHosting[BucketAdvisor.java]) 
the member volunteers for primary and calls sendProfileUpdate to all other 
servers.
There it calls BucketProfileUpdateMessage.send and gets stuck, because it 
cannot get a response from the other members.

A ticket has been opened on GEODE: https://issues.apache.org/jira/browse/GEODE-8546
How to reproduce the issue:

  1.   Start two locators and two servers.
  2.   Create a PARTITION_REDUNDANT_PERSISTENT region with redundant-copies=1.
  3.   Create a few PARTITION_REDUNDANT regions (I used six) colocated with the 
persistent region, also with redundant-copies=1.
  4.   Put some entries.
  5.   Restart the servers (you can simply run "kill -15 " and then start both 
of them at the same time from two terminals).
  6.   Server startup will take a long time to finish, and for the last region 
bucketCount will be zero on one member.
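Steps 2 and 3 can be scripted by generating the gfsh commands. This is a sketch under assumptions: the region names are invented, and the option names (--type, --redundant-copies, --colocated-with) are the gfsh create region options as I understand them, so check them against your gfsh version before relying on it.

```python
def colocated_region_commands(parent="parent", children=6):
    """Build gfsh 'create region' commands for one persistent parent region
    and `children` regions colocated with it (region names are hypothetical)."""
    commands = [
        f"create region --name={parent} "
        "--type=PARTITION_REDUNDANT_PERSISTENT --redundant-copies=1"
    ]
    for index in range(1, children + 1):
        commands.append(
            f"create region --name=child{index} --type=PARTITION_REDUNDANT "
            f"--colocated-with=/{parent} --redundant-copies=1"
        )
    return commands


region_commands = colocated_region_commands()
for command in region_commands:
    print(command)
```

With the defaults this prints seven commands: the persistent parent first, then six colocated children. Passing children=10 or more covers the case where six regions are not enough to reproduce the issue.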

If someone with more experience with bucket initialization has time to help 
me with this, I would appreciate it.
If you need any more info, please contact me.

BR,
Mario



From: Mario Kevo 
Sent: September 17, 2020 15:00
To: dev@geode.apache.org 
Subject: Re: Colocated regions missing some buckets after restart

Hi Anil,

The thread dump is attached.
So far we have found one difference between the server logs: the server that 
has this problem logs "Colocation is incomplete".
So it seems that colocation has not finished for this region on this member. 
This part of the code can be found at this 
link.
We will continue investigating this and try to find what causes the issue.

BR,
Mario


From: Anilkumar Gingade 
Sent: September 16, 2020 16:55
To: dev@geode.apache.org 
Subject: Re: Colocated regions missing some buckets after restart

Mario,

Take a thread dump a couple of times, at an interval of a minute... See if you 
can find threads stuck in region creation... This will show whether there is 
any lock contention.
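Taking the dumps at a fixed interval can be automated. The sketch below is one possible way, not a prescribed procedure: the helper names are made up, and the real-use comment assumes the JDK's jstack tool is on the PATH and that you know the server's process ID.

```python
import subprocess
import sys
import time


def capture_dumps(command, count=3, interval_s=60.0):
    """Run `command` `count` times, `interval_s` seconds apart; return each stdout."""
    dumps = []
    for attempt in range(count):
        result = subprocess.run(command, capture_output=True, text=True, check=False)
        dumps.append(result.stdout)
        if attempt < count - 1:
            time.sleep(interval_s)
    return dumps


def lines_in_every_dump(dumps, needle):
    """Return lines containing `needle` that occur in all dumps.
    These are only candidates for stuck threads, not proof."""
    per_dump = [
        {line for line in dump.splitlines() if needle in line} for dump in dumps
    ]
    return sorted(set.intersection(*per_dump)) if per_dump else []


# Real use (assumption: jstack is on PATH; SERVER_PID is hypothetical):
#   dumps = capture_dumps(["jstack", str(SERVER_PID)])
# Stand-in command so the sketch runs anywhere:
dumps = capture_dumps(
    [sys.executable, "-c", "print('worker-1 waiting')"], count=2, interval_s=0
)
print(lines_in_every_dump(dumps, "waiting"))  # ['worker-1 waiting']
```

A frame that shows up unchanged in every dump a minute apart is worth a closer look for lock contention.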

-Anil.


On 9/16/20, 6:29 AM, "Mario Kevo"  wrote:

Hi Anil,

From the server logs we see that some threads are stuck, and on server2 we 
continuously get the following message (the bucket is missing on server2 for 
the DfSessions region):
[warn 2020/09/15 14:25:39.852 CEST  tid=0x251] 15 secs have elapsed waiting for a primary for bucket [BucketAdvisor /__PR/_B__DfSessions_18:935: state=VOLUNTEERING_HOSTING]. Current bucket owners []


And on the other server, server1:
[warn 2020/09/15 14:25:40.852 CEST  
tid=0xdf] 15 seconds have elapsed while waiting for replies: 
:41003]> on 
192.168.0.145(server1:28031):41002 whose current membership list is: 
[[192.168.0.145(locator1:27244:locator):41000, 
192.168.0.145(locator2:27343:locator):41001, 
192.168.0.145(server1:28031):41002, 192.168.0.145(server2:28054):41003]]

[warn 2020/09/15 14:27:20.200 CEST  tid=0x11] Thread 223 
(0xdf) is stuck

[warn 2020/09/15 14:27:20.202 CEST  tid=0x11] Thread <223> 
(0xdf) that was executed at <15 Sep 2020 14:25:24 CEST> has been stuck for 
<115.361 seconds> and number of thread monitor iteration <1>
Thread Name  state 
...
It seems that this is not a problem with the stats.
We suspect that the problem is with some lock, but we need to investigate it 
a bit more.

BR,
Mario




From: Anilkumar Gingade 
Sent: September 15, 2020 16:36
To: dev@geode.apache.org 
Subject: Re: Colocated regions missing some buckets after restart

Mario,

I doubt this has anything to do with the client connections. If it is a 
connection problem, it would be in the server/member to server/member 
connections; in that case the unresponsive member is kicked out of the cluster.

The recommended configuration is to have persistent regions for both the 
parent and the co-located regions (and replicated regions)...

There could be issues in the stats too... Can you try executing some 
test/validation code on the server side to dump/list the primary and secondary 
buckets?
You can do that using helper methods: 
pr.getDataStore().getAllLocalPrimaryBucketIds();
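
Once each member's primary bucket IDs have been dumped this way, finding buckets with no primary is a set difference. A minimal sketch, assuming the default of 113 buckets per partitioned region and bucket IDs numbered from 0; the function name and the sample data are invented for illustration.

```python
def buckets_without_primary(primaries_by_member, total_buckets=113):
    """Return the sorted bucket IDs that no member reports as primary.

    primaries_by_member maps a member name to the bucket IDs it listed as
    local primaries (for example, collected from getAllLocalPrimaryBucketIds()).
    """
    seen = set()
    for bucket_ids in primaries_by_member.values():
        seen.update(bucket_ids)
    return sorted(set(range(total_buckets)) - seen)


# Made-up data: server1 holds the odd buckets, server2 the even ones except 18.
sample = {
    "server1": [b for b in range(113) if b % 2 == 1],
    "server2": [b for b in range(113) if b % 2 == 0 and b != 18],
}
missing = buckets_without_primary(sample)
print(missing)  # [18]
```

Running the same check on the secondary bucket lists would show whether redundancy is also affected.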

-Anil

On 9/14/20, 12:25 AM, "Mario Kevo"  wrote:

Hi,


This problem is usually seen on only one server. The other servers' 
metrics and bucket counts look fine. Another symptom of this issue is that the 
max-connections limit is reached on the problematic server if we have a client 
that tries to reconnect after the server restart. Clients