Re: Colocated regions missing some buckets after restart

Mario Kevo Wed, 16 Sep 2020 06:30:50 -0700

Hi Anil,

From the server logs we see that some threads are stuck, and on server2 we 
continuously get the following message (a bucket is missing on server2 for the 
DfSessions region):
[warn 2020/09/15 14:25:39.852 CEST <PartitionedRegion Message Processor2> 
tid=0x251] 15 secs have elapsed waiting for a primary for bucket [BucketAdvisor 
/__PR/_B__DfSessions_18:935: state=VOLUNTEERING_HOSTING]. Current bucket owners 
[]



And on the other server, server1:
[warn 2020/09/15 14:25:40.852 CEST <ResourceManagerRecoveryThread 1> tid=0xdf] 
15 seconds have elapsed while waiting for replies: 
<FetchPartitionDetailsMessage$FetchPartitionDetailsResponse 3808 waiting for 1 
replies from [192.168.0.145(server2:28054)<v6>:41003]> on 
192.168.0.145(server1:28031)<v5>:41002 whose current membership list is: 
[[192.168.0.145(locator1:27244:locator)<ec><v0>:41000, 
192.168.0.145(locator2:27343:locator)<ec><v1>:41001, 
192.168.0.145(server1:28031)<v5>:41002, 192.168.0.145(server2:28054)<v6>:41003]]

[warn 2020/09/15 14:27:20.200 CEST <ThreadsMonitor> tid=0x11] Thread 223 (0xdf) 
is stuck

[warn 2020/09/15 14:27:20.202 CEST <ThreadsMonitor> tid=0x11] Thread <223> 
(0xdf) that was executed at <15 Sep 2020 14:25:24 CEST> has been stuck for 
<115.361 seconds> and number of thread monitor iteration <1>
Thread Name <ResourceManagerRecoveryThread 1> state <TIMED_WAITING>
...
It seems that this is not a problem with the stats.
We suspect that the problem is with some lock, but we need to investigate it a 
bit more.

BR,
Mario



________________________________
From: Anilkumar Gingade <[email protected]>
Sent: 15 September 2020 16:36
To: [email protected] <[email protected]>
Subject: Re: Colocated regions missing some buckets after restart

Mario,

I doubt this has anything to do with the client connections. If it is 
connection-related, it would be a server/member-to-server/member connection; in 
that case the unresponsive member is kicked out of the cluster.

The recommended configuration is to have persistent regions for both parent 
and co-located regions (and replicated regions)...

There could be issues in the stats too... Can you try executing some 
test/validation code on the server side to dump/list the primary and secondary 
buckets? You can do that using helper methods such as: 
pr.getDataStore().getAllLocalPrimaryBucketIds();
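
A minimal sketch of how such a dump could be organised once the bucket-id sets 
have been fetched on the server. The primary set would come from the helper 
above; the full local set is assumed to come from a similar call on the data 
store (for example getAllLocalBucketIds(), which is an assumption here). The 
class name BucketDump, the method secondaries and the ids are hypothetical and 
only there so the example runs without a cluster.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class BucketDump {
    // Secondary buckets are the locally hosted buckets that are not primary on this member.
    static Set<Integer> secondaries(Set<Integer> allLocal, Set<Integer> primaries) {
        Set<Integer> result = new TreeSet<>(allLocal);
        result.removeAll(primaries);
        return result;
    }

    public static void main(String[] args) {
        // Made-up ids for illustration; on a real server these sets would be
        // read from the partitioned region's data store instead.
        Set<Integer> allLocal = new TreeSet<>(Arrays.asList(0, 1, 2, 3, 18));
        Set<Integer> primaries = new TreeSet<>(Arrays.asList(0, 2));

        System.out.println("primary:   " + primaries);
        System.out.println("secondary: " + secondaries(allLocal, primaries));
    }
}
```

Running the same dump on every server and comparing the lists would show which 
bucket ids have no primary anywhere.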

-Anil

On 9/14/20, 12:25 AM, "Mario Kevo" <[email protected]> wrote:

    Hi,


    This problem is usually seen on only one server. The other servers' metrics 
and bucket counts look fine. Another symptom of this issue is that the 
max-connections limit is reached on the problematic server if we have a client 
that tries to reconnect after the server restart. Clients simply get no 
response from the server, so they try to close the connection, but the 
connection close is not acknowledged by the server. On the server side we see 
that the connections are in CLOSE-WAIT state with packets in the socket receive 
queue. It’s as if the server just stopped processing packets on the sockets 
while waiting for a member with the primary bucket.



    So in short, each new client connection is “unresponsive”. The client tries 
to close it and open a new one, but the socket doesn’t get closed on the server 
side and the connection is left “hanging” on the server. Clients will keep 
doing this until max-connections is reached on the servers. This is why we 
would be unable to add any data to the regions. But IMHO it’s really not 
dependent on adding data, since this issue happens occasionally (1 out of ~4 
restarts) and only on one server.



    The initial problem was observed with a persistent region A (with 10000 
key-value pairs inserted) and a non-persistent region B colocated with region 
A. We did some tests with both regions being persistent. We haven’t observed 
the same issue yet (although we did only a few restarts), but we observed 
something that also looks quite worrying. Both servers start up without 
reporting issues in the logs. But, looking at the server metrics, one server 
reports a wrong “bucketCount” and is missing primary buckets. E.g.:


    First server:

    Partition | putLocalRate                 | 0.0
              | putRemoteRate                | 0.0
              | putRemoteLatency             | 0
              | putRemoteAvgLatency          | 0
              | bucketCount                  | 113
              | primaryBucketCount           | 57

    Second server:

    Partition | putLocalRate                 | 0.0
              | putRemoteRate                | 0.0
              | putRemoteLatency             | 0
              | putRemoteAvgLatency          | 0
              | bucketCount                  | 111
              | primaryBucketCount           | 55


    So we are missing a primary bucket without being aware of the issue.
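
The arithmetic behind that conclusion can be sketched as follows. It assumes 
the region has 113 buckets in total (taken from the larger bucketCount reported 
above) and that every bucket should have exactly one primary somewhere in the 
cluster; the class and method names are hypothetical.

```java
public class PrimaryCountCheck {
    // Each bucket should have exactly one primary in the cluster, so the
    // per-server primary counts should add up to the total number of buckets.
    static int missingPrimaries(int totalBuckets, int... primaryCounts) {
        int sum = 0;
        for (int count : primaryCounts) {
            sum += count;
        }
        return totalBuckets - sum;
    }

    public static void main(String[] args) {
        // 113 total buckets is an assumption; 57 and 55 are the reported
        // primaryBucketCount values of the first and second server.
        System.out.println(missingPrimaries(113, 57, 55)); // prints 1
    }
}
```

A non-zero result means that some bucket currently has no primary, even though 
neither server logged a problem at startup.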

    BR,
    Mario

    ________________________________
    From: Anilkumar Gingade <[email protected]>
    Sent: 11 September 2020 20:34
    To: [email protected] <[email protected]>
    Subject: Re: Colocated regions missing some buckets after restart

    Are you seeing no buckets for persistent regions or for non-persistent 
ones? Buckets are created dynamically, when data is added to the corresponding 
bucket...
    When a server is restarted, in the case of in-memory regions the data is no 
longer there, so the bucket region may not have been created (my suspicion).
    Can you try adding data and see whether the co-located bucket region gets 
created on the respective nodes/servers?

    -Anil.


    On 9/11/20, 9:46 AM, "Mario Kevo" <[email protected]> wrote:

        Hi geode-dev,

        We have a system with two servers and a few regions. One region is 
persistent and the others are not, but they are colocated with this persistent 
region.
        After the servers restart, we can see that some regions don't have any 
buckets.
        gfsh>show metrics --member=server-1 --region=/region1 --categories=partition
        Metrics for region:/region1 On Member server-1


        Category  |            Metric            | Value
        --------- | ---------------------------- | -----
        partition | putLocalRate                 | 0.0
                  | putRemoteRate                | 0.0
                  | putRemoteLatency             | 0
                  | putRemoteAvgLatency          | 0
                  | bucketCount                  | 0
                  | primaryBucketCount           | 0
                  | configuredRedundancy         | 1
                  | actualRedundancy             | 0
                  | numBucketsWithoutRedundancy  | 113
                  | totalBucketSize              | 0

        gfsh>show metrics --member=server-0 --region=/region1 --categories=partition
        Metrics for region:/region1 On Member server-0

        Category  |            Metric            | Value
        --------- | ---------------------------- | -----
        partition | putLocalRate                 | 0.0
                  | putRemoteRate                | 0.0
                  | putRemoteLatency             | 0
                  | putRemoteAvgLatency          | 0
                  | bucketCount                  | 113
                  | primaryBucketCount           | 56
                  | configuredRedundancy         | 1
                  | actualRedundancy             | 0
                  | numBucketsWithoutRedundancy  | 113
                  | totalBucketSize              | 0


        The persistent region is OK, but some of the colocated regions have 
this issue. We also waited for some time, but it didn't change.

        Does anyone have an idea what is causing this issue?
        It can be easily reproduced with two locators, two servers, one 
persistent region and a few non-persistent regions colocated with the 
persistent one.
        After restarting both servers, running the show metrics command shows 
this issue for some regions.

        BR,
        Mario
