[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387644#comment-17387644 ]

Aaron Lindsey edited comment on GEODE-8200 at 7/26/21, 11:01 PM:
-----------------------------------------------------------------

[~agingade] We first saw this issue on July 16 while testing Geode at commit 
[https://github.com/apache/geode/commit/8b7a1a242290523310080a13338c1f85a283c684].
 The issue was discovered in an automated test which had been passing 
consistently for some time. We have only seen the issue happen with the 
"restore redundancy" operation, but I cannot say for sure that the issue does 
not happen for the "rebalance" operation as well.

We have a closed-source test that reproduces the issue, though not on every 
run. The test does a rolling restart of the Geode cluster, restarting at most 
one locator and one server at a time. We have a Kubernetes hook which runs 
"restore redundancy" right before a server is stopped to reduce the chance of 
data loss. The hook is implemented so that the "restore redundancy" operation 
must succeed before the server can be stopped. Note that this is the same 
scenario as the one described in the original ticket description, except that 
we now use "restore redundancy" instead of "rebalance".



> Rebalance operations stuck in "IN_PROGRESS" state forever
> ---------------------------------------------------------
>
>                 Key: GEODE-8200
>                 URL: https://issues.apache.org/jira/browse/GEODE-8200
>             Project: Geode
>          Issue Type: Bug
>          Components: management
>            Reporter: Aaron Lindsey
>            Assignee: Jianxia Chen
>            Priority: Major
>              Labels: GeodeOperationAPI, blocks-1.15.0
>             Fix For: 1.13.1, 1.14.0
>
>         Attachments: GEODE-8200-exportedLogs.zip
>
>
> We use the management REST API to call rebalance immediately before stopping 
> a server to limit the possibility of data loss. In a cluster with 3 locators, 
> 3 servers, and no regions, we noticed that sometimes the rebalance operation 
> never ends if one of the locators is restarting concurrently with the 
> rebalance operation.
> More specifically, the scenario where we see this issue crop up is during an 
> automated "rolling restart" operation in a Kubernetes environment which 
> proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online 
> before restarting
> * Immediately before stopping a server, a rebalance operation is performed 
> and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation will never 
> complete, because it cannot proceed with stopping a server until the 
> rebalance operation is completed. A human is then required to intervene and 
> manually trigger a rebalance and stop the server. This type of "rolling 
> restart" operation is triggered fairly often in Kubernetes — any time part of 
> the configuration of the locators or servers changes. 
> The following JSON is a sample response from the management REST API that 
> shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
>     {
>       "statusCode": "IN_PROGRESS",
>       "links": {
>         "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>         "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>       },
>       "operationStart": "2020-05-27T22:38:30.619Z",
>       "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>       "operation": {
>         "simulate": false
>       }
>     }
> {code}
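
For completeness, the stuck state is visible to any client that re-reads the 
operation's "self" link shown above. The following is a minimal sketch of such 
a check; the ten-minute threshold and the string-based JSON handling are 
assumptions made for illustration only.

{code}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

// Re-reads a rebalance operation's "self" link and reports it as stuck if it
// has been IN_PROGRESS longer than a threshold. The fields used (statusCode,
// operationStart, operationId) are the ones in the sample response above.
public class StuckRebalanceCheck {
  public static void main(String[] args) throws Exception {
    // "self" link from the sample response above (hypothetical cluster).
    String self = "http://geodecluster-sample-locator.default/management/v1/"
        + "operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";

    String body = HttpClient.newHttpClient().send(
        HttpRequest.newBuilder(URI.create(self)).GET().build(),
        HttpResponse.BodyHandlers.ofString()).body();

    // Crude field extraction for illustration; real code would parse the JSON.
    boolean inProgress = "IN_PROGRESS".equals(extract(body, "statusCode"));
    Instant started = Instant.parse(extract(body, "operationStart"));

    if (inProgress && Duration.between(started, Instant.now())
        .compareTo(Duration.ofMinutes(10)) > 0) { // assumed threshold
      System.out.println("Rebalance " + extract(body, "operationId")
          + " has been IN_PROGRESS since " + started + "; likely stuck.");
    }
  }

  private static String extract(String json, String field) {
    int i = json.indexOf("\"" + field + "\"");
    int a = json.indexOf('"', json.indexOf(':', i) + 1) + 1;
    return json.substring(a, json.indexOf('"', a));
  }
}
{code}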


