[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Owen Nichols closed GEODE-8200.
-------------------------------
> Rebalance operations stuck in "IN_PROGRESS" state forever
> ---------------------------------------------------------
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
> Issue Type: Bug
> Components: management
> Affects Versions: 1.14.0, 1.15.0
> Reporter: Aaron Lindsey
> Assignee: Anilkumar Gingade
> Priority: Major
> Labels: GeodeOperationAPI, blocks-1.15.0, pull-request-available
> Fix For: 1.15.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>
> We use the management REST API to call rebalance immediately before stopping
> a server to limit the possibility of data loss. In a cluster with 3 locators,
> 3 servers, and no regions, we noticed that sometimes the rebalance operation
> never ends if one of the locators is restarting concurrently with the
> rebalance operation.
> More specifically, we see this issue during an automated "rolling restart"
> operation in a Kubernetes environment, which proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online
> before restarting
> * Immediately before stopping a server, a rebalance operation is performed
> and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation will never
> complete, because it cannot proceed with stopping a server until the
> rebalance operation is completed. A human is then required to intervene and
> manually trigger a rebalance and stop the server. This type of "rolling
> restart" operation is triggered fairly often in Kubernetes: any time part of
> the configuration of the locators or servers changes.
> The following JSON is a sample response from the management REST API that
> shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
> {
>   "statusCode": "IN_PROGRESS",
>   "links": {
>     "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>     "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>   },
>   "operationStart": "2020-05-27T22:38:30.619Z",
>   "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>   "operation": {
>     "simulate": false
>   }
> }
> {code}
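
Until a fix is available, a caller of the REST API could bound the wait with a deadline instead of polling indefinitely. The following is a minimal sketch of such a client-side guard, not part of Geode or of the reporter's tooling. The names `fetch_status` and `wait_for_operation` and the timeout and polling defaults are hypothetical, and the sketch assumes that any "statusCode" other than "IN_PROGRESS" means the operation has finished; the terminal values are not shown in this report.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_status(operation_url: str) -> str:
    """GET the operation resource (the "self" link) and return its "statusCode"."""
    with urllib.request.urlopen(operation_url, timeout=10) as response:
        return json.load(response)["statusCode"]


def wait_for_operation(
    get_status: Callable[[], str],
    timeout_seconds: float = 300.0,       # hypothetical default
    poll_interval_seconds: float = 5.0,   # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until the status leaves "IN_PROGRESS" or the deadline passes.

    Returns the last observed status, or None if the deadline was reached
    while the operation still reported "IN_PROGRESS".
    Assumption: every status other than "IN_PROGRESS" is terminal.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status != "IN_PROGRESS":
            return status
        if clock() >= deadline:
            return None
        sleep(poll_interval_seconds)
```

A caller would pass something like `lambda: fetch_status(self_link)`, where `self_link` is the "self" URL from the response above, and treat a `None` result as a signal to alert an operator or retry the rebalance. This only keeps the rolling restart from blocking silently; it does not address the underlying bug.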
--
This message was sent by Atlassian Jira
(v8.20.7#820007)