
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2021-09-30 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423064#comment-17423064
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 068c613195bc48ff84d69e65557554b1dcb7b0e4 in geode's branch 
refs/heads/develop from agingade
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=068c613 ]

GEODE-8200: After checking locator presence store the status in 
OperationStateStore (#6914)

* GEODE-8200: After checking locator presence store the status in 
OperationStateStore
Co-authored-by: anilkumar gingade 

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Aaron Lindsey
>Assignee: Anilkumar Gingade
>Priority: Major
>  Labels: GeodeOperationAPI, blocks-1.15.0, pull-request-available
> Fix For: 1.15.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>
> We use the management REST API to call rebalance immediately before stopping 
> a server to limit the possibility of data loss. In a cluster with 3 locators, 
> 3 servers, and no regions, we noticed that sometimes the rebalance operation 
> never ends if one of the locators is restarting concurrently with the 
> rebalance operation.
> More specifically, the scenario where we see this issue crop up is during an 
> automated "rolling restart" operation in a Kubernetes environment which 
> proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online 
> before restarting
> * Immediately before stopping a server, a rebalance operation is performed 
> and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation will never 
> complete, because it cannot proceed with stopping a server until the 
> rebalance operation is completed. A human is then required to intervene and 
> manually trigger a rebalance and stop the server. This type of "rolling 
> restart" operation is triggered fairly often in Kubernetes — any time part of 
> the configuration of the locators or servers changes. 
> The following JSON is a sample response from the management REST API that 
> shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
> {
>   "statusCode": "IN_PROGRESS",
>   "links": {
>     "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>     "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>   },
>   "operationStart": "2020-05-27T22:38:30.619Z",
>   "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>   "operation": {
>     "simulate": false
>   }
> }
> {code}
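
A caller cannot tell a slow rebalance from a stuck one by "statusCode" alone. As a minimal sketch, assuming only the response fields shown in the sample above, a client could compare "operationStart" against a deadline of its own choosing. The `is_stuck` helper and the 10-minute limit are hypothetical and are not part of Geode:

```python
import json
from datetime import datetime, timedelta, timezone

# Trimmed copy of the sample response quoted in the ticket.
SAMPLE = """{
  "statusCode": "IN_PROGRESS",
  "operationStart": "2020-05-27T22:38:30.619Z",
  "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
  "operation": {"simulate": false}
}"""


def is_stuck(response_text, now, limit=timedelta(minutes=10)):
    """Return True if the operation is still IN_PROGRESS after `limit`.

    `limit` is a client-side assumption, not a value defined by Geode.
    """
    body = json.loads(response_text)
    if body["statusCode"] != "IN_PROGRESS":
        return False
    # The sample timestamp is UTC ("Z" suffix) with millisecond precision.
    started = datetime.strptime(body["operationStart"], "%Y-%m-%dT%H:%M:%S.%fZ")
    started = started.replace(tzinfo=timezone.utc)
    return now - started > limit
```

A check like this only lets the caller give up and escalate; it does not make the operation itself finish.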



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2021-07-29 Thread Aaron Lindsey (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390194#comment-17390194
 ] 

Aaron Lindsey commented on GEODE-8200:
--

[~jchen21] yes, checking the restore status is done using the GET 
management/v1/operations/restoreRedundancy/{id} endpoint.
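
For illustration, the status request described above could be built as follows. This is a sketch that constructs the request without sending it; the `restore_status_request` helper and its `base_url` parameter are hypothetical, and only the endpoint path comes from this thread:

```python
from urllib.parse import quote
from urllib.request import Request


def restore_status_request(base_url, operation_id):
    """Build (but do not send) the GET request for a restore status check."""
    # Path taken from the comment above; the id is percent-encoded defensively.
    path = "/management/v1/operations/restoreRedundancy/" + quote(operation_id, safe="")
    return Request(base_url.rstrip("/") + path, method="GET")
```

Passing the result to `urllib.request.urlopen` would perform the actual call against a reachable locator.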

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI, blocks-1.15.0
> Fix For: 1.13.1, 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2021-07-29 Thread Jianxia Chen (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390190#comment-17390190
 ] 

Jianxia Chen commented on GEODE-8200:
-

[~aaronlindsey] How is the restore status checked? Is it checked by calling 
the REST API?

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI, blocks-1.15.0
> Fix For: 1.13.1, 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2021-07-29 Thread Aaron Lindsey (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390136#comment-17390136
 ] 

Aaron Lindsey commented on GEODE-8200:
--

[~agingade] it looks like the first one—start restore, then periodically check 
the restore status until the restore completes. After the restore has 
completed, if the status says the restore failed for any reason it will retry 
the whole process. It repeats this process until it eventually gets a 
successful restore.
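
The hook behaviour described above can be sketched as a poll-and-retry loop. This is an illustration, not the actual hook: `start` and `get_status` are hypothetical callables standing in for the REST calls, and the "OK" success value is an assumption, since only "IN_PROGRESS" appears in this thread:

```python
import time


def restore_until_success(start, get_status, poll_seconds=5.0, sleep=time.sleep):
    """Start a restore, poll until it leaves IN_PROGRESS, retry on failure.

    `start` returns an operation id and `get_status` maps an id to a status
    string. Returns the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        op_id = start()
        status = get_status(op_id)
        # This inner loop never exits if the status stays IN_PROGRESS forever.
        while status == "IN_PROGRESS":
            sleep(poll_seconds)
            status = get_status(op_id)
        if status == "OK":  # assumed success value
            return attempts
        # Any other final status counts as a failure: start over.
```

The inner loop shows why a status that never leaves "IN_PROGRESS" blocks the rolling restart: the retry path is only reached once the operation reports a final state.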

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI, blocks-1.15.0
> Fix For: 1.13.1, 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2021-07-29 Thread Anilkumar Gingade (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390119#comment-17390119
 ] 

Anilkumar Gingade commented on GEODE-8200:
--

[~jchen21]
[~aaronlindsey] What is the sequence of commands executed from the k8s hook? Is it 
like:
- Start restore
- Get restore status; if still in progress, keep checking the restore status 
periodically
OR
- Start restore 
- Start restore again after some time?



> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI, blocks-1.15.0
> Fix For: 1.13.1, 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2021-07-26 Thread Aaron Lindsey (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387644#comment-17387644
 ] 

Aaron Lindsey commented on GEODE-8200:
--

[~agingade] We first saw this issue on July 16 while testing Geode at commit 
[https://github.com/apache/geode/commit/8b7a1a242290523310080a13338c1f85a283c684].
The issue was discovered in an automated test which had been passing 
consistently for some time. We have only seen the issue happen with the 
"restore redundancy" operation, but I cannot say for sure that the issue does 
not happen for the "rebalance" operation as well.

We have a closed-source test which reproduces the issue, but it does not 
reproduce the issue every time. The test does a rolling restart of the Geode 
cluster by restarting up to one locator and one server at a time. We have a 
Kubernetes hook which runs "restore redundancy" right before a server is 
stopped to reduce the chance of data loss. The hook is implemented such that 
the "restore redundancy" operation must succeed before the server can be 
stopped. Note that this is the exact same scenario as described in the original 
ticket description, except that we now use "restore redundancy" instead of 
"rebalance".

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI, blocks-1.15.0
> Fix For: 1.13.1, 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2021-07-26 Thread Anilkumar Gingade (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387638#comment-17387638
 ] 

Anilkumar Gingade commented on GEODE-8200:
--

>> Re-opened because this issue has started reproducing again on develop.
[~aaronlindsey] Can you please add more details to this? Reading this ticket 
description, the issue was addressed for "rebalance"; from your comments it 
seems like it's with the "restore redundancy" command.
Questions:
- Is the issue with the "rebalance" command?
- Is the issue only with "restore redundancy"?
- Is there a test that reproduces the issue?
- What are the steps involved in reproducing the issue? Does the issue 
reproduce every time?



> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI
> Fix For: 1.13.1, 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2021-07-23 Thread Aaron Lindsey (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386578#comment-17386578
 ] 

Aaron Lindsey commented on GEODE-8200:
--

The only difference is that now it's happening for restore redundancy instead 
of rebalance, but it's pretty much the same scenario where we see the issue. 
(We switched to using restore redundancy instead of rebalance after that 
operation was added to Geode.)

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI
> Fix For: 1.13.1, 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2021-07-23 Thread Aaron Lindsey (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386570#comment-17386570
 ] 

Aaron Lindsey commented on GEODE-8200:
--

Re-opened because this issue has started reproducing again on develop.

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI
> Fix For: 1.13.1, 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204098#comment-17204098
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch 
refs/heads/support/1.13 from Jinmei Liao
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ]

GEODE-8200: enhance GfshRule to specify a working dir (#5299)

* improve some backward compatibility test to cover more versions.

(cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769)


> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI
> Fix For: 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204096#comment-17204096
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch 
refs/heads/support/1.13 from Jianxia Chen
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ]

GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350)

Record the locator that issues the original Rest API request. If the locator is 
offline afterwards, report an error.

Co-authored-by: Jianxia Chen 
Co-authored-by: Jinmei Liao 
(cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81)
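
The idea in this commit message can be illustrated with a small sketch. It is not Geode's Java implementation; the `OperationState` type, the `effective_status` function, and the status strings are hypothetical names chosen for the example:

```python
from dataclasses import dataclass


@dataclass
class OperationState:
    operation_id: str
    locator: str  # member that accepted the original REST request
    status: str = "IN_PROGRESS"


def effective_status(state, live_members):
    """Report an error for an unfinished operation whose locator is gone."""
    if state.status == "IN_PROGRESS" and state.locator not in live_members:
        # The locator that was running the operation is offline, so nothing
        # is left to complete it; surface an error instead of IN_PROGRESS.
        return "ERROR"
    return state.status
```

For example, an operation recorded against `locator-0` would read as "ERROR" once `locator-0` is missing from the live member set, which lets a polling client retry instead of waiting forever.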


> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI
> Fix For: 1.14.0
>
> Attachments: GEODE-8200-exportedLogs.zip
>
>
> We use the management REST API to call rebalance immediately before stopping 
> a server to limit the possibility of data loss. In a cluster with 3 locators, 
> 3 servers, and no regions, we noticed that sometimes the rebalance operation 
> never ends if one of the locators is restarting concurrently with the 
> rebalance operation.
> More specifically, the scenario where we see this issue crop up is during an 
> automated "rolling restart" operation in a Kubernetes environment which 
> proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online 
> before restarting
> * Immediately before stopping a server, a rebalance operation is performed 
> and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation will never 
> complete, because it cannot proceed with stopping a server until the 
> rebalance operation is completed. A human is then required to intervene and 
> manually trigger a rebalance and stop the server. This type of "rolling 
> restart" operation is triggered fairly often in Kubernetes — any time part of 
> the configuration of the locators or servers changes. 
> The following JSON is a sample response from the management REST API that 
> shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
> {
>   "statusCode": "IN_PROGRESS",
>   "links": {
>     "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>     "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>   },
>   "operationStart": "2020-05-27T22:38:30.619Z",
>   "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>   "operation": {
>     "simulate": false
>   }
> }
> {code}
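
The rolling-restart loop described in the issue waits on this status without any bound. As a rough client-side illustration (separate from the server-side fix, and not taken from the reporter's tooling), a caller could poll with a deadline. The class RebalancePoller, the awaitCompletion() helper and the "TIMED_OUT"/"INTERRUPTED" strings are assumptions made for this sketch. The status source is passed in as a Supplier so the example runs without a cluster.

```java
import java.time.Duration;
import java.util.function.Supplier;

public class RebalancePoller {

  /**
   * Polls fetchStatus until it returns something other than "IN_PROGRESS"
   * or the timeout elapses. Returns the final status, or the placeholder
   * "TIMED_OUT" / "INTERRUPTED" (both hypothetical values for this sketch).
   */
  static String awaitCompletion(Supplier<String> fetchStatus,
                                Duration timeout, Duration interval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      String status = fetchStatus.get();
      if (!"IN_PROGRESS".equals(status)) {
        return status;
      }
      if (System.nanoTime() - deadline >= 0) {
        return "TIMED_OUT";
      }
      try {
        Thread.sleep(interval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return "INTERRUPTED";
      }
    }
  }

  public static void main(String[] args) {
    // Stub that behaves like the stuck operation in the issue.
    Supplier<String> stuck = () -> "IN_PROGRESS";
    System.out.println(
        awaitCompletion(stuck, Duration.ofMillis(200), Duration.ofMillis(50)));
  }
}
```

In a real caller the Supplier would wrap an HTTP GET against the operation's "self" link and extract "statusCode" from the response. On a timeout the automation can raise an alert rather than block the restart indefinitely.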




[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204088#comment-17204088
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch 
refs/heads/support/1.13 from Jianxia Chen
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ]

GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350)

Record the locator that issues the original Rest API request. If the locator is 
offline afterwards, report an error.

Co-authored-by: Jianxia Chen 
Co-authored-by: Jinmei Liao 
(cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81)



[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204090#comment-17204090
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch 
refs/heads/support/1.13 from Jinmei Liao
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ]

GEODE-8200: enhance GfshRule to specify a working dir (#5299)

* improve some backward compatibility test to cover more versions.

(cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769)



[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204079#comment-17204079
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch 
refs/heads/support/1.13 from Jianxia Chen
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ]

GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350)

Record the locator that issues the original Rest API request. If the locator is 
offline afterwards, report an error.

Co-authored-by: Jianxia Chen 
Co-authored-by: Jinmei Liao 
(cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81)



[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204082#comment-17204082
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch 
refs/heads/support/1.13 from Jinmei Liao
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ]

GEODE-8200: enhance GfshRule to specify a working dir (#5299)

* improve some backward compatibility test to cover more versions.

(cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769)



[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204070#comment-17204070
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch 
refs/heads/support/1.13 from Jinmei Liao
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ]

GEODE-8200: enhance GfshRule to specify a working dir (#5299)

* improve some backward compatibility test to cover more versions.

(cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769)



[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204068#comment-17204068
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch 
refs/heads/support/1.13 from Jianxia Chen
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ]

GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350)

Record the locator that issues the original Rest API request. If the locator is 
offline afterwards, report an error.

Co-authored-by: Jianxia Chen 
Co-authored-by: Jinmei Liao 
(cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81)



[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204062#comment-17204062
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch 
refs/heads/support/1.13 from Jinmei Liao
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ]

GEODE-8200: enhance GfshRule to specify a working dir (#5299)

* improve some backward compatibility test to cover more versions.

(cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769)



[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-09-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204059#comment-17204059
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch 
refs/heads/support/1.13 from Jianxia Chen
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ]

GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350)

Record the locator that issues the original Rest API request. If the locator is 
offline afterwards, report an error.

Co-authored-by: Jianxia Chen 
Co-authored-by: Jinmei Liao 
(cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81)



[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-07-10 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155757#comment-17155757
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 426b9de66ae9adadde8f1994b2340fc9de809d81 in geode's branch 
refs/heads/develop from Jianxia Chen
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=426b9de ]

GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350)

Record the locator that issues the original Rest API request. If the locator is 
offline afterwards, report an error.

Co-authored-by: Jianxia Chen 
Co-authored-by: Jinmei Liao 


[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-07-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155756#comment-17155756
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jchen21 merged pull request #5350:
URL: https://github.com/apache/geode/pull/5350


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Assignee: Jianxia Chen
>Priority: Major
>  Labels: GeodeOperationAPI
> Attachments: GEODE-8200-exportedLogs.zip
>
>
> We use the management REST API to call rebalance immediately before stopping 
> a server to limit the possibility of data loss. In a cluster with 3 locators, 
> 3 servers, and no regions, we noticed that sometimes the rebalance operation 
> never ends if one of the locators is restarting concurrently with the 
> rebalance operation.
> More specifically, the scenario where we see this issue crop up is during an 
> automated "rolling restart" operation in a Kubernetes environment which 
> proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online 
> before restarting
> * Immediately before stopping a server, a rebalance operation is performed 
> and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation will never 
> complete, because it cannot proceed with stopping a server until the 
> rebalance operation is completed. A human is then required to intervene and 
> manually trigger a rebalance and stop the server. This type of "rolling 
> restart" operation is triggered fairly often in Kubernetes — any time part of 
> the configuration of the locators or servers changes. 
> The following JSON is a sample response from the management REST API that 
> shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
> {
>   "statusCode": "IN_PROGRESS",
>   "links": {
>     "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>     "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>   },
>   "operationStart": "2020-05-27T22:38:30.619Z",
>   "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>   "operation": {
>     "simulate": false
>   }
> }
> {code}
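
The scenario quoted above depends on a client that waits for the operation to leave "IN_PROGRESS". As a rough illustration only, here is a minimal sketch of such a wait loop with a deadline, so that a stuck operation cannot block a rolling restart forever. The class `RebalancePoller`, the `Supplier<String>` status source, and the "DONE", "TIMED_OUT" and "INTERRUPTED" labels are all hypothetical; a real client would read `statusCode` from the `self` link of the response.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper, not part of Geode: polls a status source until the
// operation leaves IN_PROGRESS or a deadline passes.
class RebalancePoller {
  static final String IN_PROGRESS = "IN_PROGRESS";

  // Returns the first status that is not IN_PROGRESS. Returns "TIMED_OUT" if
  // the deadline passes first and "INTERRUPTED" if the waiting thread is
  // interrupted (both labels are invented for this sketch).
  static String awaitCompletion(Supplier<String> statusSource, Duration timeout,
      Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      String status = statusSource.get();
      if (!IN_PROGRESS.equals(status)) {
        return status;
      }
      if (System.nanoTime() - deadline >= 0) {
        return "TIMED_OUT";
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return "INTERRUPTED";
      }
    }
  }

  public static void main(String[] args) {
    // Simulated responses standing in for repeated GETs of the "self" link.
    Iterator<String> responses = List.of("IN_PROGRESS", "IN_PROGRESS", "DONE").iterator();
    String result =
        awaitCompletion(responses::next, Duration.ofSeconds(5), Duration.ofMillis(10));
    System.out.println(result);
  }
}
```

A deadline on the client side only limits the damage; it does not change what the server reports for the operation.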



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-07-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154981#comment-17154981
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jchen21 commented on a change in pull request #5350:
URL: https://github.com/apache/geode/pull/5350#discussion_r452537292



##
File path: 
geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationState.java
##
@@ -28,12 +28,25 @@
  */
 public class OperationState, V extends OperationResult>
     implements Identifiable {
+  private static final long serialVersionUID = 8212319653561969588L;
   private final String opId;
   private final A operation;
   private final Date operationStart;
   private Date operationEnd;
   private V result;
   private Throwable throwable;
+  private String locator;

Review comment:
   Do you mean `InternalDistributedMember` when you say `DistributedID`? I was 
using `InternalDistributedMember` as the type of the `locator` field, but it is 
a large object with many fields. We only need to identify the locator, so the 
member's ID as a String should be enough.
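
To illustrate the trade-off in this comment, here is a minimal sketch of a serializable state object that records only the locator's ID as a String. `OperationStateSketch` is a hypothetical stand-in, not the Geode class, and its `serialVersionUID` and field set are arbitrary choices for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for the state object; not the Geode class.
class OperationStateSketch implements Serializable {
  // Arbitrary value chosen for this sketch.
  private static final long serialVersionUID = 1L;

  private final String opId;
  // Only the member's ID is kept, as a plain String.
  private final String locator;

  OperationStateSketch(String opId, String locator) {
    this.opId = opId;
    this.locator = locator;
  }

  String getOpId() {
    return opId;
  }

  String getLocator() {
    return locator;
  }

  // Serializes and deserializes the state, as a replicated store might do.
  static OperationStateSketch roundTrip(OperationStateSketch state) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(state);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (OperationStateSketch) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    OperationStateSketch copy = roundTrip(new OperationStateSketch("op-1", "locator-0"));
    System.out.println(copy.getOpId() + " started by " + copy.getLocator());
  }
}
```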






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-07-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154974#comment-17154974
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jchen21 commented on a change in pull request #5350:
URL: https://github.com/apache/geode/pull/5350#discussion_r452534574



##
File path: 
geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationHistoryManager.java
##
@@ -90,6 +95,27 @@ private static boolean isExpired(long expirationTime, OperationState opera
     return operationEnd.getTime() <= expirationTime;
   }
 
+  private OperationState validateLocator(OperationState operationState) {
+    if (isLocatorOffline(operationState)) {
+      operationState.setOperationEnd(new Date(), null,
+          new RuntimeException("Locator that initiated the Rest API operation is offline: "
+              + operationState.getLocator()));
+    }
+
+    return operationState;
+  }
+
+  private boolean isLocatorOffline(OperationState operationState) {
+    if (operationState.getOperationEnd() == null
+        && (operationState.getLocator() != null)
+        && cache.getMyId().toString().compareTo(operationState.getLocator()) != 0

Review comment:
   Good point!
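
The diff excerpt above is cut off before the end of the condition, so the following is only a sketch of what such a check might look like, not the code in the pull request. It assumes the final step is a lookup of the initiating locator among the members that are currently online; `LocatorCheckSketch`, its parameters, and the `onlineLocatorIds` set are hypothetical names introduced for this example.

```java
import java.util.Set;

// Hypothetical sketch; the membership lookup is an assumption, since the
// quoted diff is truncated before that part of the condition.
class LocatorCheckSketch {

  // An unfinished operation is treated as orphaned when the locator that
  // started it is known, is not this member, and is not currently online.
  static boolean isInitiatorOffline(boolean operationEnded, String initiatingLocator,
      String myId, Set<String> onlineLocatorIds) {
    return !operationEnded
        && initiatingLocator != null
        && !myId.equals(initiatingLocator)
        && !onlineLocatorIds.contains(initiatingLocator);
  }

  public static void main(String[] args) {
    Set<String> online = Set.of("locator-1", "locator-2");
    // locator-0 started the operation and has since gone away.
    System.out.println(isInitiatorOffline(false, "locator-0", "locator-1", online));
    // locator-2 started the operation and is still a member.
    System.out.println(isInitiatorOffline(false, "locator-2", "locator-1", online));
  }
}
```

When the check returns true, the caller can record an end time and an error on the operation, as `validateLocator` does in the excerpt, so that clients stop seeing "IN_PROGRESS".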






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-07-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154971#comment-17154971
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

agingade commented on a change in pull request #5350:
URL: https://github.com/apache/geode/pull/5350#discussion_r452518577



##
File path: 
geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationHistoryManager.java
##
@@ -90,6 +95,27 @@ private static boolean isExpired(long expirationTime, OperationState opera
     return operationEnd.getTime() <= expirationTime;
   }
 
+  private OperationState validateLocator(OperationState operationState) {
+    if (isLocatorOffline(operationState)) {
+      operationState.setOperationEnd(new Date(), null,
+          new RuntimeException("Locator that initiated the Rest API operation is offline: "
+              + operationState.getLocator()));
+    }
+
+    return operationState;
+  }
+
+  private boolean isLocatorOffline(OperationState operationState) {
+    if (operationState.getOperationEnd() == null
+        && (operationState.getLocator() != null)
+        && cache.getMyId().toString().compareTo(operationState.getLocator()) != 0

Review comment:
   Does it need to be compared? Can it be changed to `equals`?
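
For context on this suggestion: for two non-null strings, `compareTo(...) != 0` and `!equals(...)` give the same answer, but `equals` states the intent directly and accepts a null argument. The small sketch below shows both; `IdComparisonSketch` and its method names are hypothetical.

```java
// Hypothetical sketch comparing the two ways of testing whether two IDs differ.
class IdComparisonSketch {

  // Form used in the diff: ordering comparison, checked against zero.
  static boolean differsByCompareTo(String myId, String locatorId) {
    return myId.compareTo(locatorId) != 0;
  }

  // Form suggested in the review: plain equality.
  static boolean differsByEquals(String myId, String locatorId) {
    return !myId.equals(locatorId);
  }

  public static void main(String[] args) {
    System.out.println(differsByCompareTo("locator-0", "locator-1"));
    System.out.println(differsByEquals("locator-0", "locator-1"));
    // equals tolerates a null argument; compareTo would throw NullPointerException.
    System.out.println(differsByEquals("locator-0", null));
  }
}
```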

##
File path: 
geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationState.java
##
@@ -28,12 +28,25 @@
  */
 public class OperationState, V extends OperationResult>
     implements Identifiable {
+  private static final long serialVersionUID = 8212319653561969588L;
   private final String opId;
   private final A operation;
   private final Date operationStart;
   private Date operationEnd;
   private V result;
   private Throwable throwable;
+  private String locator;

Review comment:
   Can this be a DistributedID rather than a String ID? That way we can avoid 
converting to a string in other places.






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-07-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154964#comment-17154964
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jchen21 commented on a change in pull request #5350:
URL: https://github.com/apache/geode/pull/5350#discussion_r452488277



##
File path: 
geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationState.java
##
@@ -28,12 +28,25 @@
  */
 public class OperationState, V extends OperationResult>
     implements Identifiable {
+  private static final long serialVersionUID = 8212319653561969588L;
   private final String opId;
   private final A operation;
   private final Date operationStart;
   private Date operationEnd;
   private V result;
   private Throwable throwable;
+  private String locator;
+
+  public String getLocator() {
+    return this.locator;
+  }
+
+  public void setLocator(
+      String locator) {
+    synchronized (this) {

Review comment:
   This is for consistency. `setOperationEnd()` does the same.

##
File path: 
geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationStateStore.java
##
@@ -53,6 +53,8 @@
    */
   void recordEnd(String opId, V result, Throwable exception);
 
+  void recordLocator(String opId, String locator);

Review comment:
   Good point.






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-07-08 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154023#comment-17154023
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jinmeiliao commented on a change in pull request #5350:
URL: https://github.com/apache/geode/pull/5350#discussion_r451826206



##
File path: 
geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationState.java
##
@@ -28,12 +28,25 @@
  */
 public class OperationState, V extends OperationResult>
     implements Identifiable {
+  private static final long serialVersionUID = 8212319653561969588L;
   private final String opId;
   private final A operation;
   private final Date operationStart;
   private Date operationEnd;
   private V result;
   private Throwable throwable;
+  private String locator;
+
+  public String getLocator() {
+    return this.locator;
+  }
+
+  public void setLocator(
+      String locator) {
+    synchronized (this) {

Review comment:
   This is just a one-line operation; is it not atomic? If it is not, can we 
put `synchronized` on the method instead?
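
Some background on this question: in Java a write to a reference field is atomic, but atomicity alone does not guarantee that other threads see the new value. Locking is what gives that guarantee, provided readers take the same lock. The sketch below shows the block form from the diff next to the method form suggested here; `LocatorHolderSketch` is a hypothetical class, not Geode code.

```java
// Hypothetical sketch of the two equivalent locking forms for a one-line setter.
class LocatorHolderSketch {
  private String locator;

  // Block form, as in the diff under review: locks on this instance.
  public void setLocatorWithBlock(String locator) {
    synchronized (this) {
      this.locator = locator;
    }
  }

  // Method form: a synchronized instance method also locks on this instance.
  public synchronized void setLocator(String locator) {
    this.locator = locator;
  }

  // The reader takes the same lock, so it sees the latest written value.
  public synchronized String getLocator() {
    return locator;
  }

  public static void main(String[] args) {
    LocatorHolderSketch holder = new LocatorHolderSketch();
    holder.setLocator("locator-0");
    System.out.println(holder.getLocator());
  }
}
```

Both setters take the same monitor, so choosing between them is a matter of style and of staying consistent with the other accessors in the class.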

##
File path: 
geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationStateStore.java
##
@@ -53,6 +53,8 @@
    */
   void recordEnd(String opId, V result, Throwable exception);
 
+  void recordLocator(String opId, String locator);

Review comment:
   Instead of adding this interface method, you can change recordStart() to 
take a locator ID parameter, since when an operation starts we should always 
know which locator started it.
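
A minimal sketch of the shape this suggestion points at, where the locator is recorded together with the start of the operation rather than through a separate call. The interface `OperationStoreSketch`, its in-memory implementation, and both method signatures are hypothetical and do not reproduce the actual `OperationStateStore` API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical store interface: the locator is supplied when the operation starts.
interface OperationStoreSketch {
  void recordStart(String opId, String locator);

  String locatorOf(String opId);
}

// Simple in-memory implementation used only to exercise the interface.
class InMemoryOperationStoreSketch implements OperationStoreSketch {
  private final Map<String, String> locatorByOpId = new ConcurrentHashMap<>();

  @Override
  public void recordStart(String opId, String locator) {
    locatorByOpId.put(opId, locator);
  }

  @Override
  public String locatorOf(String opId) {
    return locatorByOpId.get(opId);
  }

  public static void main(String[] args) {
    OperationStoreSketch store = new InMemoryOperationStoreSketch();
    store.recordStart("op-1", "locator-0");
    System.out.println(store.locatorOf("op-1"));
  }
}
```

With this shape there is no window in which an operation exists in the store without a known initiating locator.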






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-07-06 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152396#comment-17152396
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jchen21 opened a new pull request #5350:
URL: https://github.com/apache/geode/pull/5350


   Thank you for submitting a contribution to Apache Geode.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   ### Note:
   Please ensure that once the PR is submitted, check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to d...@geode.apache.org.
   




[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-06-25 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145103#comment-17145103
 ] 

ASF subversion and git services commented on GEODE-8200:


Commit 561533c53cf44e53c42f26cd988eae6821af6769 in geode's branch 
refs/heads/develop from Jinmei Liao
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=561533c ]

GEODE-8200: enhance GfshRule to specify a working dir (#5299)

* improve some backward compatibility test to cover more versions.


[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-06-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145100#comment-17145100
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jinmeiliao commented on a change in pull request #5299:
URL: https://github.com/apache/geode/pull/5299#discussion_r445705089



##
File path: 
geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java
##
@@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) {
     }
     execute(GfshScript.of(stopMemberScripts).withName("Stop-Members"));
   }
+
+  public static String startServerCommand(String name, int port, int connectedLocatorPort) {

Review comment:
   Thanks! I will merge this now and we can make improvements on it when we 
get to consolidating all these methods into one place.






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-06-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145102#comment-17145102
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jinmeiliao merged pull request #5299:
URL: https://github.com/apache/geode/pull/5299


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
> Issue Type: Bug
> Components: management
> Reporter: Aaron Lindsey
> Assignee: Jianxia Chen
> Priority: Major
> Labels: GeodeOperationAPI
> Attachments: GEODE-8200-exportedLogs.zip
>
>
> We use the management REST API to call rebalance immediately before stopping 
> a server to limit the possibility of data loss. In a cluster with 3 locators, 
> 3 servers, and no regions, we noticed that sometimes the rebalance operation 
> never ends if one of the locators is restarting concurrently with the 
> rebalance operation.
> More specifically, the scenario where we see this issue crop up is during an 
> automated "rolling restart" operation in a Kubernetes environment which 
> proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online 
> before restarting
> * Immediately before stopping a server, a rebalance operation is performed 
> and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation will never 
> complete, because it cannot proceed with stopping a server until the 
> rebalance operation is completed. A human is then required to intervene and 
> manually trigger a rebalance and stop the server. This type of "rolling 
> restart" operation is triggered fairly often in Kubernetes — any time part of 
> the configuration of the locators or servers changes. 
> The following JSON is a sample response from the management REST API that 
> shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
> {
>   "statusCode": "IN_PROGRESS",
>   "links": {
>     "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>     "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>   },
>   "operationStart": "2020-05-27T22:38:30.619Z",
>   "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>   "operation": {
>     "simulate": false
>   }
> }
> {code}




[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-06-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145088#comment-17145088
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

kirklund commented on a change in pull request #5299:
URL: https://github.com/apache/geode/pull/5299#discussion_r445695124



##
File path: geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java
##
@@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) {
     }
     execute(GfshScript.of(stopMemberScripts).withName("Stop-Members"));
   }
+
+  public static String startServerCommand(String name, int port, int connectedLocatorPort) {

Review comment:
   We could commit what you have ready and then make further changes. I'll 
go ahead and approve and let you decide.






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-06-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145087#comment-17145087
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

kirklund commented on a change in pull request #5299:
URL: https://github.com/apache/geode/pull/5299#discussion_r445693168



##
File path: geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java
##
@@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) {
     }
     execute(GfshScript.of(stopMemberScripts).withName("Stop-Members"));
   }
+
+  public static String startServerCommand(String name, int port, int connectedLocatorPort) {

Review comment:
   There are actually a couple of existing classes like this that we could 
move to geode-junit and make any necessary changes:
   ```
   geode-core/src/integrationTest/java/org/apache/geode/distributed/LocatorCommand.java
   geode-core/src/integrationTest/java/org/apache/geode/distributed/ServerCommand.java
   ```






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-06-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145085#comment-17145085
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

kirklund commented on a change in pull request #5299:
URL: https://github.com/apache/geode/pull/5299#discussion_r445693168



##
File path: geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java
##
@@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) {
     }
     execute(GfshScript.of(stopMemberScripts).withName("Stop-Members"));
   }
+
+  public static String startServerCommand(String name, int port, int connectedLocatorPort) {

Review comment:
   There are actually a couple of existing classes like this that we could 
move to geode-dunit and make any necessary changes:
   ```
   geode-core/src/integrationTest/java/org/apache/geode/distributed/LocatorCommand.java
   geode-core/src/integrationTest/java/org/apache/geode/distributed/ServerCommand.java
   ```






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-06-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145081#comment-17145081
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

kirklund commented on a change in pull request #5299:
URL: https://github.com/apache/geode/pull/5299#discussion_r445691690



##
File path: geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java
##
@@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) {
     }
     execute(GfshScript.of(stopMemberScripts).withName("Stop-Members"));
   }
+
+  public static String startServerCommand(String name, int port, int connectedLocatorPort) {

Review comment:
   These new static methods don't really fit well in GfshRule. I think you 
should probably move them to two new classes (probably not rules) named 
something like ServerCommandBuilder and LocatorCommandBuilder, and break the 
parameters out so that each has its own setter-style method:
   ```
   String startServerCommand = new ServerCommandBuilder()
       .withPort(port)
       .withLocator(locatorPort)
       .create(name);
   ```
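
   The builder suggested above could look roughly like the following. This is 
only a sketch: the class does not exist in Geode in this form, the method names 
are taken from the snippet above, and the use of `localhost` and of the 
`--server-port` and `--locators` gfsh options is an assumption about what the 
helper would emit.

```java
// Hypothetical sketch of the ServerCommandBuilder proposed in the review.
// It is not an existing Geode class; it only assembles a gfsh command string.
final class ServerCommandBuilder {
  private Integer port;        // server port, optional
  private Integer locatorPort; // port of the locator to connect to, optional

  ServerCommandBuilder withPort(int port) {
    this.port = port;
    return this;
  }

  ServerCommandBuilder withLocator(int locatorPort) {
    this.locatorPort = locatorPort;
    return this;
  }

  // Builds the "start server" command for a member with the given name.
  String create(String name) {
    StringBuilder command = new StringBuilder("start server --name=").append(name);
    if (port != null) {
      command.append(" --server-port=").append(port);
    }
    if (locatorPort != null) {
      // Assumption: the locator runs on localhost, as is typical in tests.
      command.append(" --locators=localhost[").append(locatorPort).append("]");
    }
    return command.toString();
  }
}
```

   Each option is set by name at the call site, so a reader does not need to 
remember the order of the `int` parameters.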






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-06-24 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144488#comment-17144488
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jchen21 commented on a change in pull request #5299:
URL: https://github.com/apache/geode/pull/5299#discussion_r445226334



##
File path: geode-assembly/src/upgradeTest/java/org/apache/geode/management/DeploymentManagementUpgradeTest.java
##
@@ -55,16 +84,16 @@ public static void beforeClass() throws Exception {
 
   @Test
   public void newLocatorCanReadOldConfigurationData() throws IOException {
-    File workingDir = tempFolder.newFolder();
     int[] ports = AvailablePortHelper.getRandomAvailableTCPPorts(3);
-    oldGfsh.execute("start locator --name=test --port=" + ports[0] + " --http-service-port="
-        + ports[1] + " --dir=" + workingDir.getAbsolutePath() + " --J=-Dgemfire.jmx-manager-port="
-        + ports[2],
-        "deploy --jar=" + clusterJar.getAbsolutePath(),
-        "shutdown --include-locators");
+    GfshExecution execute =
+        GfshScript.of(startLocatorCommand("test", ports[0], ports[2], ports[1], 0))

Review comment:
   For code readability, it is better to give the ports meaningful names. 
Before the code change, it was easy to identify the purpose of each port from 
the gfsh option next to it in the code. Now it is not that straightforward to 
tell that ports[0] is the locator port, ports[2] is the JMX port, and ports[1] 
is the HTTP port, unless you look at the definition of `startLocatorCommand`.
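
   One way to address this is to copy the array elements into named local 
variables before the call. The sketch below assumes a simplified, hypothetical 
`startLocatorCommand` that takes the locator, JMX, and HTTP ports in that order 
(as the comment above describes); the fifth argument of the real helper is left 
out because its meaning is not stated here.

```java
// Hypothetical illustration of naming the ports before passing them on.
final class NamedPortsExample {

  // Stand-in for the helper under review; the gfsh options are the ones used
  // by the code that the diff removes.
  static String startLocatorCommand(String name, int locatorPort, int jmxPort, int httpPort) {
    return "start locator --name=" + name
        + " --port=" + locatorPort
        + " --http-service-port=" + httpPort
        + " --J=-Dgemfire.jmx-manager-port=" + jmxPort;
  }

  static String buildCommand(int[] ports) {
    // Named locals document the role of each array slot at the call site.
    int locatorPort = ports[0];
    int httpPort = ports[1];
    int jmxPort = ports[2];
    return startLocatorCommand("test", locatorPort, jmxPort, httpPort);
  }
}
```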






[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-06-24 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144391#comment-17144391
 ] 

ASF GitHub Bot commented on GEODE-8200:
---

jinmeiliao opened a new pull request #5299:
URL: https://github.com/apache/geode/pull/5299


   * improve some backward compatibility tests to cover more versions.
   




[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-05-29 Thread Anilkumar Gingade (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119989#comment-17119989
 ] 

Anilkumar Gingade commented on GEODE-8200:
--

A probable workaround is to restart the locator and the server in sequence rather than concurrently.

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
> Issue Type: Bug
> Components: management
> Reporter: Aaron Lindsey
> Priority: Major
> Labels: GeodeOperationAPI
>
> We use the management REST API to call rebalance immediately before stopping 
> a server to limit the possibility of data loss. In a cluster with 3 locators, 
> 3 servers, and no regions, we noticed that sometimes the rebalance operation 
> never ends if one of the locators is restarting concurrently with the 
> rebalance operation.
> More specifically, the scenario where we see this issue crop up is during an 
> automated "rolling restart" operation in a Kubernetes environment which 
> proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online 
> before restarting
> * Immediately before stopping a server, a rebalance operation is performed 
> and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation will never 
> complete, because it cannot proceed with stopping a server until the 
> rebalance operation is completed. A human is then required to intervene and 
> manually trigger a rebalance and stop the server. This type of "rolling 
> restart" operation is triggered fairly often in Kubernetes — any time part of 
> the configuration of the locators or servers changes. 
> The following JSON is a sample response from the management REST API that 
> shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
> {
>   "statusCode": "IN_PROGRESS",
>   "links": {
>     "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>     "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>   },
>   "operationStart": "2020-05-27T22:38:30.619Z",
>   "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>   "operation": {
>     "simulate": false
>   }
> }
> {code}




[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-05-29 Thread Anilkumar Gingade (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119985#comment-17119985
 ] 

Anilkumar Gingade commented on GEODE-8200:
--

The test scenario is:
- 3 locators
- 3 servers
- Concurrent "rolling restart locator" and "rolling restart server"

In rolling restart locator, for each locator:
- Stop locator
- Start locator
- Wait for locator to come online (using the REST API)

In rolling restart server, for each server:
- Call REST rebalance
- Wait for rebalance to complete
- Stop server
- Start server
- Wait for server to come online (using the REST API)
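
The "wait for rebalance to complete" step is where the rolling restart hangs. As a rough sketch of how a caller might bound that wait, the following polls a status source until it stops reporting "IN_PROGRESS" or a deadline passes. The class, the method, and the `Supplier` standing in for the REST call are all hypothetical; only the "IN_PROGRESS" value comes from the sample response in this issue.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical helper for a restart script: polls an operation's status with
// an upper bound on the total wait instead of polling forever.
final class RebalanceWaitSketch {

  // Returns true once the status is no longer "IN_PROGRESS",
  // false if the timeout elapses (or the thread is interrupted) first.
  static boolean waitForCompletion(Supplier<String> statusSource,
      Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (!"IN_PROGRESS".equals(statusSource.get())) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // give up; the caller decides how to recover
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

A timeout does not fix the stuck status itself, but it lets the automation notice the condition and, for example, trigger a new rebalance instead of waiting indefinitely.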



[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-05-29 Thread Jinmei Liao (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119892#comment-17119892
 ] 

Jinmei Liao commented on GEODE-8200:


If the locator that initiated the rebalance operation goes down before it can 
record the "completed" state of the operation, then that operation's status is 
"orphaned": the executor has died, so there is no one left to execute the 
"whenComplete" section of the code here: 
https://github.com/apache/geode/blob/57cc3c7b40816bc1b7bbae80481dea608c7caff5/geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationManager.java#L70

That operation id will be stuck in "in_progress" status and kept in the history 
forever, because apparently we only clean up the "done" records.
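
To illustrate the failure mode, here is a small self-contained sketch. It is not the OperationManager code linked above; the class, the status names other than IN_PROGRESS, and the idea of passing in the set of live locators are all assumptions made for the example. The status is only updated inside the `whenComplete` callback, so if the callback never runs the record stays IN_PROGRESS unless the reader also checks whether the initiating locator is still present.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of an operation history whose entries are completed
// only by a callback running on the locator that started the operation.
final class OrphanedOperationSketch {
  enum Status { IN_PROGRESS, DONE, ORPHANED }

  static final class Entry {
    final String locator; // locator that initiated the operation
    volatile Status status = Status.IN_PROGRESS;

    Entry(String locator) {
      this.locator = locator;
    }
  }

  private final Map<String, Entry> history = new ConcurrentHashMap<>();

  // Records the operation and marks it DONE when the work finishes.
  // If the initiating locator dies first, this callback never runs.
  CompletableFuture<Void> submit(String opId, String locator, CompletableFuture<Void> work) {
    Entry entry = new Entry(locator);
    history.put(opId, entry);
    return work.whenComplete((result, error) -> {
      entry.status = Status.DONE;
    });
  }

  // Reads the status, flagging entries whose initiating locator is gone.
  Status status(String opId, Set<String> liveLocators) {
    Entry entry = history.get(opId);
    if (entry == null) {
      throw new IllegalArgumentException("unknown operation: " + opId);
    }
    if (entry.status == Status.IN_PROGRESS && !liveLocators.contains(entry.locator)) {
      entry.status = Status.ORPHANED; // store the corrected status
    }
    return entry.status;
  }
}
```

With a check like this, a client polling the operation would see a terminal status instead of IN_PROGRESS once the initiating locator has left the cluster.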

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
>  Issue Type: Bug
>  Components: management
>Reporter: Aaron Lindsey
>Priority: Major
>  Labels: GeodeOperationAPI
>
> We use the management REST API to call rebalance immediately before stopping 
> a server to limit the possibility of data loss. In a cluster with 3 locators, 
> 3 servers, and no regions, we noticed that sometimes the rebalance operation 
> never ends if one of the locators is restarting concurrently with the 
> rebalance operation.
> More specifically, the scenario where we see this issue crop up is during an 
> automated "rolling restart" operation in a Kubernetes environment which 
> proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online 
> before restarting
> * Immediately before stopping a server, a rebalance operation is performed 
> and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation will never 
> complete, because it cannot proceed with stopping a server until the 
> rebalance operation is completed. A human is then required to intervene and 
> manually trigger a rebalance and stop the server. This type of "rolling 
> restart" operation is triggered fairly often in Kubernetes — any time part of 
> the configuration of the locators or servers changes. 
> The following JSON is a sample response from the management REST API that 
> shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
> {
>   "statusCode": "IN_PROGRESS",
>   "links": {
>     "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>     "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>   },
>   "operationStart": "2020-05-27T22:38:30.619Z",
>   "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>   "operation": {
>     "simulate": false
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

2020-05-28 Thread Geode Integration (Jira)



[ 
https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119174#comment-17119174
 ] 

Geode Integration commented on GEODE-8200:
--

A Pivotal Tracker story has been created for this Issue: 
https://www.pivotaltracker.com/story/show/173071677

> Rebalance operations stuck in "IN_PROGRESS" state forever
> -
>
> Key: GEODE-8200
> URL: https://issues.apache.org/jira/browse/GEODE-8200
> Project: Geode
> Issue Type: Bug
> Components: management
> Reporter: Aaron Lindsey
> Priority: Major
> Labels: GeodeOperationAPI
>
> We use the management REST API to call rebalance immediately before stopping 
> a server to limit the possibility of data loss. In a cluster with 3 locators, 
> 3 servers, and no regions, we noticed that sometimes the rebalance operation 
> never ends if one of the locators is restarting concurrently with the 
> rebalance operation.
> More specifically, the scenario where we see this issue crop up is during an 
> automated "rolling restart" operation in a Kubernetes environment which 
> proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online 
> before restarting
> * Immediately before stopping a server, a rebalance operation is performed 
> and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation will never 
> complete, because it cannot proceed with stopping a server until the 
> rebalance operation is completed. A human is then required to intervene and 
> manually trigger a rebalance and stop the server. This type of "rolling 
> restart" operation is triggered fairly often in Kubernetes — any time part of 
> the configuration of the locators or servers changes. 
> The following JSON is a sample response from the management REST API that 
> shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
> {
>   "statusCode": "IN_PROGRESS",
>   "links": {
>     "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>     "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>   },
>   "operationStart": "2020-05-27T22:38:30.619Z",
>   "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>   "operation": {
>     "simulate": false
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
