[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423064#comment-17423064 ] ASF subversion and git services commented on GEODE-8200: Commit 068c613195bc48ff84d69e65557554b1dcb7b0e4 in geode's branch refs/heads/develop from agingade [ https://gitbox.apache.org/repos/asf?p=geode.git;h=068c613 ] GEODE-8200: After checking locator presence store the status in OperationStateStore (#6914) * GEODE-8200: After checking locator presence store the status in OperationStateStore Co-authored-by: anilkumar gingade > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Affects Versions: 1.14.0, 1.15.0 >Reporter: Aaron Lindsey >Assignee: Anilkumar Gingade >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0, pull-request-available > Fix For: 1.15.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390194#comment-17390194 ] Aaron Lindsey commented on GEODE-8200: -- [~jchen21] yes, checking the restore status is done using the GET management/v1/operations/restoreRedundancy/{id} endpoint. > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0 > Fix For: 1.13.1, 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390190#comment-17390190 ] Jianxia Chen commented on GEODE-8200: - [~aaronlindsey] How is the restore status checked? Is it checked by calling REST API? > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0 > Fix For: 1.13.1, 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390136#comment-17390136 ] Aaron Lindsey commented on GEODE-8200: -- [~agingade] it looks like the first one—start restore, then periodically check the restore status until the restore completes. After the restore has completed, if the status says the restore failed for any reason it will retry the whole process. It repeats this process until it eventually gets a successful restore. > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0 > Fix For: 1.13.1, 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390119#comment-17390119 ] Anilkumar Gingade commented on GEODE-8200: -- [~jchen21] [~aaronlindsey] What is the sequence of commands executed from k8 hook; is it like: - Start restore - Get restore status; if still in progress; keep checking the restore periodically OR - Start restore - Start restore again after sometime? > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0 > Fix For: 1.13.1, 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387644#comment-17387644 ] Aaron Lindsey commented on GEODE-8200: -- [~agingade] We first saw this issue on July 16 while testing Geode at commit [https://github.com/apache/geode/commit/8b7a1a242290523310080a13338c1f85a283c684.] The issue was discovered in an automated test which had been passing consistently for some time. We have only seen the issue happen with the "restore redundancy" operation, but I cannot say for sure that the issue does not happen for the "rebalance" operation as well. We have a closed-source test which reproduces the issue, but it does not reproduce the issue every time. The test does a rolling restart of the Geode cluster by restarting up to one locator and one server at a time. We have a Kubernetes hook which runs "restore redundancy" right before a server is stopped to reduce the chance of data loss. The hook is implemented such that the "restore redundancy" operation must succeed before the server can be stopped. Note that this is the exact same scenario as described in the original ticket description, except that we now use "restore redundancy" instead of "rebalance". > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0 > Fix For: 1.13.1, 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387638#comment-17387638 ] Anilkumar Gingade commented on GEODE-8200: -- >> Re-opened because this issue has started reproducing again on develop. [~aaronlindsey] Can you please add more details to this...Reading this ticket description, the issue was addressed with "Rebalance"; from your comments it seems like its with :restore redundancy" command... Questions: - Is the issue with the "rebalance" command? - Is the issue only with "restore redundancy"? - Is there a test that reproduces the issue? - What are the steps involved in reproducing the issue? Does the issue reproduces every time? > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.13.1, 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386578#comment-17386578 ] Aaron Lindsey commented on GEODE-8200: -- The only difference is that now it's happening for restore redundancy instead of rebalance, but it's pretty much the same scenario where we see the issue. (We switched to using restore redundancy instead of rebalance after that operation was added to Geode.) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.13.1, 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386570#comment-17386570 ] Aaron Lindsey commented on GEODE-8200: -- Re-opened because this issue has started reproducing again on develop. > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.13.1, 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204098#comment-17204098 ] ASF subversion and git services commented on GEODE-8200: Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch refs/heads/support/1.13 from Jinmei Liao [ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ] GEODE-8200: enhance GfshRule to specify a working dir (#5299) * improve some backward compatibility test to cover more versions. (cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204096#comment-17204096 ] ASF subversion and git services commented on GEODE-8200: Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch refs/heads/support/1.13 from Jianxia Chen [ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ] GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350) Record the locator that issues the original Rest API request. If the locator is offline afterwards, report an error. Co-authored-by: Jianxia Chen Co-authored-by: Jinmei Liao (cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204088#comment-17204088 ] ASF subversion and git services commented on GEODE-8200: Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch refs/heads/support/1.13 from Jianxia Chen [ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ] GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350) Record the locator that issues the original Rest API request. If the locator is offline afterwards, report an error. Co-authored-by: Jianxia Chen Co-authored-by: Jinmei Liao (cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204090#comment-17204090 ] ASF subversion and git services commented on GEODE-8200: Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch refs/heads/support/1.13 from Jinmei Liao [ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ] GEODE-8200: enhance GfshRule to specify a working dir (#5299) * improve some backward compatibility test to cover more versions. (cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204079#comment-17204079 ] ASF subversion and git services commented on GEODE-8200: Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch refs/heads/support/1.13 from Jianxia Chen [ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ] GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350) Record the locator that issues the original Rest API request. If the locator is offline afterwards, report an error. Co-authored-by: Jianxia Chen Co-authored-by: Jinmei Liao (cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204082#comment-17204082 ] ASF subversion and git services commented on GEODE-8200: Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch refs/heads/support/1.13 from Jinmei Liao [ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ] GEODE-8200: enhance GfshRule to specify a working dir (#5299) * improve some backward compatibility test to cover more versions. (cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204070#comment-17204070 ] ASF subversion and git services commented on GEODE-8200: Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch refs/heads/support/1.13 from Jinmei Liao [ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ] GEODE-8200: enhance GfshRule to specify a working dir (#5299) * improve some backward compatibility test to cover more versions. (cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204068#comment-17204068 ] ASF subversion and git services commented on GEODE-8200: Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch refs/heads/support/1.13 from Jianxia Chen [ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ] GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350) Record the locator that issues the original Rest API request. If the locator is offline afterwards, report an error. Co-authored-by: Jianxia Chen Co-authored-by: Jinmei Liao (cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204062#comment-17204062 ] ASF subversion and git services commented on GEODE-8200: Commit 3149d9680e97351090c61f1ceb694bf8b7d6f182 in geode's branch refs/heads/support/1.13 from Jinmei Liao [ https://gitbox.apache.org/repos/asf?p=geode.git;h=3149d96 ] GEODE-8200: enhance GfshRule to specify a working dir (#5299) * improve some backward compatibility test to cover more versions. (cherry picked from commit 561533c53cf44e53c42f26cd988eae6821af6769) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204059#comment-17204059 ] ASF subversion and git services commented on GEODE-8200: Commit 4721b164501bba82309041522172276ed3042f4a in geode's branch refs/heads/support/1.13 from Jianxia Chen [ https://gitbox.apache.org/repos/asf?p=geode.git;h=4721b16 ] GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350) Record the locator that issues the original Rest API request. If the locator is offline afterwards, report an error. Co-authored-by: Jianxia Chen Co-authored-by: Jinmei Liao (cherry picked from commit 426b9de66ae9adadde8f1994b2340fc9de809d81) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Fix For: 1.14.0 > > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155757#comment-17155757 ] ASF subversion and git services commented on GEODE-8200: Commit 426b9de66ae9adadde8f1994b2340fc9de809d81 in geode's branch refs/heads/develop from Jianxia Chen [ https://gitbox.apache.org/repos/asf?p=geode.git;h=426b9de ] GEODE-8200: Rebalance operations stuck in "IN_PROGRESS" state forever (#5350) Record the locator that issues the original Rest API request. If the locator is offline afterwards, report an error. Co-authored-by: Jianxia Chen Co-authored-by: Jinmei Liao > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155756#comment-17155756 ] ASF GitHub Bot commented on GEODE-8200: --- jchen21 merged pull request #5350: URL: https://github.com/apache/geode/pull/5350 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154981#comment-17154981 ] ASF GitHub Bot commented on GEODE-8200: --- jchen21 commented on a change in pull request #5350: URL: https://github.com/apache/geode/pull/5350#discussion_r452537292 ## File path: geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationState.java ## @@ -28,12 +28,25 @@ */ public class OperationState, V extends OperationResult> implements Identifiable { + private static final long serialVersionUID = 8212319653561969588L; private final String opId; private final A operation; private final Date operationStart; private Date operationEnd; private V result; private Throwable throwable; + private String locator; Review comment: Do you mean `InternalDistributedMember` when you talk about `DistributedID`? I was using the `InternalDistributedMember` as the type for `locator` field. But it is a large object with many fields. We only need to identify the locator, so a String type of a member's ID should be good. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154974#comment-17154974 ] ASF GitHub Bot commented on GEODE-8200: --- jchen21 commented on a change in pull request #5350: URL: https://github.com/apache/geode/pull/5350#discussion_r452534574 ## File path: geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationHistoryManager.java ## @@ -90,6 +95,27 @@ private static boolean isExpired(long expirationTime, OperationState opera return operationEnd.getTime() <= expirationTime; } + private OperationState validateLocator(OperationState operationState) { +if (isLocatorOffline(operationState)) { + operationState.setOperationEnd(new Date(), null, + new RuntimeException("Locator that initiated the Rest API operation is offline: " + + operationState.getLocator())); +} + +return operationState; + } + + private boolean isLocatorOffline(OperationState operationState) { +if (operationState.getOperationEnd() == null +&& (operationState.getLocator() != null) +&& cache.getMyId().toString().compareTo(operationState.getLocator()) != 0 Review comment: Good point! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154971#comment-17154971 ] ASF GitHub Bot commented on GEODE-8200: --- agingade commented on a change in pull request #5350: URL: https://github.com/apache/geode/pull/5350#discussion_r452518577 ## File path: geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationHistoryManager.java ## @@ -90,6 +95,27 @@ private static boolean isExpired(long expirationTime, OperationState opera return operationEnd.getTime() <= expirationTime; } + private OperationState validateLocator(OperationState operationState) { +if (isLocatorOffline(operationState)) { + operationState.setOperationEnd(new Date(), null, + new RuntimeException("Locator that initiated the Rest API operation is offline: " + + operationState.getLocator())); +} + +return operationState; + } + + private boolean isLocatorOffline(OperationState operationState) { +if (operationState.getOperationEnd() == null +&& (operationState.getLocator() != null) +&& cache.getMyId().toString().compareTo(operationState.getLocator()) != 0 Review comment: Does it need to be compared? can it be changed to "equals" ## File path: geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationState.java ## @@ -28,12 +28,25 @@ */ public class OperationState, V extends OperationResult> implements Identifiable { + private static final long serialVersionUID = 8212319653561969588L; private final String opId; private final A operation; private final Date operationStart; private Date operationEnd; private V result; private Throwable throwable; + private String locator; Review comment: Can this be DistributedID than a String ID. That way we can avoid converting to string in other places? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154964#comment-17154964 ] ASF GitHub Bot commented on GEODE-8200: --- jchen21 commented on a change in pull request #5350: URL: https://github.com/apache/geode/pull/5350#discussion_r452488277 ## File path: geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationState.java ## @@ -28,12 +28,25 @@ */ public class OperationState, V extends OperationResult> implements Identifiable { + private static final long serialVersionUID = 8212319653561969588L; private final String opId; private final A operation; private final Date operationStart; private Date operationEnd; private V result; private Throwable throwable; + private String locator; + + public String getLocator() { +return this.locator; + } + + public void setLocator( + String locator) { +synchronized (this) { Review comment: This is for consistency. `setOperationEnd()` does the same. ## File path: geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationStateStore.java ## @@ -53,6 +53,8 @@ */ void recordEnd(String opId, V result, Throwable exception); + void recordLocator(String opId, String locator); Review comment: Good point. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154023#comment-17154023 ] ASF GitHub Bot commented on GEODE-8200: --- jinmeiliao commented on a change in pull request #5350: URL: https://github.com/apache/geode/pull/5350#discussion_r451826206 ## File path: geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationState.java ## @@ -28,12 +28,25 @@ */ public class OperationState, V extends OperationResult> implements Identifiable { + private static final long serialVersionUID = 8212319653561969588L; private final String opId; private final A operation; private final Date operationStart; private Date operationEnd; private V result; private Throwable throwable; + private String locator; + + public String getLocator() { +return this.locator; + } + + public void setLocator( + String locator) { +synchronized (this) { Review comment: this is just one line operation, is this not atomic? If not, can we put the synchronize on the method? ## File path: geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationStateStore.java ## @@ -53,6 +53,8 @@ */ void recordEnd(String opId, V result, Throwable exception); + void recordLocator(String opId, String locator); Review comment: instead of adding this interface, you can change the method of recordStart() to add the a locator id parameter, since when started, we should always know what locator started this operation. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152396#comment-17152396 ] ASF GitHub Bot commented on GEODE-8200: --- jchen21 opened a new pull request #5350: URL: https://github.com/apache/geode/pull/5350 Thank you for submitting a contribution to Apache Geode. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)? - [ ] Is your initial contribution a single, squashed commit? - [ ] Does `gradlew build` run cleanly? - [ ] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? ### Note: Please ensure that once the PR is submitted, check Concourse for build issues and submit an update to your PR as soon as possible. If you need help, please send an email to d...@geode.apache.org. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145103#comment-17145103 ] ASF subversion and git services commented on GEODE-8200: Commit 561533c53cf44e53c42f26cd988eae6821af6769 in geode's branch refs/heads/develop from Jinmei Liao [ https://gitbox.apache.org/repos/asf?p=geode.git;h=561533c ] GEODE-8200: enhance GfshRule to specify a working dir (#5299) * improve some backward compatibility test to cover more versions. > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145100#comment-17145100 ] ASF GitHub Bot commented on GEODE-8200: --- jinmeiliao commented on a change in pull request #5299: URL: https://github.com/apache/geode/pull/5299#discussion_r445705089 ## File path: geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java ## @@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) { } execute(GfshScript.of(stopMemberScripts).withName("Stop-Members")); } + + public static String startServerCommand(String name, int port, int connectedLocatorPort) { Review comment: Thanks! I will merge this now and we can make improvements on it when we get to consolidating all these methods into one place. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145102#comment-17145102 ] ASF GitHub Bot commented on GEODE-8200: --- jinmeiliao merged pull request #5299: URL: https://github.com/apache/geode/pull/5299 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145088#comment-17145088 ] ASF GitHub Bot commented on GEODE-8200: --- kirklund commented on a change in pull request #5299: URL: https://github.com/apache/geode/pull/5299#discussion_r445695124 ## File path: geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java ## @@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) { } execute(GfshScript.of(stopMemberScripts).withName("Stop-Members")); } + + public static String startServerCommand(String name, int port, int connectedLocatorPort) { Review comment: We could commit what you have ready and then make further changes. I'll go ahead and approve and let you decide. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145087#comment-17145087 ] ASF GitHub Bot commented on GEODE-8200: --- kirklund commented on a change in pull request #5299: URL: https://github.com/apache/geode/pull/5299#discussion_r445693168 ## File path: geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java ## @@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) { } execute(GfshScript.of(stopMemberScripts).withName("Stop-Members")); } + + public static String startServerCommand(String name, int port, int connectedLocatorPort) { Review comment: There are actually a couple existing classes like this that we could move to geode-junit and make any necessary changes: ``` geode-core/src/integrationTest/java/org/apache/geode/distributed/LocatorCommand.java geode-core/src/integrationTest/java/org/apache/geode/distributed/ServerCommand.java ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145085#comment-17145085 ] ASF GitHub Bot commented on GEODE-8200: --- kirklund commented on a change in pull request #5299: URL: https://github.com/apache/geode/pull/5299#discussion_r445693168 ## File path: geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java ## @@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) { } execute(GfshScript.of(stopMemberScripts).withName("Stop-Members")); } + + public static String startServerCommand(String name, int port, int connectedLocatorPort) { Review comment: There are actually a couple existing classes like this that we could move to geode-dunit and make any necessary changes: ``` geode-core/src/integrationTest/java/org/apache/geode/distributed/LocatorCommand.java geode-core/src/integrationTest/java/org/apache/geode/distributed/ServerCommand.java ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145081#comment-17145081 ] ASF GitHub Bot commented on GEODE-8200: --- kirklund commented on a change in pull request #5299: URL: https://github.com/apache/geode/pull/5299#discussion_r445691690 ## File path: geode-junit/src/main/java/org/apache/geode/test/junit/rules/gfsh/GfshRule.java ## @@ -199,4 +220,23 @@ private void stopMembers(GfshExecution gfshExecution) { } execute(GfshScript.of(stopMemberScripts).withName("Stop-Members")); } + + public static String startServerCommand(String name, int port, int connectedLocatorPort) { Review comment: These new static methods don't really fit well in GfshRule. I think you should probably move them to two new classes (probably not a rule) named something like ServerCommandBuilder and LocatorCommandBuilder, and explode the parameters to each have their own setter type method: ``` String startServerCommand = new ServerCommandBuilder() .withPort(port) .withLocator(locatorPort) .create(name); ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144488#comment-17144488 ] ASF GitHub Bot commented on GEODE-8200: --- jchen21 commented on a change in pull request #5299: URL: https://github.com/apache/geode/pull/5299#discussion_r445226334 ## File path: geode-assembly/src/upgradeTest/java/org/apache/geode/management/DeploymentManagementUpgradeTest.java ## @@ -55,16 +84,16 @@ public static void beforeClass() throws Exception { @Test public void newLocatorCanReadOldConfigurationData() throws IOException { -File workingDir = tempFolder.newFolder(); int[] ports = AvailablePortHelper.getRandomAvailableTCPPorts(3); -oldGfsh.execute("start locator --name=test --port=" + ports[0] + " --http-service-port=" -+ ports[1] + " --dir=" + workingDir.getAbsolutePath() + " --J=-Dgemfire.jmx-manager-port=" -+ ports[2], -"deploy --jar=" + clusterJar.getAbsolutePath(), -"shutdown --include-locators"); +GfshExecution execute = +GfshScript.of(startLocatorCommand("test", ports[0], ports[2], ports[1], 0)) Review comment: For code readability, it is better to have some meaningful names for the ports. Before the code change, it is easy to identify the purpose of a port with the gfsh option in the code. Now it is not that straight forward to tell ports[0] is locator port. port[2] is JMX port and ports[1] is HTTP port, unless you see to the definition of `startLocatorCommand`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144391#comment-17144391 ] ASF GitHub Bot commented on GEODE-8200: --- jinmeiliao opened a new pull request #5299: URL: https://github.com/apache/geode/pull/5299 * improve some backward compatibility test to cover more versions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI > Attachments: GEODE-8200-exportedLogs.zip > > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119989#comment-17119989 ] Anilkumar Gingade commented on GEODE-8200: -- probable workaround be, restart locator and server in sequence. > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Priority: Major > Labels: GeodeOperationAPI > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119985#comment-17119985 ] Anilkumar Gingade commented on GEODE-8200: -- The test scenario is: 3 locators 3 servers Concurrent "rolling restart locator" and "rolling restart server" In rolling restart locator: For each locator: - Stop locator - Start locator - Wait for locator to come online (using rest api) In rolling restart server: For each Server: - Call rest rebalance - Wait for rebalance to complete - Stop Server - Start Server - Wait for Server to come online (using rest api) > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Priority: Major > Labels: GeodeOperationAPI > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119892#comment-17119892 ] Jinmei Liao commented on GEODE-8200: If a locator that initiated the rebalance operation went down before it can record the "completed" state of the operation, then that operation status will be "orphaned". i.e. the executor died, there is no one to execute the "whenComplete" section of the code here: https://github.com/apache/geode/blob/57cc3c7b40816bc1b7bbae80481dea608c7caff5/geode-core/src/main/java/org/apache/geode/management/internal/operation/OperationManager.java#L70 That operation id will be stuck in "in_progress" status and kept in the history forever, because apparently we only clean the "done" record. > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Priority: Major > Labels: GeodeOperationAPI > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever
[ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119174#comment-17119174 ] Geode Integration commented on GEODE-8200: -- A Pivotal Tracker story has been created for this Issue: https://www.pivotaltracker.com/story/show/173071677 > Rebalance operations stuck in "IN_PROGRESS" state forever > - > > Key: GEODE-8200 > URL: https://issues.apache.org/jira/browse/GEODE-8200 > Project: Geode > Issue Type: Bug > Components: management >Reporter: Aaron Lindsey >Priority: Major > Labels: GeodeOperationAPI > > We use the management REST API to call rebalance immediately before stopping > a server to limit the possibility of data loss. In a cluster with 3 locators, > 3 servers, and no regions, we noticed that sometimes the rebalance operation > never ends if one of the locators is restarting concurrently with the > rebalance operation. > More specifically, the scenario where we see this issue crop up is during an > automated "rolling restart" operation in a Kubernetes environment which > proceeds as follows: > * At most one locator and one server are restarting at any point in time > * Each locator/server waits until the previous locator/server is fully online > before restarting > * Immediately before stopping a server, a rebalance operation is performed > and the server is not stopped until the rebalance operation is completed > The impact of this issue is that the "rolling restart" operation will never > complete, because it cannot proceed with stopping a server until the > rebalance operation is completed. A human is then required to intervene and > manually trigger a rebalance and stop the server. This type of "rolling > restart" operation is triggered fairly often in Kubernetes — any time part of > the configuration of the locators or servers changes. > The following JSON is a sample response from the management REST API that > shows the rebalance operation stuck in "IN_PROGRESS". > {code} > { > "statusCode": "IN_PROGRESS", > "links": { > "self": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7";, > "list": > "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"; > }, > "operationStart": "2020-05-27T22:38:30.619Z", > "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7", > "operation": { > "simulate": false > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)