[jira] [Commented] (KUDU-3346) Rebalance fails when trying to decommission tserver on a rack-aware cluster
[ https://issues.apache.org/jira/browse/KUDU-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469005#comment-17469005 ]

ASF subversion and git services commented on KUDU-3346:
---

Commit 5ef0168cf0ae4471632d63cad223d7301f415982 in kudu's branch refs/heads/master from zhangyifan27
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=5ef0168 ]

KUDU-3346: fix rebalancer tool fails to run with '--ignored_tservers'

Prior to this patch, the validity of 'ignored_tservers' was checked in 'BuildClusterInfo', which led to a failure when 'raw_info' contained information only about the tservers in a specific location. This patch fixes it by moving the parameter validity check into 'KsckResultsToClusterRawInfo', because the ksck results contain the original cluster information.

I noticed that 'ClusterInfo::tservers_to_empty' does not need to be built in 'BuildClusterInfo', because this info is used only for printing the cluster's stats and running IgnoredTserverRunner. This should be refactored in a follow-up patch.

This patch adds a regression test for the issue, and I also verified this fix on a real cluster.

Change-Id: I1361f562f3e886077a79c3de8ea5fb2ebb8df6e9
Reviewed-on: http://gerrit.cloudera.org:8080/18114
Reviewed-by: Andrew Wong
Tested-by: Andrew Wong

> Rebalance fails when trying to decommission tserver on a rack-aware cluster
> ---
>
> Key: KUDU-3346
> URL: https://issues.apache.org/jira/browse/KUDU-3346
> Project: Kudu
> Issue Type: Bug
> Affects Versions: 1.15.0
> Reporter: Georgiana Ogrean
> Assignee: YifanZhang
> Priority: Major
> Attachments: rebalance_ignored_tserver_1c.log.Z, rebalance_v1.log.Z
>
>
> When following the steps [in the docs|https://docs.cloudera.com/runtime/7.2.0/administering-kudu/topics/kudu-decommissioning-or-permanently-removing-tablet-server-from-cluster.html] for decommissioning a tserver, the rebalance job fails with:
> {code:java}
> Invalid argument: ignored tserver is not reported among know tservers
> {code}
> Steps followed:
> 1. Checked that ksck passes.
> 2. Put the tserver to be decommissioned in maintenance mode.
> {code:java}
> sudo -u kudu kudu tserver state enter_maintenance $MASTER_ADDRESSES 5ae499b1b870419daabb0e8da90ef233
> {code}
> 3. Ran rebalance with {{-ignored_tservers}} and {{-move_replicas_from_ignored_tservers}} flags.
> {code:java}
> sudo -u kudu kudu cluster rebalance $MASTER_ADDRESSES -move_replicas_from_ignored_tservers -ignored_tservers=5ae499b1b870419daabb0e8da90ef233 -v=1
> {code}
> The logs for the rebalance command are attached.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (KUDU-3346) Rebalance fails when trying to decommission tserver on a rack-aware cluster
[ https://issues.apache.org/jira/browse/KUDU-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464941#comment-17464941 ]

YifanZhang commented on KUDU-3346:
--

I think something goes wrong when populating `ClusterInfo::tservers_to_empty`, because the `ClusterRawInfo` sometimes contains the tserver/tablet info of only one specific location. I plan to fix it.
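The failure mode described in this comment and in the commit message above can be illustrated with a small model. This is only a sketch in Python, not Kudu's C++ code: the `TServer` class, the function names, the UUIDs `ts-a`/`ts-b`/`ts-c`, and the location names are all hypothetical, and the error text is modelled loosely on the one in the report. It assumes that the pre-fix check ran against a per-location subset of the tservers, while the post-fix check runs once against the whole cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TServer:
    """Hypothetical stand-in for one tablet server entry in the cluster info."""
    uuid: str
    location: str


def check_ignored_tservers(known, ignored):
    """Raise ValueError if an ignored UUID is missing from the `known` tservers."""
    known_uuids = {ts.uuid for ts in known}
    for uuid in sorted(ignored):
        if uuid not in known_uuids:
            raise ValueError(
                f"ignored tserver {uuid} is not reported among known tservers")


def validate_per_location(cluster, ignored):
    """Pre-fix behaviour as modelled here: check against each location's subset."""
    for location in sorted({ts.location for ts in cluster}):
        subset = [ts for ts in cluster if ts.location == location]
        check_ignored_tservers(subset, ignored)


def validate_whole_cluster(cluster, ignored):
    """Post-fix behaviour as modelled here: check once against the full cluster."""
    check_ignored_tservers(cluster, ignored)


# Hypothetical three-location cluster with one tserver per location.
CLUSTER = [
    TServer("ts-a", "/us-east-1/us-east-1a"),
    TServer("ts-b", "/us-east-1/us-east-1b"),
    TServer("ts-c", "/us-east-1/us-east-1c"),
]
IGNORED = {"ts-a"}

if __name__ == "__main__":
    # Passes: the ignored tserver is part of the full cluster.
    validate_whole_cluster(CLUSTER, IGNORED)
    try:
        # Fails as soon as a location that does not host ts-a is checked.
        validate_per_location(CLUSTER, IGNORED)
    except ValueError as err:
        print("per-location check failed:", err)
```

In this model the ignored tserver is perfectly valid, yet the per-location check rejects it for every location other than its own, which is why validating against the unfiltered cluster information avoids the spurious error.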
[jira] [Commented] (KUDU-3346) Rebalance fails when trying to decommission tserver on a rack-aware cluster
[ https://issues.apache.org/jira/browse/KUDU-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464857#comment-17464857 ]

Georgiana Ogrean commented on KUDU-3346:

In case it helps with getting to the bottom of this: after noticing that some log lines appear twice for tservers in us-east-1c, e.g.
{code:java}
I1223 13:52:53.569551 11613 rebalancer.cc:305] found tserver ca2b022920654fd2aacd320adfe39148 at location '/us-east-1/us-east-1c'
{code}
I tried placing a tserver in that region in maintenance mode and then running rebalance with the same flags. It fails with the same error as above. However, for the other two regions in our cluster, all it printed before failing was the *Locations load summary* table, whereas when ignoring a tserver in us-east-1c it also prints the *replica distribution summary* tables for that region (both per-server and per-table).

I attached the rebalance log file from a run with a tserver in us-east-1c ignored after being put in maintenance mode. [^rebalance_ignored_tserver_1c.log.Z]