Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14222 )
Change subject: KUDU-2069 p4: stop replication from failed servers in maintenance mode ...................................................................... Patch Set 6: (7 comments) http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc File src/kudu/consensus/quorum_util-test.cc: http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@645 PS5, Line 645: // Even if the replica to replace is meant to be ignored on failure, we : // should honor the replacement and try to add a replica. : for (char health : { '+', '-', '?' }) { : RaftConfigPB config; : AddPeer(&config, "A", V, health, {{"REPLACE", true}}); : AddPeer(&config, "B", V, '+'); : AddPeer(&config, "C", V, '+'); : EXPECT_TRUE(ShouldAddReplica(config, 3, { "A" })); : } : { : RaftConfigPB config; : AddPeer(&config, "A", V, '+', {{"REPLACE", true}}); : AddPeer(&config, "B", V, '-'); : AddPeer(&config, "C", V, '+'); : // Ignoring failures shouldn't impede our ability to add a replica when the : > nit: maybe, put this under for () cycle to iterate over the health status o Done http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@675 PS5, Line 675: p > Does it make sense to add coverage for other health statuses as well (unkno Done http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@684 PS5, Line 684: EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" })); : // But when there is a failure that isn't suppose > ditto Done http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc File src/kudu/consensus/quorum_util.cc: http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc@455 PS5, Line 455: if (VLOG_IS_ON(1) && ignore_failure_for_underreplication) { : VLOG(1) << Substitute("ignoring $0 if failed", peer_uuid); : } > Does it deserve VLOG(1) it's not important at all? Done http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc File src/kudu/integration-tests/maintenance_mode-itest.cc: http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@269 PS5, Line 269: ASSERT_OK(ChangeTServerState(maintenance_uuid, TServerStateChangePB::EXIT_MAINTENANCE_MODE)); > Thank you for adding this scenario. Ack http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@379 PS5, Line 379: void SetUp() override { : SKIP_IF_SLOW_NOT_ALLOWED(); : NO_FATALS(MaintenanceModeITest::SetUp()); : NO_FATALS(SetUpCluster(5)); > Could it happen that replicas at the server fell behind WAL GC threshold si Good point. I've separated the concern by disabling log GC and testing for FAILED_UNRECOVERABLE sepcifically using disk failures. http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@385 PS5, Line 385: SKIP_IF_SLOW_NOT_ALLOWED(); > Does it make sense to verify the number of written rows after stopping the Done -- To view, visit http://gerrit.cloudera.org:8080/14222 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db Gerrit-Change-Number: 14222 Gerrit-PatchSet: 6 Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Hao Hao <hao....@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Sun, 29 Sep 2019 06:18:15 +0000 Gerrit-HasComments: Yes