Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/20276 )
Change subject: IMPALA-12321: Fix the race condition for updating active catalogd ...................................................................... Patch Set 1: (14 comments) http://gerrit.cloudera.org:8080/#/c/20276/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20276/1//COMMIT_MSG@10 PS1, Line 10: during registration, or : a new active catalogd is elected nit: "on two different events: catalogd registration, and a new active catalogd election change" http://gerrit.cloudera.org:8080/#/c/20276/1//COMMIT_MSG@12 PS1, Line 12: two different kinds of RPCs nit: what is the difference? the value of 'for_registration' parameter? http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/catalog/catalog-server.h File be/src/catalog/catalog-server.h: http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/catalog/catalog-server.h@163 PS1, Line 163: catalog_objects_to/from_version_ nit: is this refer to catalog_objects_max_version_? http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/catalog/catalog-server.h@171 PS1, Line 171: bool last_update_for_registration_ = false; This is assigned in CatalogServer::UpdateActiveCatalogd, but never used anywhere else. Maybe can be removed? http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/catalog/catalog-server.cc File be/src/catalog/catalog-server.cc: http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/catalog/catalog-server.cc@499 PS1, Line 499: for_registration nit: I think 'is_registration_reply' is more appropriate. http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/catalog/catalog-server.cc@523 PS1, Line 523: last_update_for_registration_ = false; nit: can move this after line 510 and delete the duplicate in line 520. http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/runtime/exec-env.h File be/src/runtime/exec-env.h: http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/runtime/exec-env.h@330 PS1, Line 330: bool last_update_for_registration_ = false; This is assigned in ExecEnv::UpdateActiveCatalogd, but never used anywhere else. Maybe can be removed? http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/runtime/exec-env.cc File be/src/runtime/exec-env.cc: http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/runtime/exec-env.cc@692 PS1, Line 692: last_update_for_registration_ = false; nit: can move this after line 679 and delete the duplicate in line 689. http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/statestore/statestore-catalogd-mgr.cc File be/src/statestore/statestore-catalogd-mgr.cc: http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/statestore/statestore-catalogd-mgr.cc@126 PS1, Line 126: DCHECK(num_registered_catalogd_ == 2); Question: If I maliciously register more than 2 CatalogD, will statestore crash hitting this DCHECK? http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/statestore/statestore-subscriber.h File be/src/statestore/statestore-subscriber.h: http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/statestore/statestore-subscriber.h@116 PS1, Line 116: typedef boost::function<void (bool for_registration, int64_t active_catalogd_version, : const TCatalogRegistration& catalogd_registration)> UpdateCatalogdCallback; Can we document the parameters here? I'd like to get clarification on: - What is the meaning of 'for_registration' value? - What is the meaning of negative active_catalogd_version? active_catalogd_version parameter in this function looks like a meld between has_active_catalogd and active_catalogd_version from StatestoreStub::Register. - When is catalogd_registration deemed invalid and safe to ignore? http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/statestore/statestore-subscriber.cc File be/src/statestore/statestore-subscriber.cc: http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/statestore/statestore-subscriber.cc@593 PS1, Line 593: active_catalogd_version = -1 Can we make sure that StateStore should never return negative active_catalogd_version before this override? Maybe add DCHECK or WARNING log. http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/statestore/statestore-subscriber.cc@667 PS1, Line 667: if (!l.owns_lock()) { : *update_skipped = true; : return; : } nit: Is it expected that StateStore will keep retrying and subscriber will eventually owns the lock and process the RPC instead of skipping indefinitely? http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/statestore/statestore.cc File be/src/statestore/statestore.cc: http://gerrit.cloudera.org:8080/#/c/20276/1/be/src/statestore/statestore.cc@1332 PS1, Line 1332: if (update_skipped) { : // The subscriber skipped processing this update. It's not considered as a failure : // since subscribers can decide what they do with any update. The subscriber is : // left in the receiver list so that RPC will be resent to it in next round. : ++it; Is it possible and okay for StatestoreSubcriber to repeatedly skip notification without error? Adding logs in the StateStore or the subscriber side can help triage if bad case do happen. http://gerrit.cloudera.org:8080/#/c/20276/1/common/thrift/StatestoreService.thrift File common/thrift/StatestoreService.thrift: http://gerrit.cloudera.org:8080/#/c/20276/1/common/thrift/StatestoreService.thrift@358 PS1, Line 358: 2: optional bool skipped; nit: Maybe worth to add 'string skip_reason' field for debug/trace purpose? -- To view, visit http://gerrit.cloudera.org:8080/20276 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie49947e563d43c59bdd476b28c35be69848ae12a Gerrit-Change-Number: 20276 Gerrit-PatchSet: 1 Gerrit-Owner: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Andrew Sherman <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Comment-Date: Thu, 27 Jul 2023 20:20:18 +0000 Gerrit-HasComments: Yes
