Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/22634 )
Change subject: IMPALA-13850: Wait until CatalogD active before resetting metadata ...................................................................... Patch Set 6: (5 comments) http://gerrit.cloudera.org:8080/#/c/22634/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22634/5//COMMIT_MSG@13 PS5, Line 13: registration. catalog_lock_ was held by GatherCatalogUpdatesThread which > Let's mention the lock is held by GatherCatalogUpdatesThread which is calli Done http://gerrit.cloudera.org:8080/#/c/22634/5//COMMIT_MSG@23 PS5, Line 23: TriggerResetMetadata that monitor and trigger this initial reset > I think coordinators will still wait for the initial catalog updates from s It appears to be the case. https://github.com/apache/impala/commit/1d63348b933b266f63d76b06eecbdf636cb45770 I wonder if sending minimal catalog update for the first time is OK to signal that CatalogD is live. Patch set 6 employs this strategy. My WIP at https://gerrit.cloudera.org/c/22640 can break down long CatalogServiceCatalog.reset() call, but it can still block at slow-loading Db. http://gerrit.cloudera.org:8080/#/c/22634/5/be/src/catalog/catalog-server.cc File be/src/catalog/catalog-server.cc: http://gerrit.cloudera.org:8080/#/c/22634/5/be/src/catalog/catalog-server.cc@590 PS5, Line 590: if (!status.ok()) { : status.AddDetail("CatalogService failed to start"); : return status; : } : // Add callback to handle notification of updating catalogd from Statestore. : if (FLAGS_enable_catalogd_ha) { : S > If FLAGS_force_catalogd_active is true, TriggerResetMetadataLocked() will b Removed. catalog_first_reset_metadata_thread_ now managed it. http://gerrit.cloudera.org:8080/#/c/22634/5/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java: http://gerrit.cloudera.org:8080/#/c/22634/5/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@2322 PS5, Line 2322: DebugUtils.executeDebugAction(BackendConfig.INSTANCE.debugActions(), : DebugUtils.RESET_METADATA_LOOP_LOCKED); > What about moving this outside of the loop so the wait time is independent I plan to have another debug action RESET_METADATA_LOOP_UNLOCKED in https://gerrit.cloudera.org/c/22640/. I'd like both of them in Db loop so that we can tests 2 variety of slowdown while looping Db list. For this specific patch though, it is sufficient for test_catalogd_auto_failover_slow to have the Db loop hold the lock for over than 1 second. I think without any debug action, this loop can complete in under half of a second. http://gerrit.cloudera.org:8080/#/c/22634/5/tests/custom_cluster/test_catalogd_ha.py File tests/custom_cluster/test_catalogd_ha.py: http://gerrit.cloudera.org:8080/#/c/22634/5/tests/custom_cluster/test_catalogd_ha.py@109 PS5, Line 109: assert page.status_code == requests.codes.ok > Can we also verify the coordinator is healthy? Done -- To view, visit http://gerrit.cloudera.org:8080/22634 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I58cc66dcccedb306ff11893f2916ee5ee6a3efc1 Gerrit-Change-Number: 22634 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Comment-Date: Tue, 25 Mar 2025 02:58:04 +0000 Gerrit-HasComments: Yes
