Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/23194 )
Change subject: IMPALA-14220 (part 2): Delay AcceptRequest until catalog is stable ...................................................................... Patch Set 4: (6 comments) http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.h File be/src/catalog/catalog-server.h: http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.h@229 PS4, Line 229: If operation also need to obtain catalog_lock_, this lock must be obtained first. We don't do this in all places yet, e.g. UpdateCatalogTopicCallback. But it seems we don't need this. Just enforce the lock acquiring order might be enough. http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc File be/src/catalog/catalog-server.cc: http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc@801 PS4, Line 801: if (!FLAGS_enable_catalogd_ha) return Status::OK(); When HA is disabled, should we wait until catalog reset starts in JVM, i.e. wait until 'current_catalog_version' reaches 'min_catalog_version_to_serve_', which is the issue we fixed in IMPALA-13850 (part 2) ? http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc@987 PS4, Line 987: a little bit of inconsistency It would impact correctness, e.g. wrong results or query failures. So I'm not sure whether we should allow this, if we announce our HA is zero-downtime. http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc@989 PS4, Line 989: int64_t min_version = current_catalog_version + CATALOG_VERSION_INCREMENT_ON_RESET; After we get the catalog version, it could be increased by concurrent operations, e.g. from EventProcessor or CatalogdTableInvalidator. This is different than the startup process where EventProcessor is paused and no tables can be invalidated by CatalogdTableInvalidator. Ideally we should somehow "freeze" the catalog version until the reset thread locks it. Or just pause EventProcessor and CatalogdTableInvalidator since these are the only two sources that I can think about that are still running in the standby catalogd. However, I think the currret hack still works in practise since 100 is a big gap in catalog version. http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc@994 PS4, Line 994: } We should log the error if status is not OK. http://gerrit.cloudera.org:8080/#/c/23194/4/tests/custom_cluster/test_catalogd_ha.py File tests/custom_cluster/test_catalogd_ha.py: http://gerrit.cloudera.org:8080/#/c/23194/4/tests/custom_cluster/test_catalogd_ha.py@538 PS4, Line 538: skip_func_test=True It should pass without this since --catalogd_ha_reset_metadata_on_failover=true -- To view, visit http://gerrit.cloudera.org:8080/23194 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I370d21319335318e441ec3c3455bac4227803900 Gerrit-Change-Number: 23194 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Comment-Date: Tue, 22 Jul 2025 00:58:58 +0000 Gerrit-HasComments: Yes
