Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23194 )

Change subject: IMPALA-14220 (part 2): Delay AcceptRequest until catalog is 
stable
......................................................................


Patch Set 4:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.h
File be/src/catalog/catalog-server.h:

http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.h@229
PS4, Line 229: If operation also need to obtain catalog_lock_, this lock must 
be obtained first.
We don't do this in all places yet, e.g. UpdateCatalogTopicCallback. But it 
seems we don't need this. Just enforce the lock acquiring order might be enough.


http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc@801
PS4, Line 801:   if (!FLAGS_enable_catalogd_ha) return Status::OK();
When HA is disabled, should we wait until catalog reset starts in JVM, i.e. 
wait until 'current_catalog_version' reaches 'min_catalog_version_to_serve_', 
which is the issue we fixed in IMPALA-13850 (part 2) ?


http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc@987
PS4, Line 987: a little bit of inconsistency
It would impact correctness, e.g. wrong results or query failures. So I'm not 
sure whether we should allow this, if we announce our HA is zero-downtime.


http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc@989
PS4, Line 989:     int64_t min_version = current_catalog_version + 
CATALOG_VERSION_INCREMENT_ON_RESET;
After we get the catalog version, it could be increased by concurrent 
operations, e.g. from EventProcessor or CatalogdTableInvalidator. This is 
different than the startup process where EventProcessor is paused and no tables 
can be invalidated by CatalogdTableInvalidator.

Ideally we should somehow "freeze" the catalog version until the reset thread 
locks it. Or just pause EventProcessor and CatalogdTableInvalidator since these 
are the only two sources that I can think about that are still running in the 
standby catalogd.

However, I think the currret hack still works in practise since 100 is a big 
gap in catalog version.


http://gerrit.cloudera.org:8080/#/c/23194/4/be/src/catalog/catalog-server.cc@994
PS4, Line 994:   }
We should log the error if status is not OK.


http://gerrit.cloudera.org:8080/#/c/23194/4/tests/custom_cluster/test_catalogd_ha.py
File tests/custom_cluster/test_catalogd_ha.py:

http://gerrit.cloudera.org:8080/#/c/23194/4/tests/custom_cluster/test_catalogd_ha.py@538
PS4, Line 538: skip_func_test=True
It should pass without this since --catalogd_ha_reset_metadata_on_failover=true



--
To view, visit http://gerrit.cloudera.org:8080/23194
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I370d21319335318e441ec3c3455bac4227803900
Gerrit-Change-Number: 23194
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Comment-Date: Tue, 22 Jul 2025 00:58:58 +0000
Gerrit-HasComments: Yes

Reply via email to