Quanlong Huang created IMPALA-14266:
---------------------------------------
Summary: test_catalogd_manual_failover_with_failed_rpc is flaky in __verify_impalad_active_catalogd_port
Key: IMPALA-14266
URL: https://issues.apache.org/jira/browse/IMPALA-14266
Project: IMPALA
Issue Type: Bug
Components: Test
Reporter: Quanlong Huang
Saw a failure in TestCatalogdHA.test_catalogd_manual_failover_with_failed_rpc
on a private branch:
{code:python}
custom_cluster/test_catalogd_ha.py:393: in test_catalogd_manual_failover_with_failed_rpc
    self.__test_catalogd_manual_failover(unique_database)
custom_cluster/test_catalogd_ha.py:333: in __test_catalogd_manual_failover
    self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)
custom_cluster/test_catalogd_ha.py:84: in __verify_impalad_active_catalogd_port
    assert int(catalog_service_port) == catalogd_service.get_catalog_service_port()
E assert 26000 == 26001
E + where 26000 = int('26000')
E + and 26001 = <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>>()
E + where <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>> = <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>.get_catalog_service_port{code}
The test code assumes that once the new active catalogd has updated its
"active-status" metric, the statestore and all coordinators have also
updated their "active-catalogd-address":
{code:python}
# Kill active catalogd
active_catalogd.kill()
# Wait for long enough for the statestore to detect the failure of active catalogd
# and assign active role to standby catalogd.
catalogd_service_2.wait_for_metric_value(
    "catalog-server.active-status", expected_value=True, timeout=30)
assert catalogd_service_2.get_metric_value(
    "catalog-server.ha-number-active-status-change") > 0
assert catalogd_service_2.get_metric_value("catalog-server.active-status")
# Verify ports of the active catalogd of statestore and impalad are matching with
# the catalog service port of the current active catalogd.
self.__verify_statestore_active_catalogd_port(catalogd_service_2)
self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)  # <---- Failed here
self.__verify_impalad_active_catalogd_port(1, catalogd_service_2)
self.__verify_impalad_active_catalogd_port(2, catalogd_service_2){code}
I don't think that assumption holds: the fact that one subscriber (the new
active catalogd) has processed the update doesn't guarantee that all other
subscribers have processed it too. We should wait, with a timeout, for their
metrics to be updated.
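A minimal sketch of such a wait, assuming a generic polling helper. The names `wait_for_value` and `LaggingCoordinator` are hypothetical and only illustrate the idea; they are not part of the Impala test framework, and the real fix may well reuse an existing helper instead.

```python
import time


def wait_for_value(get_value, expected, timeout_s=30.0, interval_s=0.5):
    """Poll get_value() until it returns `expected` or `timeout_s` elapses.

    Returns the last observed value so the caller can assert on it and get a
    useful failure message if the timeout was hit.
    (Hypothetical helper, not an Impala test framework API.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        actual = get_value()
        if actual == expected or time.monotonic() >= deadline:
            return actual
        time.sleep(interval_s)


class LaggingCoordinator:
    """Stand-in for a coordinator that still reports the old active catalogd
    port for a few polls before it processes the statestore update."""

    def __init__(self, stale_polls, old_port=26000, new_port=26001):
        self._remaining = stale_polls
        self._old_port = old_port
        self._new_port = new_port

    def active_catalogd_port(self):
        if self._remaining > 0:
            self._remaining -= 1
            return self._old_port
        return self._new_port


# An immediate check would see the stale port 26000, as in the failure above;
# polling with a timeout tolerates the lag.
coordinator = LaggingCoordinator(stale_polls=2)
port = wait_for_value(coordinator.active_catalogd_port, 26001,
                      timeout_s=5.0, interval_s=0.01)
assert port == 26001
```

Returning the last observed value rather than a boolean keeps the assertion message informative when the wait times out.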
CC [~wzhou], [~rizaon]