[ https://issues.apache.org/jira/browse/IMPALA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784641#comment-17784641 ]
ASF subversion and git services commented on IMPALA-12550:
----------------------------------------------------------

Commit 7403d10a55397f81784f369aa501b9c072d198b6 in impala's branch refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=7403d10a5 ]

IMPALA-12550: Fix flaky test test_statestored_auto_failover_with_disabling_network

Test test_statestored_auto_failover_with_disabling_network failed
occasionally due to delays in the HA handshake or HA heartbeat RPCs
between the two statestore instances. Sometimes the active statestore
took a few minutes to respond to the handshake requests from the
standby statestore.

This patch fixes the issue by not holding the mutex ha_lock_ while
sending the HA handshake and HA heartbeat. Redundant HA heartbeats are
handled on the receiver side. Redundant HA handshakes are harmless.

Testing:
 - Repeatedly ran test_statestored_auto_failover_with_disabling_network
   on Jenkins hundreds of times without failure.
 - Repeatedly ran test_statestored_auto_failover_with_disabling_network
   on a local machine a thousand times without failure.
 - Repeatedly ran all tests in test_statestored_ha.py for over 12 hours
   on Jenkins without failure.
 - Passed core tests.

Change-Id: I515bbaaddfb4bf9bd2a39414cd6e3e4590dfbfb1
Reviewed-on: http://gerrit.cloudera.org:8080/20689
Reviewed-by: Riza Suminto <riza.sumi...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> test_statestored_auto_failover_with_disabling_network flaky
> -----------------------------------------------------------
>
>                 Key: IMPALA-12550
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12550
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>   Affects Versions: Impala 4.4.0
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> TestStatestoredHA.test_statestored_auto_failover_with_disabling_network
> failed with the following stack trace when the test was run repeatedly.
> tests/custom_cluster/test_statestored_ha.py:645: in test_statestored_auto_failover_with_disabling_network
>     "statestore.in-ha-recovery-mode", expected_value=False, timeout=120)
> tests/common/impala_service.py:144: in wait_for_metric_value
>     self.__metric_timeout_assert(metric_name, expected_value, timeout)
> tests/common/impala_service.py:213: in __metric_timeout_assert
>     assert 0, assert_string
> E   AssertionError: Metric statestore.in-ha-recovery-mode did not reach value False in 120s.
>
> From the log messages, the issue was caused by a delay in the HA handshake RPC
> between the two statestore instances. Sometimes the active statestore took a few
> minutes to respond to the handshake requests from the standby statestore.
> This issue is different from IMPALA-12525, which was caused by a locking issue on
> the subscriber side.
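
The locking change described in the commit message (do not hold ha_lock_ while the HA handshake or heartbeat RPC is in flight, and let the receiver deal with redundant heartbeats) can be illustrated with a minimal Python sketch. This is not Impala's code: the statestore is implemented in C++, and the class names, the sequence-number scheme, and the simulated RPC delay below are all hypothetical, chosen only to show the general pattern.

```python
import threading
import time


class StandbyPeer:
    """Hypothetical receiver that tolerates redundant heartbeats."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_seq = -1
        self.accepted = 0

    def heartbeat(self, seq):
        time.sleep(0.01)  # stand-in for a slow network round trip
        with self._lock:
            if seq <= self._last_seq:
                return False  # duplicate or stale heartbeat: ignore it
            self._last_seq = seq
            self.accepted += 1
            return True


class ActiveSender:
    """Hypothetical sender: ha_lock guards state, never the RPC itself."""

    def __init__(self, peer):
        self.ha_lock = threading.Lock()
        self._peer = peer
        self._seq = 0

    def send_heartbeat(self):
        # Snapshot what the RPC needs while holding the lock briefly.
        with self.ha_lock:
            seq = self._seq
            peer = self._peer
        # The blocking call runs with the lock released, so other threads
        # that need ha_lock are not stalled behind a slow RPC.
        acked = peer.heartbeat(seq)
        # Re-acquire the lock only to record the outcome.
        with self.ha_lock:
            if acked and self._seq == seq:
                self._seq += 1
        return acked
```

Because the lock is dropped before the call, two threads can snapshot the same sequence number and send the same heartbeat twice. In this sketch the receiver discards the second copy, which mirrors the commit's statement that redundant heartbeats are handled on the receiver side.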