[jira] [Commented] (IMPALA-12550) test_statestored_auto_failover_with_disabling_network flaky

2023-11-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784641#comment-17784641
 ] 

ASF subversion and git services commented on IMPALA-12550:
--

Commit 7403d10a55397f81784f369aa501b9c072d198b6 in impala's branch 
refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=7403d10a5 ]

IMPALA-12550: Fix flaky test 
test_statestored_auto_failover_with_disabling_network

Test test_statestored_auto_failover_with_disabling_network failed
occasionally due to delay of HA Handshake or HA heartbeat RPCs between
two statestore instances. Sometimes the active statestore took a few
minutes to respond to the handshake requests from standby statestore.

This patch fixes the issue by not holding mutex ha_lock_ when sending
HA handshake and HA heartbeat. Redundant HA heartbeats are handled
on receiver side. Redundant HA handshakes are harmless.

Testing:
 - Repeatedly ran test_statestored_auto_failover_with_disabling_network
   on Jenkins for hundreds of times without failure.
 - Repeatedly ran test_statestored_auto_failover_with_disabling_network
   on local machine for thousand times without failure.
 - Repeatedly ran all tests in test_statestored_ha.py for over 12 hours
   on Jenkins without failure.
 - Passed core tests.

Change-Id: I515bbaaddfb4bf9bd2a39414cd6e3e4590dfbfb1
Reviewed-on: http://gerrit.cloudera.org:8080/20689
Reviewed-by: Riza Suminto 
Tested-by: Impala Public Jenkins 


> test_statestored_auto_failover_with_disabling_network flaky
> ---
>
> Key: IMPALA-12550
> URL: https://issues.apache.org/jira/browse/IMPALA-12550
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.4.0
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> TestStatestoredHA.test_statestored_auto_failover_with_disabling_network 
> failed with following stack trace when repeatedly run this test.
> tests/custom_cluster/test_statestored_ha.py:645: in 
> test_statestored_auto_failover_with_disabling_network
> "statestore.in-ha-recovery-mode", expected_value=False, timeout=120)
> tests/common/impala_service.py:144: in wait_for_metric_value
> self.__metric_timeout_assert(metric_name, expected_value, timeout)
> tests/common/impala_service.py:213: in __metric_timeout_assert
> assert 0, assert_string
> E   AssertionError: Metric statestore.in-ha-recovery-mode did not reach value 
> False in 120s.
> From log messages, the issue was caused by the delay of HA Handshake RPC 
> between two statestore instances. Sometimes the active statestore took a few 
> minutes to response the handshake requests from standby statestore.
> This issue is different from IMPALA-12525, which was caused locking issue on 
> subscribers side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12550) test_statestored_auto_failover_with_disabling_network flaky

2023-11-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784268#comment-17784268
 ] 

ASF subversion and git services commented on IMPALA-12550:
--

Commit 44c85e85a51bad4faca95f8771b2c2c5a686ca90 in impala's branch 
refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=44c85e85a ]

IMPALA-12525: Fix flaky test test_statestored_manual_failover

In test_statestored_manual_failover, statestore service failover is not
triggered sometimes when the network of active statestored is disabled
after manually forced failover.

During test, the network of active statestored could be disabled before
all subscribers re-registered with restarted statestored. This caused
some subscribers to not receive the notification of active statestored
change so that they could not correctly report connection states for
the requests from standby statestored.

This patch made following changes:
1) Updated the test case test_statestored_manual_failover to disable
the network of active statestored after all subscribers re-registering
with the restarted statestored.

2) Defined a new mutex active_lock_ in class StatestoreStub to protect
is_active_ since the mutex lock_ could be held for long time if the
subscriber lose the connection with statestored and enter recovery
mode.

3) Found one case that was not handled on Statestore subscribers. The
subscribers could be started before both statestore instances are
ready to accept registration requests. This caused impalad hit DCHECK.
Changed code to handle this case in this patch.
Added test cases to inject a real delay in statestored startup and
verify impalads and catalogd are able to tolerate this delay.

4) Updated address of active catalogd in the metrics of statestored
after statestore service failover.

5) Another test test_statestored_auto_failover_with_disabling_network
failed occasionally due to delay of HA Handshake RPC between two
statestore instances. The issue is tracked with IMPALA-12550. The last
two lines of the test are commented out temporarily.

Testing:
 - Repeatedly ran test_statestored_manual_failover on Jenkins for
   hundreds of times.
 - Repeatedly ran test_statestored_manual_failover on local machine for
   thousand times without failure.
 - Passed core tests

Change-Id: If03bf09d22a2875d2c1eec8a4f62eeefc5d855dc
Reviewed-on: http://gerrit.cloudera.org:8080/20657
Reviewed-by: Riza Suminto 
Reviewed-by: Michael Smith 
Tested-by: Impala Public Jenkins 


> test_statestored_auto_failover_with_disabling_network flaky
> ---
>
> Key: IMPALA-12550
> URL: https://issues.apache.org/jira/browse/IMPALA-12550
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.4.0
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> TestStatestoredHA.test_statestored_auto_failover_with_disabling_network 
> failed with following stack trace when repeatedly run this test.
> tests/custom_cluster/test_statestored_ha.py:645: in 
> test_statestored_auto_failover_with_disabling_network
> "statestore.in-ha-recovery-mode", expected_value=False, timeout=120)
> tests/common/impala_service.py:144: in wait_for_metric_value
> self.__metric_timeout_assert(metric_name, expected_value, timeout)
> tests/common/impala_service.py:213: in __metric_timeout_assert
> assert 0, assert_string
> E   AssertionError: Metric statestore.in-ha-recovery-mode did not reach value 
> False in 120s.
> From log messages, the issue was caused by the delay of HA Handshake RPC 
> between two statestore instances. Sometimes the active statestore took a few 
> minutes to response the handshake requests from standby statestore.
> This issue is different from IMPALA-12525, which was caused locking issue on 
> subscribers side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org