[ 
https://issues.apache.org/jira/browse/IMPALA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884120#comment-17884120
 ] 

ASF subversion and git services commented on IMPALA-13388:
----------------------------------------------------------

Commit d2cd9b51a03dbd8b2e485ee446bf7530656ab214 in impala's branch 
refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d2cd9b51a ]

IMPALA-13388: fix unit-tests of Statestore HA for UBSAN builds

In UBSAN builds, the Statestore HA unit tests sometimes failed because
Thrift RPCs timed out while receiving. The standby statestored failed to
send heartbeats to its subscribers, so failover was not triggered.
The Thrift RPC failures persisted even after increasing the TCP timeout
for Thrift RPCs between statestored and its subscribers.

This patch adds a metric for the number of subscribers that received
heartbeats from statestored in a monitoring period. In UBSAN builds,
the Statestore HA unit tests are skipped if statestored failed to send
heartbeats to more than half of its subscribers.
In other builds, the tests throw an exception whose error message
reports the Thrift RPC failure if statestored failed to send heartbeats
to more than half of its subscribers.
The patch also fixes a bug where the value returned by
SecondsSinceHeartbeat() was compared with a time value in milliseconds.
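
The skip-or-fail rule described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the code from
the commit: check_heartbeat_delivery, SkipTest, and the parameter names
are hypothetical, and the real test presumably reads the new metric from
statestored and skips through the test framework.

```python
class SkipTest(Exception):
    """Hypothetical stand-in for the test framework's skip mechanism."""


def check_heartbeat_delivery(num_received, num_subscribers, is_ubsan_build):
    """Return normally if heartbeats reached enough subscribers.

    num_received is an assumed reading of the new metric: how many
    subscribers received a heartbeat during the monitoring period.
    """
    num_failed = num_subscribers - num_received
    # "More than half" is a strict comparison; multiply instead of
    # dividing so the check stays in integer arithmetic.
    if num_failed * 2 <= num_subscribers:
        return
    msg = ("statestored failed to send heartbeats to %d of %d subscribers"
           % (num_failed, num_subscribers))
    if is_ubsan_build:
        # UBSAN builds: skip the test instead of failing it.
        raise SkipTest(msg)
    # Other builds: fail with a message that points at the Thrift RPC failure.
    raise AssertionError("Thrift RPC failure: " + msg)
```

With 4 subscribers, for example, 2 missed heartbeats would still let the
test proceed, while 3 missed heartbeats would skip it in a UBSAN build
and fail it in any other build.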

Filed follow-up JIRA IMPALA-13399 to track the root cause.

Testing:
 - Ran test_statestored_ha.py 100 times in a loop in a UBSAN build
   with no failed test cases; 4 of the 100 iterations had skipped
   test cases.
 - Verified that the issue did not occur in an ASAN build by
   running test_statestored_ha.py 100 times.
 - Passed core tests.

Change-Id: Ie59d1e93c635411723f7044da52e4ab19c7d2fac
Reviewed-on: http://gerrit.cloudera.org:8080/21820
Reviewed-by: Riza Suminto <riza.sumi...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> TestStatestoredHA.test_statestored_auto_failover failed in UBSAN ARM
> --------------------------------------------------------------------
>
>                 Key: IMPALA-13388
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13388
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Test
>            Reporter: Riza Suminto
>            Assignee: Wenzhe Zhou
>            Priority: Major
>              Labels: flaky
>             Fix For: Impala 4.5.0
>
>
> TestStatestoredHA.test_statestored_auto_failover failed in UBSAN ARM 
> environment. Here is the stack trace:
> {code:java}
> Stacktrace
> custom_cluster/test_statestored_ha.py:340: in test_statestored_auto_failover
>     self.__test_statestored_auto_failover()
> custom_cluster/test_statestored_ha.py:259: in __test_statestored_auto_failover
>     "statestore.active-status", expected_value=True, timeout=120)
> common/impala_service.py:144: in wait_for_metric_value
>     self.__metric_timeout_assert(metric_name, expected_value, timeout)
> common/impala_service.py:213: in __metric_timeout_assert
>     assert 0, assert_string
> E   AssertionError: Metric statestore.active-status did not reach value True in 120s.
> E   Dumping debug webpages in JSON format...
> E   Dumped memz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240917_00:07:51/json/memz.json
> E   Dumped metrics JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240917_00:07:51/json/metrics.json
> E   Dumped queries JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240917_00:07:51/json/queries.json
> E   Dumped sessions JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240917_00:07:51/json/sessions.json
> E   Dumped threadz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240917_00:07:51/json/threadz.json
> E   Dumped rpcz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240917_00:07:51/json/rpcz.json
> E   Dumping minidumps for impalads/catalogds...
> E   Dumped minidump for Impalad PID 3729004
> E   Dumped minidump for Impalad PID 3729007
> E   Dumped minidump for Impalad PID 3729011
> E   Dumped minidump for Catalogd PID 3728915
> {code}
> Maybe a 120-second timeout is not enough in the UBSAN ARM environment?


