Siddharth Teotia created HELIX-818:
--------------------------------------
Summary: State transition callbacks for online -> offline and
offline -> dropped are sometimes not received
Key: HELIX-818
URL: https://issues.apache.org/jira/browse/HELIX-818
Project: Apache Helix
Issue Type: Bug
Reporter: Siddharth Teotia
As part of a cluster integration test in Pinot, we have seen that state
transition callbacks are sometimes not received. Each unit test [here
|[https://github.com/apache/incubator-pinot/pull/4498/commits/75c0d7eb76f38fd60497876eb7aa501ae048b05c#diff-30ee437b5c9317721c0d35de40a4f36dR456]]
rebalances tables and moves segments between servers.
After the test finishes rebalancing (which also means that the external view
has converged to the new ideal state, because we ensure it), we check the stats
for the ONLINE to OFFLINE and OFFLINE to DROPPED state transitions. The
expectation is that if a segment lost a server as part of the rebalance, that
server should have received these 2 transitions. The test registers a custom
state model factory with Helix for each fake server it creates.
For these 2 state transitions, the factory's callback methods increment the
stats, and that is what the tests check.
Earlier, when these tests were failing intermittently, the cause was possibly
that the stat variables were not volatile. The PR linked above attempts to
re-enable the tests by changing the stats to atomic integers, since they are
incremented by the Helix code that invokes the callbacks.
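To illustrate the kind of change the PR makes, here is a minimal sketch of transition counters backed by AtomicInteger. It is not the actual Pinot test code: the class name TransitionStats and its method names are hypothetical, and the thread pool only stands in for whatever threads Helix uses to invoke the callbacks. Note that volatile alone would not be enough for a counter, because an increment on a volatile int is a separate read and write and concurrent increments can be lost.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stats holder; the names are illustrative, not Pinot or Helix API.
final class TransitionStats {
    private final AtomicInteger onlineToOffline = new AtomicInteger();
    private final AtomicInteger offlineToDropped = new AtomicInteger();

    // Would be called from the ONLINE -> OFFLINE callback.
    void recordOnlineToOffline() {
        onlineToOffline.incrementAndGet();
    }

    // Would be called from the OFFLINE -> DROPPED callback.
    void recordOfflineToDropped() {
        offlineToDropped.incrementAndGet();
    }

    int onlineToOfflineCount() {
        return onlineToOffline.get();
    }

    int offlineToDroppedCount() {
        return offlineToDropped.get();
    }

    public static void main(String[] args) throws InterruptedException {
        TransitionStats stats = new TransitionStats();
        // Assumption: callbacks arrive on several threads; a pool simulates that.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        int transitions = 1000;
        for (int i = 0; i < transitions; i++) {
            pool.submit(() -> {
                stats.recordOnlineToOffline();
                stats.recordOfflineToDropped();
            });
        }
        pool.shutdown();
        // Wait for all simulated callbacks to finish before reading the counters.
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("simulated callbacks did not finish");
        }
        System.out.println("ONLINE->OFFLINE: " + stats.onlineToOfflineCount());
        System.out.println("OFFLINE->DROPPED: " + stats.offlineToDroppedCount());
    }
}
```

With counters like these, a missing increment cannot be explained by a lost update or a stale read in the test thread, which is why the remaining failures point elsewhere.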
Even after this change, I have occasionally seen a test fail at either of the
2 state transitions. This happens both in Travis builds and sometimes when
running the test locally in an IDE.
An example failure is [here
|[https://travis-ci.org/apache/incubator-pinot/jobs/569442912]]
I am wondering whether there is a bug that sometimes prevents the state
transition callbacks from being invoked. That raises the question of how the
external view gets updated as expected, since our tests check for that too (the
server that lost a segment as part of rebalancing is no longer present in the
host-state mapping of that segment in the external view). If the callback
invocations are sometimes missed, how can the current state and subsequently
the external view still be updated correctly?
Thanks for the help.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)