Siddharth Teotia created HELIX-818:
--------------------------------------

             Summary: State transition callbacks for online -> offline and 
offline -> dropped are sometimes not received
                 Key: HELIX-818
                 URL: https://issues.apache.org/jira/browse/HELIX-818
             Project: Apache Helix
          Issue Type: Bug
            Reporter: Siddharth Teotia


As part of the cluster integration tests in Pinot, we have seen that state 
transition callbacks are sometimes not received. Each unit test 
[here|https://github.com/apache/incubator-pinot/pull/4498/commits/75c0d7eb76f38fd60497876eb7aa501ae048b05c#diff-30ee437b5c9317721c0d35de40a4f36dR456]
rebalances tables and moves segments between servers. 

After the test finishes rebalancing (which also means that the external view 
has converged to the new ideal state, because the test ensures it), we check 
the stats for state transitions from ONLINE to OFFLINE and from OFFLINE to 
DROPPED. The expectation is that if a segment lost a server as part of the 
rebalance, that server should have received both transitions for the segment. 
The test registers a custom state model factory with Helix for each fake 
server it creates. 

For the above 2 state transitions, the factory methods bump stats and that's 
what we check for in tests. 
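
To make the check concrete, here is a minimal sketch of the counting logic. It 
is a plain-Java stand-in, not the Helix or Pinot API: the class names 
TransitionStats and CountingStateModel, the String segment argument, and the 
matches() helper are all hypothetical. In the real test the two callback 
methods would be invoked by Helix rather than called directly.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for the two counters the test inspects.
final class TransitionStats {
    final AtomicInteger onlineToOffline = new AtomicInteger();
    final AtomicInteger offlineToDropped = new AtomicInteger();

    // True when both transitions were observed exactly `replicasRemoved` times,
    // i.e. once per (segment, server) pair that the rebalance took away.
    boolean matches(int replicasRemoved) {
        return onlineToOffline.get() == replicasRemoved
            && offlineToDropped.get() == replicasRemoved;
    }
}

// Hypothetical stand-in for the state model created by the test's factory.
final class CountingStateModel {
    private final TransitionStats stats;

    CountingStateModel(TransitionStats stats) {
        this.stats = stats;
    }

    // Would be called by the framework for ONLINE -> OFFLINE.
    void onBecomeOfflineFromOnline(String segmentName) {
        stats.onlineToOffline.incrementAndGet();
    }

    // Would be called by the framework for OFFLINE -> DROPPED.
    void onBecomeDroppedFromOffline(String segmentName) {
        stats.offlineToDropped.incrementAndGet();
    }
}
```

With this shape, a rebalance that removes one replica of one segment should 
leave stats.matches(1) true once both callbacks have run.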

Earlier, when these tests were failing intermittently, the cause was possibly 
that the stat variables were not volatile. The PR linked above attempts to 
re-enable these tests by changing the stats to atomic ints, since they are 
bumped by the Helix code that invokes the callbacks.
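
For context on why volatile alone may not be enough for such counters, here 
is a small standalone sketch (the class name CounterDemo and the thread counts 
are made up for illustration and are not taken from the PR). A volatile int 
makes writes visible across threads, but ++ is still a separate read and 
write, so concurrent increments can be lost; AtomicInteger.incrementAndGet() 
performs the update atomically.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CounterDemo {
    // Visible across threads, but ++ on it is a non-atomic read-modify-write.
    volatile int volatileCount = 0;
    // Increments on this counter are atomic.
    final AtomicInteger atomicCount = new AtomicInteger();

    // Starts `threads` workers that each bump both counters `perThread` times,
    // then waits for all of them to finish.
    void runIncrements(int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    volatileCount++;               // updates can be lost under contention
                    atomicCount.incrementAndGet(); // never loses an update
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        CounterDemo demo = new CounterDemo();
        demo.runIncrements(4, 100_000);
        // The volatile total may come out below 400000; the atomic total will not.
        System.out.println("volatile: " + demo.volatileCount);
        System.out.println("atomic:   " + demo.atomicCount.get());
    }
}
```

This only rules out lost increments in the counters themselves; it says 
nothing about whether the callbacks were invoked in the first place.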

Even after this change, I still occasionally see a test fail at either of the 
2 state transitions, for reasons that are unclear. This happens both in 
Travis builds and, sometimes, when running the test locally in the IDE.

An example failure is 
[here|https://travis-ci.org/apache/incubator-pinot/jobs/569442912].

I am wondering whether there is a bug that sometimes prevents the state 
transition callbacks from being invoked. This raises the question of how the 
external view is getting updated as expected, since our tests check for that 
too (a server that lost a segment as part of rebalancing is no longer present 
in the host-state mapping of that segment in the external view). If the 
callback invocations are sometimes missed, how can the current state, and 
subsequently the external view, still be updated correctly?

Thanks for the help.

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
