[ 
https://issues.apache.org/jira/browse/CASSANDRA-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080606#comment-17080606
 ] 

Gianluca Righetto edited comment on CASSANDRA-15551 at 4/10/20, 4:11 PM:
-------------------------------------------------------------------------

The issue is that once MoveTest's @Before method executes 
{{StorageService.instance.getTokenMetadata().clearUnsafe()}}, the 
{{GossipStage}} thread kicks in and starts evicting the stale endpoints from 
membership. This eviction may run in parallel with a test method that is 
already executing.

To reproduce this in an IDE, you can set breakpoints at:

[https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/test/unit/org/apache/cassandra/Util.java#L222]

and

[https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/src/java/org/apache/cassandra/gms/Gossiper.java#L524]

If the main thread starts executing the second iteration of the loop in 
{{createInitialRing}} while the {{GossipStage}} thread is removing the 
endpoints in {{evictFromMembership}}, the main thread will later throw an NPE.

The fix I submitted basically makes the main thread wait for all endpoints to 
be evicted in between tests, such that the next test starts in a clean state.
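
To illustrate the idea of waiting for asynchronous eviction to finish, here is a minimal, self-contained sketch. It is not the code from the pull request: the class name {{EvictionWaitSketch}}, the {{awaitCondition}} helper, and the map standing in for gossip membership are all hypothetical, and the background thread only imitates what the {{GossipStage}} thread does.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class EvictionWaitSketch
{
    /**
     * Polls until the condition holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interruption.
     * (Hypothetical helper, not part of Cassandra.)
     */
    public static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis)
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean())
        {
            if (System.nanoTime() - deadline >= 0)
                return false;
            try
            {
                Thread.sleep(10);
            }
            catch (InterruptedException e)
            {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        // Stand-in for gossip membership state; not Cassandra's real data structure.
        Map<String, String> membership = new ConcurrentHashMap<>();
        membership.put("127.0.0.1", "NORMAL");
        membership.put("127.0.0.2", "NORMAL");

        // Stand-in for the GossipStage thread evicting stale endpoints asynchronously.
        Thread evictor = new Thread(() -> {
            for (String endpoint : membership.keySet())
            {
                try
                {
                    Thread.sleep(50);
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                    return;
                }
                membership.remove(endpoint);
            }
        });
        evictor.start();

        // The "main thread" blocks here until every endpoint is gone,
        // so the next test would start from a clean state.
        boolean clean = awaitCondition(membership::isEmpty, 5000);
        System.out.println("membership empty before next test: " + clean);
    }
}
```

A bounded timeout is used so that a test run fails with a clear signal instead of hanging if eviction never completes.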

Pull request: [https://github.com/apache/cassandra/pull/533]
 Java 11 Unit Tests results: [https://circleci.com/gh/grighetto/cassandra/68]
 Java 8 Unit Tests results: [https://circleci.com/gh/grighetto/cassandra/65]


> Fix flaky tests org.apache.cassandra.service.MoveTest testStateJumpToNormal 
> and testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15551
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15551
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: David Capwell
>            Assignee: Gianluca Righetto
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0-alpha
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> testStateJumpToNormal failure was on java 11
> {code}
> java.lang.NullPointerException
>       at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1028)
>       at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1023)
>       at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2513)
>       at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2055)
>       at org.apache.cassandra.Util.createInitialRing(Util.java:225)
>       at org.apache.cassandra.service.MoveTest.testStateJumpToNormal(MoveTest.java:935)
>       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes failure was on java 8
> {code}
> java.lang.NullPointerException
>       at org.apache.cassandra.service.StorageService.updatePeerInfo(StorageService.java:2174)
>       at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2511)
>       at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2055)
>       at org.apache.cassandra.Util.createInitialRing(Util.java:225)
>       at org.apache.cassandra.service.MoveTest.testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes(MoveTest.java:199)
> {code}


