[ https://issues.apache.org/jira/browse/CASSANDRA-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080606#comment-17080606 ]
Gianluca Righetto edited comment on CASSANDRA-15551 at 4/10/20, 4:11 PM:
-------------------------------------------------------------------------

The issue here is that once this line is executed in MoveTest's @Before method, {{StorageService.instance.getTokenMetadata().clearUnsafe()}}, the {{GossipStage}} thread kicks in and starts evicting the stale endpoints from membership, which may happen in parallel while another test method is already running.

To reproduce this in an IDE, you can set breakpoints at:
[https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/test/unit/org/apache/cassandra/Util.java#L222]
and
[https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/src/java/org/apache/cassandra/gms/Gossiper.java#L524]

If the main thread starts executing the second iteration of the loop in {{createInitialRing}} while the GossipStage thread is removing the endpoints in {{evictFromMembership}}, it will throw an NPE down the road.

The fix I submitted basically makes the main thread wait for all endpoints to be evicted in between tests, such that the next test starts in a clean state.

Pull request: [https://github.com/apache/cassandra/pull/533]
Java 11 Unit Tests results: [https://circleci.com/gh/grighetto/cassandra/68]
Java 8 Unit Tests results: [https://circleci.com/gh/grighetto/cassandra/65]


was (Author: gianluca):
The issue here is that once this line is executed in the @Before setup method, {{StorageService.instance.getTokenMetadata().clearUnsafe()}}, the {{GossipStage}} thread kicks in and starts evicting the stale endpoints from membership, which may happen in parallel while another test method is already running.

To reproduce this in an IDE, you can set breakpoints at:
https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/test/unit/org/apache/cassandra/Util.java#L222
and
https://github.com/apache/cassandra/blob/1ce3c1c039561c15892115af37e0c7abf260bc6b/src/java/org/apache/cassandra/gms/Gossiper.java#L524

If the main thread starts executing the second iteration of the loop in {{createInitialRing}} while the GossipStage thread is removing the endpoints in {{evictFromMembership}}, it will throw an NPE down the road.

The fix I submitted basically makes the main thread wait for all endpoints to be evicted in between tests, such that the next test starts in a clean state.

Pull request: https://github.com/apache/cassandra/pull/533
Java 11 Unit Tests results: https://circleci.com/gh/grighetto/cassandra/68
Java 8 Unit Tests results: https://circleci.com/gh/grighetto/cassandra/65


> Fix flaky tests org.apache.cassandra.service.MoveTest testStateJumpToNormal and testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15551
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15551
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: David Capwell
>            Assignee: Gianluca Righetto
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0-alpha
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> testStateJumpToNormal failure was on java 11
> {code}
> java.lang.NullPointerException
>       at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1028)
>       at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1023)
>       at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2513)
>       at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2055)
>       at org.apache.cassandra.Util.createInitialRing(Util.java:225)
>       at org.apache.cassandra.service.MoveTest.testStateJumpToNormal(MoveTest.java:935)
>       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes failure was on java 8
> {code}
> java.lang.NullPointerException
>       at org.apache.cassandra.service.StorageService.updatePeerInfo(StorageService.java:2174)
>       at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2511)
>       at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2055)
>       at org.apache.cassandra.Util.createInitialRing(Util.java:225)
>       at org.apache.cassandra.service.MoveTest.testMoveWithPendingRangesNetworkStrategyRackAwareThirtyNodes(MoveTest.java:199)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
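
The fix described in the comment (having the main thread wait until every endpoint has been evicted before the next test starts) can be illustrated in isolation. What follows is a minimal sketch and not the code from the pull request: the class {{EvictionWaitSketch}}, its method names, the poll interval, and the use of a {{ConcurrentMap}} as a stand-in for gossip membership are all hypothetical, and no Cassandra API is called.

```java
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names exist in Cassandra.
final class EvictionWaitSketch
{
    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the condition was observed, false on timeout or interrupt.
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis, long pollMillis)
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean())
        {
            if (System.nanoTime() - deadline >= 0)
                return false;
            try
            {
                Thread.sleep(pollMillis);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }

    // Simulates asynchronous eviction on a single background thread (a stand-in
    // for the GossipStage thread) while the calling thread waits for it to finish.
    static boolean clearAndAwaitEviction(ConcurrentMap<String, String> membership, long timeoutMillis)
    {
        ExecutorService backgroundStage = Executors.newSingleThreadExecutor();
        try
        {
            // Endpoints are removed one by one, so the map is briefly in a
            // partially cleared state that a concurrent reader could observe.
            backgroundStage.execute(() -> {
                for (String endpoint : membership.keySet())
                    membership.remove(endpoint);
            });
            // The caller does not proceed until the map is observed to be empty.
            return awaitCondition(membership::isEmpty, timeoutMillis, 10);
        }
        finally
        {
            backgroundStage.shutdownNow();
        }
    }
}
```

In a test fixture, a helper like this would be called from the setup method right after the state is cleared, and the setup would fail if it returned false. The bounded timeout is a deliberate choice in this sketch: if eviction never completes, the test fails quickly instead of hanging.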