[jira] [Comment Edited] (CASSANDRA-13407) test failure at RemoveTest.testBadHostId

2017-04-10 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960470#comment-15960470
 ] 

Alex Petrov edited comment on CASSANDRA-13407 at 4/10/17 7:55 AM:
--

Looks like I was able to gather a bit more information on the issue. To confirm 
what you're saying: it is possible to reproduce locally by tweaking timeouts 
(particularly making the gossip interval shorter, to emulate the slow VM). 

{code}
INFO  [GossipTasks:1] 2017-04-03 23:05:53,433 Gossiper.java:810 - FatClient /127.0.0.4 has been silent for 1000ms, removing from gossip
DEBUG [GossipTasks:1] 2017-04-03 23:05:53,436 Gossiper.java:432 - removing endpoint /127.0.0.4
DEBUG [GossipTasks:1] 2017-04-03 23:05:53,436 Gossiper.java:407 - evicting /127.0.0.4 from gossip
{code}

After that we can get an NPE either in {{Gossiper#getHostId}} or 
{{StorageService#isStatus}}. 
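
As an illustration only, here is a minimal sketch of how an eviction like the one logged above can turn a later lookup into an NPE. It assumes a simplified model: {{endpointStateMap}}, {{State}}, {{getHostIdUnsafe}} and {{getHostIdOrNull}} are hypothetical names, not Cassandra's actual classes or methods, and this is not the change made in the patch.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class EvictedEndpointSketch {
    // Hypothetical stand-in for per-endpoint gossip state; not Cassandra's EndpointState.
    static final class State {
        final UUID hostId;

        State(UUID hostId) {
            this.hostId = hostId;
        }
    }

    // Hypothetical map from endpoint address to its state.
    static final Map<String, State> endpointStateMap = new ConcurrentHashMap<>();

    // Dereferences the lookup result directly, so it throws an NPE
    // if the endpoint was evicted before the call.
    static UUID getHostIdUnsafe(String endpoint) {
        return endpointStateMap.get(endpoint).hostId;
    }

    // Reads the state once and checks it before use.
    static UUID getHostIdOrNull(String endpoint) {
        State state = endpointStateMap.get(endpoint);
        return state == null ? null : state.hostId;
    }

    public static void main(String[] args) {
        String endpoint = "/127.0.0.4";
        endpointStateMap.put(endpoint, new State(UUID.randomUUID()));

        // Simulates the gossip task evicting the endpoint in between.
        endpointStateMap.remove(endpoint);

        System.out.println(getHostIdOrNull(endpoint)); // prints "null"
        try {
            getHostIdUnsafe(endpoint);
        } catch (NullPointerException e) {
            System.out.println("NPE after eviction");
        }
    }
}
```

In the test itself the eviction happens on a separate gossip thread, so whether the NPE shows up depends on timing; the sketch only makes the ordering explicit.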

The patch for 2.0 and 3.0 is slightly different because, if we do not initialise 
the schema, we get the following error: 

{code}
[junit] junit.framework.AssertionFailedError: []
[junit] at org.apache.cassandra.db.lifecycle.Tracker.getMemtableFor(Tracker.java:312)
[junit] at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1185)
[junit] at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573)
[junit] at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421)
[junit] at org.apache.cassandra.db.Mutation.apply(Mutation.java:210)
[junit] at org.apache.cassandra.db.Mutation.apply(Mutation.java:215)
[junit] at org.apache.cassandra.db.Mutation.apply(Mutation.java:224)
[junit] at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:566)
[junit] at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:556)
[junit] at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:295)
[junit] at org.apache.cassandra.db.SystemKeyspace.updatePeerInfo(SystemKeyspace.java:712)
[junit] at org.apache.cassandra.service.StorageService.updatePeerInfo(StorageService.java:1801)
[junit] at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2014)
[junit] at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1669)
[junit] at org.apache.cassandra.Util.createInitialRing(Util.java:213)
[junit] at org.apache.cassandra.service.RemoveTest.setup(RemoveTest.java:77)
{code}

|[2.2|https://github.com/apache/cassandra/compare/2.2...ifesdjeen:13407-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-2.2-testall/]|
|[3.0|https://github.com/apache/cassandra/compare/3.0...ifesdjeen:13407-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-3.0-testall/]|
|[3.11|https://github.com/apache/cassandra/compare/3.11...ifesdjeen:13407-3.11]|[testall|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-3.11-testall/]|
|[trunk|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:13407-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-trunk-testall/]|



[jira] [Comment Edited] (CASSANDRA-13407) test failure at RemoveTest.testBadHostId

2017-04-07 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960470#comment-15960470
 ] 

Alex Petrov edited comment on CASSANDRA-13407 at 4/7/17 8:45 AM:
-

Looks like I was able to gather a bit more information on the issue. To confirm 
what you're saying: it is possible to reproduce locally by tweaking timeouts 
(particularly making the gossip interval shorter, to emulate the slow VM). 

{code}
INFO  [GossipTasks:1] 2017-04-03 23:05:53,433 Gossiper.java:810 - FatClient /127.0.0.4 has been silent for 1000ms, removing from gossip
DEBUG [GossipTasks:1] 2017-04-03 23:05:53,436 Gossiper.java:432 - removing endpoint /127.0.0.4
DEBUG [GossipTasks:1] 2017-04-03 23:05:53,436 Gossiper.java:407 - evicting /127.0.0.4 from gossip
{code}

After that we can get an NPE either in {{Gossiper#getHostId}} or 
{{StorageService#isStatus}}. 

The patch for 2.0 and 3.0 is slightly different because, if we do not initialise 
the schema, we get the following error: 

{code}
[junit] junit.framework.AssertionFailedError: []
[junit] at org.apache.cassandra.db.lifecycle.Tracker.getMemtableFor(Tracker.java:312)
[junit] at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1185)
[junit] at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573)
[junit] at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421)
[junit] at org.apache.cassandra.db.Mutation.apply(Mutation.java:210)
[junit] at org.apache.cassandra.db.Mutation.apply(Mutation.java:215)
[junit] at org.apache.cassandra.db.Mutation.apply(Mutation.java:224)
[junit] at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:566)
[junit] at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:556)
[junit] at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:295)
[junit] at org.apache.cassandra.db.SystemKeyspace.updatePeerInfo(SystemKeyspace.java:712)
[junit] at org.apache.cassandra.service.StorageService.updatePeerInfo(StorageService.java:1801)
[junit] at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2014)
[junit] at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1669)
[junit] at org.apache.cassandra.Util.createInitialRing(Util.java:213)
[junit] at org.apache.cassandra.service.RemoveTest.setup(RemoveTest.java:77)
{code}

|[2.2|https://github.com/apache/cassandra/compare/2.2...ifesdjeen:13407-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-2.2-testall/]|[dtest|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-2.2-dtest/]|
|[3.0|https://github.com/apache/cassandra/compare/3.0...ifesdjeen:13407-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-3.0-testall/]|[dtest|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-3.0-dtest/]|
|[3.11|https://github.com/apache/cassandra/compare/3.11...ifesdjeen:13407-3.11]|[testall|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-3.11-testall/]|[dtest|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-3.11-dtest/]|
|[trunk|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:13407-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-trunk-testall/]|[dtest|http://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-13407-trunk-dtest/]|




[jira] [Comment Edited] (CASSANDRA-13407) test failure at RemoveTest.testBadHostId

2017-04-06 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959301#comment-15959301
 ] 

Joel Knighton edited comment on CASSANDRA-13407 at 4/6/17 5:02 PM:
---

For posterity, this is the race possible when the Gossiper is started, as far 
as I can tell.

In setup, we initialize a fake ring using Util.createInitialRing. This will 
initialize the nodes in an unsafe manner and then inject the token states. If a 
status check runs before the token state is set, the previously decommissioned 
node will look like a fat client, since it won't have tokens and will not have 
a DEAD_STATE. Since we aren't gossiping, we won't have heard from it for longer 
than fatClientTimeout, so we'll remove it. If this races with the ss.onChange 
in createInitialRing, we can remove the endpoint state while processing it, 
which will cause an NPE as above.
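
The conditions described above can be sketched as a small predicate. This is a simplified model under stated assumptions, not the Gossiper's actual status check: {{EndpointView}}, {{looksLikeFatClient}} and {{shouldEvict}} are hypothetical names, and the timeout and timestamps are illustrative values.

```java
import java.util.Collections;
import java.util.Set;

public class FatClientCheckSketch {
    // Hypothetical, simplified view of what a status check knows about an endpoint.
    static final class EndpointView {
        final Set<String> tokens;
        final boolean inDeadState;
        final long lastHeardMillis;

        EndpointView(Set<String> tokens, boolean inDeadState, long lastHeardMillis) {
            this.tokens = tokens;
            this.inDeadState = inDeadState;
            this.lastHeardMillis = lastHeardMillis;
        }
    }

    // A node with no tokens and no dead state is indistinguishable from a fat client.
    static boolean looksLikeFatClient(EndpointView view) {
        return view.tokens.isEmpty() && !view.inDeadState;
    }

    // A fat client that has been silent for longer than the timeout gets removed.
    static boolean shouldEvict(EndpointView view, long nowMillis, long fatClientTimeoutMillis) {
        long silentForMillis = nowMillis - view.lastHeardMillis;
        return looksLikeFatClient(view) && silentForMillis > fatClientTimeoutMillis;
    }

    public static void main(String[] args) {
        long fatClientTimeoutMillis = 1000L; // illustrative value
        long nowMillis = 5000L;

        // Before the token state is injected: no tokens, not dead, never heard from.
        EndpointView beforeTokens =
            new EndpointView(Collections.<String>emptySet(), false, 0L);
        // After the token state is injected, the same node no longer qualifies.
        EndpointView afterTokens =
            new EndpointView(Collections.singleton("token-1"), false, 0L);

        System.out.println(shouldEvict(beforeTokens, nowMillis, fatClientTimeoutMillis)); // true
        System.out.println(shouldEvict(afterTokens, nowMillis, fatClientTimeoutMillis));  // false
    }
}
```

The window between the two states is where a status check on another thread can evict the endpoint while setup is still processing it.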

We also need to remove SchemaLoader.loadSchema() as you did in the patch - this 
is because it starts the Gossiper as well. This is fine; we don't appear to 
need it.

The patch looks good - the race exists in theory on 2.1/2.2, but it appears to 
only manifest on 3.0+. I don't think it is worth committing to 2.1 for that 
reason - let's do 2.2+ forward and run the test at least once on each branch 
before committing.




was (Author: jkni):
For posterity, this is the race possible when the Gossiper is started, as far 
as I can tell.

In setup, we initialize a fake ring using Util.createInitialRing. This will 
initialize the nodes in an unsafe manner and then inject the token states. If a 
status check runs before the token state is set, the previously decommissioned 
node will look like a fat client, since it won't have tokens and will not have 
a DEAD_STATE. Since we aren't gossiping, we won't have heard from it for longer 
than fatClientTimeout, so we'll remove it. If this races with the ss.onChange 
in createInitialRing, we can remove the endpoint state while processing it, 
which will cause an NPE as above. This race can be seen at 16:15:51,205 in the 
log linked from the test failure.

We also need to remove SchemaLoader.loadSchema() as you did in the patch - this 
is because it starts the Gossiper as well. This is fine; we don't appear to 
need it.

The patch looks good - the race exists in theory on 2.1/2.2, but it appears to 
only manifest on 3.0+. I don't think it is worth committing to 2.1 for that 
reason - let's do 2.2+ forward and run the test at least once on each branch 
before committing.



> test failure at RemoveTest.testBadHostId
>
> Key: CASSANDRA-13407
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13407
> Project: Cassandra
> Issue Type: Bug
> Reporter: Alex Petrov
> Assignee: Alex Petrov
>
> Example trace:
> {code}
> java.lang.NullPointerException
>   at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:881)
>   at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:876)
>   at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2201)
>   at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1855)
>   at org.apache.cassandra.Util.createInitialRing(Util.java:216)
>   at org.apache.cassandra.service.RemoveTest.setup(RemoveTest.java:89)
> {code}
> [failure example|https://cassci.datastax.com/job/trunk_testall/1491/testReport/org.apache.cassandra.service/RemoveTest/testBadHostId/]
> [history|https://cassci.datastax.com/job/trunk_testall/lastCompletedBuild/testReport/org.apache.cassandra.service/RemoveTest/testBadHostId/history/]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)