[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2019-01-14 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742593#comment-16742593
 ] 

Ariel Weisberg commented on CASSANDRA-14155:


Pinging on this again. 

> [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with 
> dtests at 
> org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)
> 
>
> Key: CASSANDRA-14155
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14155
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Startup and Shutdown, Test/dtest
>Reporter: Michael Kjellman
>Assignee: Jason Brown
>Priority: Major
>  Labels: dtest
>
> Gossiper is somewhat frequently hitting an NPE on node startup with dtests at 
> org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)
> {code}
> test teardown failure
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [main] 2018-01-08 21:41:01,832 CassandraDaemon.java:675 - Exception encountered during startup
> java.lang.NullPointerException: null
> at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769) ~[main/:na]
> at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:511) ~[main/:na]
> at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:761) ~[main/:na]
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:621) ~[main/:na]
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:568) ~[main/:na]
> at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:360) [main/:na]
> at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [main/:na]
> at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:658) [main/:na], ERROR [main] 2018-01-08 21:41:01,832 CassandraDaemon.java:675 - Exception encountered during startup
> java.lang.NullPointerException: null
> at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769) ~[main/:na]
> at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:511) ~[main/:na]
> at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:761) ~[main/:na]
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:621) ~[main/:na]
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:568) ~[main/:na]
> at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:360) [main/:na]
> at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [main/:na]
> at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:658) [main/:na]]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-12-31 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731420#comment-16731420
 ] 

Ariel Weisberg commented on CASSANDRA-14155:


[~jasobrown][~beobal]

It seems to me that we should be able to ignore, without erroring out, any 
message received during the shadow round that doesn't contain the information 
we are looking for.

The state we are trying to enter already requires us to wait for specific 
information to be present. If we never receive it, we end up on the same path 
we would take anyway.
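
To make the suggestion above concrete, here is a minimal sketch of a shadow 
round that skips acks carrying no application state instead of failing on 
them. Everything in it is hypothetical: {{ShadowRoundSketch}}, {{onAck}} and 
the string-keyed maps are illustration only and are not Cassandra's Gossiper 
API, and "useful" is assumed to mean "at least one endpoint has a non-empty 
application state map".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a shadow round; not Cassandra's Gossiper.
final class ShadowRoundSketch {
    // endpoint -> (application state name -> value), e.g. "HOST_ID" -> "<uuid>"
    private final Map<String, Map<String, String>> collected = new HashMap<>();
    private boolean finished = false;

    /**
     * Handles an ack that arrives while the shadow round is running.
     * Returns true if the ack was accepted and ended the round.
     */
    boolean onAck(Map<String, Map<String, String>> epStates) {
        if (finished) {
            return false; // round already over; nothing to do
        }
        // Assumption: an ack is useful if at least one endpoint carries application state.
        boolean useful = epStates.values().stream().anyMatch(app -> !app.isEmpty());
        if (!useful) {
            return false; // heartbeat-only ack: ignore it and keep waiting, no error
        }
        collected.putAll(epStates);
        finished = true;
        return true;
    }

    boolean isFinished() {
        return finished;
    }

    Map<String, Map<String, String>> collected() {
        return collected;
    }

    public static void main(String[] args) {
        ShadowRoundSketch round = new ShadowRoundSketch();

        // An ack like the ones in the failing runs: heartbeat only, AppStateMap = {}
        Map<String, Map<String, String>> heartbeatOnly = new HashMap<>();
        heartbeatOnly.put("127.0.0.2:7000", new HashMap<>());
        System.out.println("heartbeat-only ack accepted: " + round.onAck(heartbeatOnly));

        // An ack that does carry application state ends the round
        Map<String, String> app = new HashMap<>();
        app.put("HOST_ID", "b14298c2-4322-49de-9747-407dc5f16bb6");
        Map<String, Map<String, String>> full = new HashMap<>();
        full.put("127.0.0.2:7000", app);
        System.out.println("full ack accepted: " + round.onAck(full));
    }
}
```

With this shape, an ack that never carries the needed state simply leaves the 
round unfinished, so the caller falls through to whatever it already does when 
no usable reply arrives.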




[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-12-21 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16727108#comment-16727108
 ] 

Ariel Weisberg commented on CASSANDRA-14155:


I haven't quite nailed it down, but basically nodes are sending gossip syn acks 
to nodes that have just started, and those acks only contain information on the 
node sending the ack. They are sent in response to a previously received gossip 
syn, as I can see the message being created in the verb handler.

My guess is that, since gossip doesn't do request/response correlation, we are 
getting an ack for some other request, perhaps one sent before the node was 
upgraded, and that response looks like an ack to the shadow round.
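
A toy model of that guess, purely for illustration: without a request id on the 
ack, a receiver in its shadow round cannot tell a late reply to an older syn 
from the reply it is waiting for. All names here ({{UncorrelatedReceiver}}, 
{{CorrelatedReceiver}}, the {{requestId}} parameter) are hypothetical and are 
not part of Cassandra's messaging code; the correlated variant only shows what 
correlation would buy and is not a proposed fix.

```java
// Hypothetical sketch; neither class exists in Cassandra.
final class AckCorrelationSketch {

    /** Models gossip as described above: acks carry nothing that ties them to a syn. */
    static final class UncorrelatedReceiver {
        private boolean inShadowRound = false;

        void startShadowRound() {
            inShadowRound = true;
        }

        /** Returns true if this ack was taken as the shadow round reply. */
        boolean onAck(String from) {
            if (!inShadowRound) {
                return false;
            }
            // Any ack arriving now ends the round, even a late reply to an older syn.
            inShadowRound = false;
            return true;
        }
    }

    /** Same receiver with an assumed request id echoed back in the ack. */
    static final class CorrelatedReceiver {
        private long nextId = 1;
        private long pending = -1; // -1 means no shadow round in flight

        long startShadowRound() {
            pending = nextId++;
            return pending;
        }

        boolean onAck(String from, long requestId) {
            if (pending == -1 || requestId != pending) {
                return false; // not the reply we are waiting for
            }
            pending = -1;
            return true;
        }
    }

    public static void main(String[] args) {
        UncorrelatedReceiver plain = new UncorrelatedReceiver();
        plain.startShadowRound();
        System.out.println("stale ack ends round (no correlation): " + plain.onAck("127.0.0.1:7000"));

        CorrelatedReceiver tagged = new CorrelatedReceiver();
        long id = tagged.startShadowRound();
        System.out.println("stale ack ends round (correlated): " + tagged.onAck("127.0.0.1:7000", id - 1));
        System.out.println("matching ack ends round (correlated): " + tagged.onAck("127.0.0.1:7000", id));
    }
}
```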

I see stuff like
{noformat}
INFO  [GossipStage:1] 2018-12-21 16:30:41,202 GossipDigestSynVerbHandler.java:104 - sending [] digests and {127.0.0.2:7000=EndpointState: HeartBeatState = HeartBeat: generation = 1545427529, version = 2147483647, AppStateMap = {}} deltas
java.lang.Throwable: null
at org.apache.cassandra.gms.GossipDigestSynVerbHandler.doVerb(GossipDigestSynVerbHandler.java:104)
at org.apache.cassandra.net.MessageDeliveryTask.process(MessageDeliveryTask.java:92)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:54)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
DEBUG [GossipStage:1] 2018-12-21 16:30:41,202 OutboundMessagingConnection.java:257 - connection attempt 4 to 127.0.0.2:7000 (GOSSIP)
DEBUG [GossipStage:1] 2018-12-21 16:30:41,202 NettyFactory.java:326 - creating outbound bootstrap to peer 127.0.0.2:7000, compression: false, encryption: enabled (jdk), coalesce: DISABLED, protocolVersion: 11
{noformat}

and then later 
{noformat}
DEBUG [GossipStage:1] 2018-12-21 16:30:59,320 Gossiper.java:1390 - Shadow 
request received, adding all states, {127.0.0.3:7000=EndpointState: 
HeartBeatState = HeartBeat: generation = 1545427545, version = 360, AppStateMap 
= {STATUS=Value(NORMAL,3074457345618258602,18), LOAD=Value(1.1576835E7,349), 
SCHEMA=Value(e8a984be-18d3-3322-b20c-036da60b86c9,46), DC=Value(datacenter1,6), 
RACK=Value(rack1,8), RELEASE_VERSION=Value(3.11.3-SNAPSHOT,4), 
RPC_ADDRESS=Value(127.0.0.3,3), NET_VERSION=Value(11,1), 
HOST_ID=Value(e65cd54b-a568-42d9-ae25-70e566167fc1,2), 
TOKENS=Value(^@^@^@^H*ªªª^@^@^@^@,17), RPC_READY=Value(true,32)}, 
127.0.0.2:7000=EndpointState: HeartBeatState = HeartBeat: generation = 
1545427529, version = 2147483647, AppStateMap = 
{STATUS=Value(shutdown,true,132), LOAD=Value(1.1576468E7,347), 
SCHEMA=Value(e8a984be-18d3-3322-b20c-036da60b86c9,62), DC=Value(datacenter1,6), 
RACK=Value(rack1,8), RELEASE_VERSION=Value(3.11.3-SNAPSHOT,4), 
RPC_ADDRESS=Value(127.0.0.2,3), NET_VERSION=Value(11,1), 
HOST_ID=Value(b14298c2-4322-49de-9747-407dc5f16bb6,2), 
TOKENS=Value(^@^@^@^HÕUUU^@^@^@^@,17), RPC_READY=Value(false,133), 
STATUS_WITH_PORT=Value(shutdown,true,131)}, 127.0.0.1:7000=EndpointState: 
HeartBeatState = HeartBeat: generation = 1545427751, version = 152, AppStateMap 
= {STATUS=Value(NORMAL,-9223372036854775808,31), LOAD=Value(4494006.0,104), 
SCHEMA=Value(3b77100c-2a8c-318f-a9e0-c5bfc4a4bde4,42), DC=Value(datacenter1,7), 
RACK=Value(rack1,9), RELEASE_VERSION=Value(4.0-SNAPSHOT,5), 
RPC_ADDRESS=Value(127.0.0.1,4), NET_VERSION=Value(12,1), 
HOST_ID=Value(01625b7d-2315-47e8-b031-da3cd9382161,2), 
TOKENS=Value(^@^@^@^H<80>^@^@^@^@^@^@^@^@^@^@^@,29), RPC_READY=Value(true,54), 
NATIVE_ADDRESS_AND_PORT=Value(127.0.0.1:9042,3), 
STATUS_WITH_PORT=Value(NORMAL,-9223372036854775808,30)}}
INFO  [GossipStage:1] 2018-12-21 16:30:59,322 
GossipDigestSynVerbHandler.java:104 - sending [] digests and 
{127.0.0.3:7000=EndpointState: HeartBeatState = HeartBeat: generation = 
1545427545, version = 360, AppStateMap = 
{STATUS=Value(NORMAL,3074457345618258602,18), LOAD=Value(1.1576835E7,349), 
SCHEMA=Value(e8a984be-18d3-3322-b20c-036da60b86c9,46), DC=Value(datacenter1,6), 
RACK=Value(rack1,8), RELEASE_VERSION=Value(3.11.3-SNAPSHOT,4), 
RPC_ADDRESS=Value(127.0.0.3,3), NET_VERSION=Value(11,1), 
HOST_ID=Value(e65cd54b-a568-42d9-ae25-70e566167fc1,2), 
TOKENS=Value(^@^@^@^H*ªªª^@^@^@^@,17), RPC_READY=Value(true,32)}, 
127.0.0.2:7000=EndpointState: HeartBeatState = HeartBeat: generation = 
1545427529, version = 2147483647, AppStateMap = 
{STATUS=Value(shutdown,true,132), LOAD=Value(1.1576468E7,347), 

[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-12-21 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726920#comment-16726920
 ] 

Ariel Weisberg commented on CASSANDRA-14155:


Got the endpoint state map + some additional logging during the shadow round.

{noformat}
java.lang.AssertionError: previous (Gossip ApplicationState.HOST_ID for this node) is null: Messages = Starting shadow gossip round to check for endpoint collision at 127.0.0.2:7000
Sending shadow round GOSSIP DIGEST SYN to seeds [127.0.0.1:7000, 127.0.0.3:7000]
Received a regular ack from 127.0.0.1:7000, can now exit shadow round, epStates = {127.0.0.2:7000=EndpointState: HeartBeatState = HeartBeat: generation = 154535, version = 2147483647, AppStateMap = {}, 127.0.0.1:7000=EndpointState: HeartBeatState = HeartBeat: generation = 1545333481, version = 123, AppStateMap = {}}
at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:834)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:556)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:835)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:693)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:644)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:369)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:581)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:670)
{noformat}

{noformat}
java.lang.AssertionError: previous (Gossip ApplicationState.HOST_ID for this node) is null: Messages = Starting shadow gossip round to check for endpoint collision at 127.0.0.2:7000
Sending shadow round GOSSIP DIGEST SYN to seeds [127.0.0.1:7000, 127.0.0.3:7000]
Received a regular ack from 127.0.0.1:7000, can now exit shadow round, epStateMap {127.0.0.3:7000=EndpointState: HeartBeatState = HeartBeat: generation = 1545410617, version = 240, AppStateMap = {}, 127.0.0.2:7000=EndpointState: HeartBeatState = HeartBeat: generation = 1545410603, version = 2147483647, AppStateMap = {}, 127.0.0.1:7000=EndpointState: HeartBeatState = HeartBeat: generation = 1545410741, version = 123, AppStateMap = {}}, epStates = {127.0.0.3:7000=EndpointState: HeartBeatState = HeartBeat: generation = 1545410617, version = 240, AppStateMap = {}, 127.0.0.2:7000=EndpointState: HeartBeatState = HeartBeat: generation = 1545410603, version = 2147483647, AppStateMap = {}, 127.0.0.1:7000=EndpointState: HeartBeatState = HeartBeat: generation = 1545410741, version = 123, AppStateMap = {}}
at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:834)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:556)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:835)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:693)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:644)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:369)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:581)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:670)
{noformat}


[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-12-20 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726138#comment-16726138
 ] 

Ariel Weisberg commented on CASSANDRA-14155:


This is failing in the rolling upgrade case. [There is a 60 second sleep in the 
test between every node upgrade to the newer 
version.|https://github.com/apache/cassandra-dtest/blob/master/upgrade_tests/upgrade_through_versions_test.py#L344]
 It seems unlikely A or B would not have gossiped with C before C is upgraded.

I got the entire endpoint state map.
{noformat}
ERROR [main] 2018-12-20 18:34:12,868 CassandraDaemon.java:692 - Exception encountered during startup
java.lang.AssertionError: previous (Gossip ApplicationState.HOST_ID for this node) is null: {127.0.0.2:7000=EndpointState: HeartBeatState = HeartBeat: generation = 1545330606, version = 2147483647, AppStateMap = {}}
at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:833)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:551)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:830)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:688)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:639)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:369)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:581)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:670)
{noformat}

I am going to add more logging. 


[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-12-20 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726102#comment-16726102
 ] 

Ariel Weisberg commented on CASSANDRA-14155:


I had it log the endpoint state. I should have logged the entire endpoint state map.
{noformat}
17:58:15,275 conftest ERROR Unexpected error in node2 log, error: 
ERROR [main] 2018-12-20 17:54:19,939 CassandraDaemon.java:692 - Exception encountered during startup
java.lang.AssertionError: previous (Gossip ApplicationState.HOST_ID for this node) is null: EndpointState: HeartBeatState = HeartBeat: generation = 1545328265, version = 2147483647, AppStateMap = {}
at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:833)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:551)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:830)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:688)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:639)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:369)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:581)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:670)
{noformat}




[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-12-20 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726072#comment-16726072
 ] 

Ariel Weisberg commented on CASSANDRA-14155:


I am debugging this right now. I can reproduce it fairly reliably with the 
upgrade tests. https://circleci.com/gh/aweisberg/cassandra/2301

Containers 5, 31, 35 all failed.




[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-03-29 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419003#comment-16419003
 ] 

Jason Brown commented on CASSANDRA-14155:
-

bq. it would not include any state for that peer as the generation/version in 
the digest would match the one in the local epState

The SYN from the peer in a shadow round has an empty digest list; see 
{{Gossiper#doShadowRound()}}, where {{gDigests}} is created but nothing is 
added to it. Thus, I believe node B from my example would send a list of 
{{InetAddress}}es with no {{ApplicationStates}}.
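
As a rough illustration of that point, the sketch below models a responder that 
treats an empty digest list as a shadow request and returns every endpoint 
state it holds, whatever those states contain. The names 
({{ShadowSynReplySketch}}, {{buildReply}}) and the string-keyed maps are 
hypothetical, and the non-empty-digest branch is deliberately simplified (the 
real comparison of generation and version is left out).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of how a reply to a SYN could be assembled; not Cassandra code.
final class ShadowSynReplySketch {

    /**
     * Builds the endpoint -> application-state map to send back.
     * local maps each known endpoint to its application states (possibly empty).
     */
    static Map<String, Map<String, String>> buildReply(List<String> digests,
                                                       Map<String, Map<String, String>> local) {
        Map<String, Map<String, String>> reply = new HashMap<>();
        if (digests.isEmpty()) {
            // Empty digest list = shadow request: add all states as they are,
            // including endpoints for which only a heartbeat is known.
            reply.putAll(local);
            return reply;
        }
        // Simplification: answer only for the endpoints named in the digests.
        for (String endpoint : digests) {
            Map<String, String> state = local.get(endpoint);
            if (state != null) {
                reply.put(endpoint, state);
            }
        }
        return reply;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> local = new HashMap<>();
        local.put("127.0.0.2:7000", new HashMap<>()); // heartbeat only, no app states
        Map<String, String> app = new HashMap<>();
        app.put("STATUS", "NORMAL");
        local.put("127.0.0.3:7000", app);

        System.out.println(buildReply(List.of(), local));
    }
}
```

In this model the requester receives an entry for 127.0.0.2:7000 with no 
application states, which is the shape of reply described above.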

Either way, distributed data races are hard :) and I'm not sure we need to beat 
this horse further without more evidence.

bq. So I'm all for adding the check & assertion error in isSafeForStartup, 
although I think we ought to log more detail here, probably the epStates map in 
its entirety. I'm less comfortable with changing the behaviour of the shadow 
round if we're not really clear on what's causing it.

This is reasonable. Since we're adding logging for now for debug purposes, 
should I add to both 3.11 and trunk, or just trunk? (I'm leaning toward just 
trunk, but I'm fine either way)
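
For reference, the kind of check being discussed could look roughly like the 
sketch below: an explicit null check that fails with an AssertionError carrying 
the whole epStates map rather than a bare NPE. This is an assumption-laden 
illustration, not the actual patch; {{StartupCheckSketch}}, 
{{previousHostId}} and the string-keyed maps are hypothetical stand-ins for 
the real Gossiper types.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a null check with a detailed assertion; not the committed change.
final class StartupCheckSketch {

    /**
     * Returns the HOST_ID previously gossiped for self, or null when nothing
     * at all is known about self. Throws an AssertionError that includes the
     * full epStates map when self has an entry but no HOST_ID.
     */
    static String previousHostId(String self, Map<String, Map<String, String>> epStates) {
        Map<String, String> own = epStates.get(self);
        if (own == null) {
            return null; // no previous state for this node
        }
        String hostId = own.get("HOST_ID");
        if (hostId == null) {
            // Fail loudly with the whole map so the log shows what the shadow round returned.
            throw new AssertionError(
                "previous (Gossip ApplicationState.HOST_ID for this node) is null: " + epStates);
        }
        return hostId;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> epStates = new HashMap<>();
        epStates.put("127.0.0.2:7000", new HashMap<>()); // AppStateMap = {}
        try {
            previousHostId("127.0.0.2:7000", epStates);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```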





[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-03-20 Thread Sam Tunnicliffe (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406751#comment-16406751
 ] 

Sam Tunnicliffe commented on CASSANDRA-14155:
-

I'm not sure that the scenario above can happen quite as described. When 
{{loadRingState}} adds the endpoints to {{endpointStateMap}}, they're created 
with a brand new {{HeartBeatState}}, one with 
{{(generation, version) == (0, 0)}}. In {{Gossiper::examineGossiper}}, the empty 
digest list in a shadow SYN is replaced with a list containing one digest for 
every known endpoint, and these are also initialized with {{(0,0)}}. So if a 
node were to finish its shadow round, load ring state, start gossip and 
immediately receive a shadow round SYN from a peer, it would not include any 
state for that peer, as the generation/version in the digest would match the 
one in the local epState. 
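
To make the digest comparison concrete, here is a minimal Java sketch. It is not the actual {{Gossiper}} code: {{Digest}}, {{DigestCompare}} and {{needsStateSent}} are hypothetical names, and the logic is reduced to the rule described above (compare generation first, then version).

```java
// Hypothetical, simplified model of the digest comparison; not the real Gossiper code.
final class Digest {
    final String endpoint;
    final int generation;
    final int version;

    Digest(String endpoint, int generation, int version) {
        this.endpoint = endpoint;
        this.generation = generation;
        this.version = version;
    }
}

final class DigestCompare {
    // Returns true when the local state is newer than what the remote digest reports,
    // i.e. when the receiver of the SYN would include state for this endpoint in its reply.
    static boolean needsStateSent(Digest remote, int localGeneration, int localVersion) {
        if (localGeneration != remote.generation)
            return localGeneration > remote.generation;
        return localVersion > remote.version;
    }
}
```

With a digest of (0, 0) and a local state of (0, 0), {{needsStateSent}} returns false, which is why a freshly loaded, empty state should not normally be sent back.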

Of course, the stacktrace in the description certainly indicates that 
the epStates map obtained from the shadow round did contain a state for the 
node in question and that its {{HOST_ID}} appState is missing. So I'm all for 
adding the check & assertion error in {{isSafeForStartup}}, although I think we 
ought to log more detail here, probably the epStates map in its entirety. I'm 
less comfortable with changing the behaviour of the shadow round if we're not 
really clear on what's causing it. As we've only seen this sporadically in 
tests, how do you feel about adding the assertion (& any other error logging 
that may be useful) and seeing if that helps us track down the cause if/when we 
see the error in future test runs? My fear is that this is a symptom of a more 
pernicious race like the ones in CASSANDRA-13700 & CASSANDRA-11825.
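
As a rough illustration of the kind of check meant here, the following Java sketch fails with a descriptive error that includes the whole map rather than a bare NPE. The class {{StartupCheck}}, the method {{requireHostId}} and the use of plain string maps are assumptions made for the example; the real types live in {{Gossiper}}.

```java
import java.util.*;

// Hypothetical stand-in for the proposed check; names and types are illustrative only.
final class StartupCheck {
    // epStates maps an endpoint address to its application states (state name -> value).
    static String requireHostId(String endpoint, Map<String, Map<String, String>> epStates) {
        Map<String, String> appStates = epStates.get(endpoint);
        if (appStates == null)
            return null; // the peer knows nothing about this endpoint, so nothing to check

        String hostId = appStates.get("HOST_ID");
        if (hostId == null)
            // Include the full map in the message so a sporadic failure can be diagnosed later.
            throw new AssertionError("Missing HOST_ID for " + endpoint
                                     + "; shadow round epStates: " + epStates);
        return hostId;
    }
}
```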


[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-01-25 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340341#comment-16340341
 ] 

Jason Brown commented on CASSANDRA-14155:
-

/cc [~beobal] [~jkni]



[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-01-25 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340340#comment-16340340
 ] 

Jason Brown commented on CASSANDRA-14155:
-

*WHAT IS HAPPENING?*
 So, the obvious part is that we aren't finding the {{HOST_ID}} in the endpoint's 
state, but where is that data coming from? With CASSANDRA-10134 (in c* 3.6), we 
began [performing a shadow round of 
gossip|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L507]
 on every bounce of a node. The shadow round data comes from any peer in the 
node's seed list. To hit the NPE state, the shadow round data provided by the 
seed must contain an entry in the Map for the node's {{InetAddress}}, but must 
not contain the {{HOST_ID}}, and as I suspect, no {{ApplicationStates}}; see 
next section.

(Note that CASSANDRA-12653, committed to 3.11, moved the collected shadow round 
state from {{Gossiper#endpointStateMap}} to 
{{Gossiper#endpointShadowStateMap}}. However, I do not believe that will affect 
the observed behavior here).

*HOW ARE WE GETTING INTO THIS STATE?*
 Barring some kind of Byzantine failure, my best guess is this: assume three 
nodes, A-B-C, and C is the node that hits the NPE. C contacts its seed nodes 
(in this example, at a minimum B), and the response from B is the first one 
processed. Given the explanation above of how C processes B's shadow round 
data, I think B itself has just left its own shadow round (by getting a 
response back to its own shadow round, which presumably comes from A in this 
example).

Then, on B:
 - in {{StorageService#prepareToJoin()}}, we 
[{{loadRingState()}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L788].
 This will insert into B's {{Gossiper#endpointStateMap}} all of the peers (via 
{{InetAddress}}) that we knew about before the bounce. NOTE: we do not add in 
any previously known {{ApplicationState}}s. Thus, {{Gossiper#endpointStateMap}} 
contains {{InetAddress}} es which point to 'empty' {{EndpointState}} s (no 
populated {{ApplicationState}} s).
 - We then start the 
[{{Gossiper}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L792],
 after which we will start processing any incoming gossip message.
 - If the first incoming gossip message is a SYN from C, we will happily send 
back everything we know about the cluster. In the case of B, which has just 
bounced, it basically only knows the {{InetAddress}} es of peers - no 
{{ApplicationStates}}.

Then, C gets back the (more or less) empty gossip data from B, and because it 
["sees" its own address in that 
shadowRoundData|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/Gossiper.java#L766],
 it assumes it should also see metadata ({{ApplicationState}} s) about 
itself. That's when it looks up the {{HOST_ID}}, and [naively tries to 
dereference 
it|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/Gossiper.java#L788]
 - causing the NPE in this case.
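
The sequence above can be modelled with a small Java sketch. This is only an illustration of the suspected race, not the real code path: {{ShadowRoundRace}}, {{ringStateAfterBounce}} and {{ownHostId}} are hypothetical names, and endpoint states are reduced to plain string maps.

```java
import java.util.*;

// Hypothetical model of the suspected race; names and types are illustrative only.
final class ShadowRoundRace {
    // What B holds right after loading ring state: known addresses, each with an empty state.
    static Map<String, Map<String, String>> ringStateAfterBounce(List<String> knownPeers) {
        Map<String, Map<String, String>> states = new HashMap<>();
        for (String peer : knownPeers)
            states.put(peer, new HashMap<>());
        return states;
    }

    // What C effectively does with the reply: find its own address, then dereference HOST_ID.
    static UUID ownHostId(String self, Map<String, Map<String, String>> shadowRoundData) {
        Map<String, String> own = shadowRoundData.get(self);
        if (own == null)
            return null; // C is not mentioned at all, so there is nothing to compare
        // Throws NullPointerException when the entry exists but has no HOST_ID.
        return UUID.fromString(own.get("HOST_ID"));
    }
}
```

Calling {{ownHostId("C", ringStateAfterBounce(Arrays.asList("A", "C")))}} throws a {{NullPointerException}}, mirroring the failure in the description.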

*SOLUTION:*
 I don't think we can necessarily fix the distributed race on restart 
without larger structural changes, but we can change how a node determines if 
it can exit the shadow round. As we're basically only checking for a previous 
{{HOST_ID}} for the current node in the shadow round data, I propose we add a 
check to {{Gossiper#maybeFinishShadowRound()}} that, in addition to the existing 
checks, looks at whether the data contains the {{HOST_ID}} for the current node. 
If so, exit the shadow round as usual; else, keep waiting for a more complete 
set of gossip data.
||14155||
|[branch|https://github.com/jasobrown/cassandra/tree/14155]|
|[utests & 
dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/14155]|
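
The proposed exit condition could look roughly like the Java sketch below. It is not taken from the linked branch: {{ShadowRoundExit}} and {{canFinishShadowRound}} are hypothetical names, and treating a reply that does not mention the node at all as acceptable is an assumption made here so that a brand new node would not wait forever.

```java
import java.util.*;

// Hypothetical sketch of the extra shadow-round check; not the code from the linked branch.
final class ShadowRoundExit {
    // Returns true when the reply is complete enough for this node to leave the shadow round.
    static boolean canFinishShadowRound(String self, Map<String, Map<String, String>> epStateMap) {
        Map<String, String> own = epStateMap.get(self);
        // Assumption: a reply that does not mention this node is fine (e.g. a new node);
        // a reply that mentions it must also carry its HOST_ID.
        return own == null || own.containsKey("HOST_ID");
    }
}
```

With this rule, the mostly empty reply from a just-bounced peer is ignored and the node keeps waiting for a fuller response.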

For convenience, here's a comparison against trunk (obviously, ignore the 
circleci yaml): [compare against 
trunk|https://github.com/apache/cassandra/compare/trunk...jasobrown:14155]

NOTE: this patch is against trunk, but I think we'll also need it for 3.11


[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-01-09 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319636#comment-16319636
 ] 

Michael Kjellman commented on CASSANDRA-14155:
--

i've seen it on a few tests now. [~jasobrown] did a bunch of debugging this 
morning and last update i heard from him was he knew *why* it was happening but 
didn't know how we get into the state in the first place.

FWIW we've seen a very similar stack in production with 2.1 -- so i think the 
big question now is if this is something unique to trunk, 3.0+, or just again 
exposing the fact that gossip still remains a racy hell-hole in 2018.



[jira] [Commented] (CASSANDRA-14155) [TRUNK] Gossiper somewhat frequently hitting an NPE on node startup with dtests at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:769)

2018-01-09 Thread Kurt Greaves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319631#comment-16319631
 ] 

Kurt Greaves commented on CASSANDRA-14155:
--

Any more context around when this happens? AFAICT the only way this can happen 
is if "null" accidentally got written as the local host's ID, which seems 
unlikely for a new node but possible for a node that is replacing.

Can you check if this has occurred on tests that are replacing nodes?
