[ 
https://issues.apache.org/jira/browse/CASSANDRA-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066524#comment-15066524
 ] 

Stefania commented on CASSANDRA-8072:
-------------------------------------

Building on [~brandon.williams]'s previous analysis, but taking into account 
more recent changes where we do close sockets, the problem is still that the 
seed node sends the ACK on the old socket, even after the decommissioned node 
has closed it. This is because we only send on these sockets, so we cannot 
know that they are closed until the send buffers are exceeded, unless we also 
try to read from them. However, the problem should now persist only until the 
node is convicted, approx 10 seconds with a {{phi_convict_threshold}} of 8. I 
verified this by adding a sleep of 15 seconds in my test before restarting the 
node, and it restarted without problems. [~slowenthal] would you be able to 
confirm this with your tests?
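
For anyone who wants to see the send-only behaviour in isolation, here is a 
minimal sketch using plain blocking {{java.net}} sockets on loopback. It is not 
Cassandra code and the class and method names are made up for illustration. It 
closes the accepted side of a connection and then, on the other side, either 
reads (which reports end of stream) or performs a single write and records 
whether that write threw. On typical TCP stacks I would expect the first write 
not to throw, but that is an assumption about the platform, so the sketch only 
reports it.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative only: hypothetical names, not part of Cassandra.
public class ClosedPeerDemo {

    /** Opens a loopback connection, closes the accepted (peer) side, returns the client side. */
    static Socket connectThenClosePeer(ServerSocket server) throws IOException {
        Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
        server.accept().close(); // the peer closes its end straight away
        return client;
    }

    /** Reading is what reveals the close: read() returns -1 at end of stream. */
    static int readAfterPeerClose() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = connectThenClosePeer(server)) {
            client.setSoTimeout(5000); // do not block forever if something goes wrong
            return client.getInputStream().read();
        }
    }

    /** Send-only usage: reports whether the first write after the peer's close threw. */
    static boolean firstWriteAfterPeerCloseThrows() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = connectThenClosePeer(server)) {
            try {
                client.getOutputStream().write(1);
                client.getOutputStream().flush();
                return false;
            } catch (IOException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read() after peer close returned: " + readAfterPeerClose());
        System.out.println("first write after peer close threw: " + firstWriteAfterPeerCloseThrows());
    }
}
```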

If we cannot detect when an outgoing socket is closed by its peer, then we need 
an out-of-band notification. This could come from the departing node 
announcing its shutdown at the end of its decommission but the existing logic 
in {{Gossiper.stop()}} prevents this for the dead states (*removing, removed, 
left and hibernate*) or for *bootstrapping*. This was introduced by 
CASSANDRA-8336 and the same problem has already been raised in CASSANDRA-9630. 
Even if we undo CASSANDRA-8336, there is another issue: since 
CASSANDRA-9765 we can no longer join a cluster in status SHUTDOWN, and I believe 
this is correct. So announcing a shutdown after decommission cannot be the 
answer, at least not without significant changes to the Gossip protocol. Closing 
the socket earlier, say when we get the status LEFT notification, is not 
sufficient because during the RING_DELAY sleep period we may re-establish the 
connection to the node before it dies, typically for a Gossip update. 

So I think we only have two options:

* read from outgoing sockets purely to detect when they are closed
* send a new GOSSIP flag indicating it is time to close the sockets to a node
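
To make the first option concrete, here is a rough sketch of a watcher that 
reads from an outgoing socket only to notice when it becomes unusable. It 
assumes blocking I/O and one extra thread per socket; 
{{OutboundCloseWatcher}}, {{watch}} and the callback are hypothetical names, 
and this is not how the existing outbound connection code is structured. Any 
bytes the peer happens to send are discarded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: hypothetical names, not part of Cassandra.
public class OutboundCloseWatcher {

    /**
     * Starts a daemon thread that reads and discards from the socket.
     * The callback runs once the stream ends (peer closed) or the read
     * fails (reset or local close); in both cases the socket is unusable.
     */
    static Thread watch(Socket socket, Runnable onClosed) {
        Thread t = new Thread(() -> {
            try {
                InputStream in = socket.getInputStream();
                byte[] discard = new byte[256];
                while (in.read(discard) != -1) {
                    // nothing is expected on this socket; ignore any data
                }
            } catch (IOException e) {
                // fall through and report the socket as closed
            }
            onClosed.run();
        }, "outbound-close-watcher");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Loopback demo: returns true if the watcher noticed the peer's close within 5 seconds. */
    static boolean demo() throws Exception {
        CountDownLatch closed = new CountDownLatch(1);
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket outbound = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            watch(outbound, closed::countDown);
            peer.close(); // simulate the departing node closing its end
            return closed.await(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("close detected: " + demo());
    }
}
```

In a real implementation the callback would presumably close and discard the 
outgoing connection so that the next message opens a fresh one; the cost is an 
extra reader per outgoing socket, which is the trade-off against the second 
option.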


> Exception during startup: Unable to gossip with any seeds
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-8072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8072
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Lifecycle
>            Reporter: Ryan Springer
>            Assignee: Stefania
>             Fix For: 2.1.x
>
>         Attachments: cas-dev-dt-01-uw1-cassandra-seed01_logs.tar.bz2, 
> cas-dev-dt-01-uw1-cassandra-seed02_logs.tar.bz2, 
> cas-dev-dt-01-uw1-cassandra02_logs.tar.bz2, 
> casandra-system-log-with-assert-patch.log, screenshot-1.png, 
> trace_logs.tar.bz2
>
>
> When Opscenter 4.1.4 or 5.0.1 tries to provision a 2-node DSC 2.0.10 cluster 
> in either ec2 or locally, an error occurs sometimes with one of the nodes 
> refusing to start C*.  The error in the /var/log/cassandra/system.log is:
> ERROR [main] 2014-10-06 15:54:52,292 CassandraDaemon.java (line 513) 
> Exception encountered during startup
> java.lang.RuntimeException: Unable to gossip with any seeds
>         at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1200)
>         at 
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:444)
>         at 
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:655)
>         at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:609)
>         at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:502)
>         at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
>         at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
>         at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:52,326 Gossiper.java 
> (line 1279) Announcing shutdown
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:54,326 
> MessagingService.java (line 701) Waiting for messaging service to quiesce
>  INFO [ACCEPT-localhost/127.0.0.1] 2014-10-06 15:54:54,327 
> MessagingService.java (line 941) MessagingService has terminated the accept() 
> thread
> This error does not always occur when provisioning a 2-node cluster, but 
> probably around half of the time on only one of the nodes.  I haven't been 
> able to reproduce this error with DSC 2.0.9, and there have been no code or 
> definition file changes in Opscenter.
> I can reproduce locally with the above steps.  I'm happy to test any proposed 
> fixes since I'm the only person able to reproduce reliably so far.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
