[ 
https://issues.apache.org/jira/browse/CASSANDRA-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500545#comment-14500545
 ] 

Brandon Williams edited comment on CASSANDRA-8072 at 4/19/15 6:39 PM:
----------------------------------------------------------------------

After deep packet inspection, I believe I've found the root cause of the part 
of this issue that affects non-reconnecting snitches.  When you decommission a 
node, it never correctly tears down its incoming TCP connection (ITC) pools, 
which leaves the other side with a dead outbound TCP connection (OTC) pool:

{noformat}
tcp        1      0 10.208.8.123:33441      10.208.8.63:7000        CLOSE_WAIT  18401/java
{noformat}
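
To spot the same symptom on another seed, the netstat output can be filtered 
for sockets stuck in CLOSE_WAIT toward the storage port.  This is a minimal 
sketch, not part of Cassandra: the class and method names are hypothetical, 
and it assumes the usual Linux netstat column order (proto, recv-q, send-q, 
local, foreign, state).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Cassandra code: scans netstat-style output for
// sockets in CLOSE_WAIT whose foreign address ends with the given port.
final class CloseWaitFinder {
    static List<String> findCloseWait(String netstatOutput, int remotePort) {
        List<String> hits = new ArrayList<>();
        for (String line : netstatOutput.split("\\R")) {
            String[] f = line.trim().split("\\s+");
            // Assumed columns: proto recv-q send-q local foreign state [pid/program]
            if (f.length < 6 || !f[0].startsWith("tcp")) continue;
            if (!"CLOSE_WAIT".equals(f[5])) continue;
            if (f[4].endsWith(":" + remotePort)) {
                hits.add(f[3] + " -> " + f[4]);
            }
        }
        return hits;
    }
}
```

Calling findCloseWait(output, 7000) on the line above would report the 
connection from 10.208.8.123:33441 to 10.208.8.63:7000.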

Now when you try to bootstrap with the same IP, the shadow syn is correctly 
sent and the ack reply is built and queued, but MessagingService (MS) tries to 
use the now-defunct OTC pool and the message never makes it back to the node: 
the node just answers with TCP RSTs, which finally kill the connection.  But 
since the gossip syn is only sent once, the seed has nothing else to send the 
node and never reestablishes the connection, leaving the bootstrapping node 
thinking it never talked to a seed and throwing this error.
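
The sequence can be illustrated with a toy model of a seed that caches one 
outbound connection per peer.  Everything here is hypothetical and simplified: 
the names are not Cassandra's, the "send" is just a boolean, and resetPeer is 
only one conceivable remedy, not necessarily the fix applied for this ticket.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; class and method names are hypothetical, not Cassandra's.
final class StaleOutboundModel {
    /** One cached outbound connection to a peer. */
    static final class Conn {
        boolean peerClosed; // true once the remote process has exited

        /** Returns true if the message would reach the peer. */
        boolean send(String msg) {
            return !peerClosed; // a message on a dead connection is lost
        }
    }

    private final Map<String, Conn> pool = new HashMap<>();

    /** Reuses the cached connection for a peer, creating one if absent. */
    Conn connectionFor(String peer) {
        return pool.computeIfAbsent(peer, p -> new Conn());
    }

    /** The peer was decommissioned; the seed still holds the old socket. */
    void peerWentAway(String peer) {
        Conn c = pool.get(peer);
        if (c != null) {
            c.peerClosed = true;
        }
    }

    /** Hypothetical remedy: drop the cached connection so the next send reconnects. */
    void resetPeer(String peer) {
        pool.remove(peer);
    }

    /** Seed answers a shadow-round syn; true if the ack reaches the peer. */
    boolean replyToShadowSyn(String peer) {
        return connectionFor(peer).send("GOSSIP_DIGEST_ACK");
    }
}
```

In this model, a reply sent after peerWentAway is lost because the stale 
connection is reused, and since the syn is sent only once there is no second 
attempt; a reply sent after resetPeer goes out on a fresh connection.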


was (Author: brandon.williams):
After deep packet inspection, I believe I've found the root non-reconnectable 
snitch part of this issue.  When you decom a node, it never correctly tears 
down its ITC pools, which leaves the other side with a dead OTC pool:

{noformat}
tcp        1      0 10.208.8.123:33441      10.208.8.63:7000        CLOSE_WAIT  18401/java
{noformat}

Now when you try to bootstrap with the same IP, the shadow syn is correctly 
sent and the ack reply is built and queued, but MS tries to use the now default 
OTC pool and the message never makes it back to the node, since it just sends 
RSTs which finally kills the connection.  But since the syn is only sent once, 
the seed has nothing else to send the node and never reestablishes the 
connection, leaving the bootstrapping node thinking it never talked to a seed 
and throwing this error.

> Exception during startup: Unable to gossip with any seeds
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-8072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8072
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ryan Springer
>            Assignee: Brandon Williams
>             Fix For: 2.0.15, 2.1.5
>
>         Attachments: cas-dev-dt-01-uw1-cassandra-seed01_logs.tar.bz2, 
> cas-dev-dt-01-uw1-cassandra-seed02_logs.tar.bz2, 
> cas-dev-dt-01-uw1-cassandra02_logs.tar.bz2, 
> casandra-system-log-with-assert-patch.log, trace_logs.tar.bz2
>
>
> When Opscenter 4.1.4 or 5.0.1 tries to provision a 2-node DSC 2.0.10 cluster 
> in either ec2 or locally, an error occurs sometimes with one of the nodes 
> refusing to start C*.  The error in the /var/log/cassandra/system.log is:
> ERROR [main] 2014-10-06 15:54:52,292 CassandraDaemon.java (line 513) Exception encountered during startup
> java.lang.RuntimeException: Unable to gossip with any seeds
>         at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1200)
>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:444)
>         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:655)
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:609)
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:502)
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:52,326 Gossiper.java (line 1279) Announcing shutdown
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:54,326 MessagingService.java (line 701) Waiting for messaging service to quiesce
>  INFO [ACCEPT-localhost/127.0.0.1] 2014-10-06 15:54:54,327 MessagingService.java (line 941) MessagingService has terminated the accept() thread
> This error does not always occur when provisioning a 2-node cluster, but it 
> does about half of the time, and only on one of the nodes.  I haven't been 
> able to reproduce this error with DSC 2.0.9, and there have been no code or 
> definition file changes in Opscenter.
> I can reproduce locally with the above steps.  I'm happy to test any proposed 
> fixes since I'm the only person able to reproduce reliably so far.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
