[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

Brandon Williams (JIRA) Tue, 11 Feb 2014 11:15:19 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898166#comment-13898166
 ]


Brandon Williams commented on CASSANDRA-6590:
---------------------------------------------

Hmm, so I went to test v4 and this time the ring weirdness came back, so 
perhaps it's just intermittent, but here's what it looks like.  There are three 
nodes, 10.208.8.123, 10.208.35.225, and 10.208.8.63.  123 is the seed, all 
nodes were started at the same time, and 63 is blocked from 225.  The log from 
123 looks normal:

{noformat}
 INFO 19:04:26,900 Handshaking version with /10.208.8.63
 INFO 19:04:26,913 Node /10.208.8.63 is now part of the cluster
 INFO 19:04:26,923 Handshaking version with /10.208.8.63
 INFO 19:04:26,963 Node bw-1/10.208.8.123 state jump to normal
 INFO 19:04:27,004 Startup completed! Now serving reads.
 INFO 19:04:27,076 Waiting for gossip to settle before accepting client requests...
 INFO 19:04:27,091 Handshaking version with /10.208.35.225
 INFO 19:04:27,097 Compacted 4 sstables to [/var/lib/cassandra/data/system/local/system-local-jb-5,].  5,846 bytes to 5,684 (~97% of original) in 250ms = 0.021683MB/s.  4 total partitions merged to 1.  Partition merge counts were {4:1, }
 INFO 19:04:27,100 Node /10.208.35.225 is now part of the cluster
 INFO 19:04:27,102 Handshaking version with /10.208.35.225
 INFO 19:04:35,190 Starting listening for CQL clients on bw-1/10.208.8.123:9042...
 INFO 19:04:35,252 Using TFramedTransport with a max frame size of 15728640 bytes.
 INFO 19:04:35,253 Binding thrift service to bw-1/10.208.8.123:9160
 INFO 19:04:35,262 Using synchronous/threadpool thrift server on bw-1 : 9160
 INFO 19:04:35,262 Listening for thrift clients...
{noformat}

And it can see the other nodes in nodetool status, but can't tell whether they are up or down (the '?' in the first column):

{noformat}
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.208.8.123   40.9 KB    256     68.1%             fa02838d-c39b-4d44-90db-f21a359deb12  rack1
?N  10.208.8.63    40.85 KB   256     63.0%             90e71b90-9b41-4482-9521-71ba479c964e  rack1
?N  10.208.35.225  40.93 KB   256     68.9%             e2fe818d-5d6c-47f9-8015-4580254cb91f  rack1
{noformat}
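
For anyone trying to reproduce this, the '?' rows can be picked out of the status output mechanically. This is only a rough sketch, assuming output in the layout shown above; the function name unknown_status_nodes is made up for illustration and is not part of nodetool or Cassandra. It reads just the two flag characters and the address, so it does not care whether the Host ID column has wrapped.

```python
import re

# Matches a data row such as "?N  10.208.8.63  40.85 KB ...":
# one status flag (U, D or ?), one state flag (N, L, J or M), then an IPv4 address.
ROW = re.compile(r"^([UD?])([NLJM])\s+(\d{1,3}(?:\.\d{1,3}){3})\b")


def unknown_status_nodes(status_output):
    """Return the addresses whose Up/Down status is reported as '?'.

    Hypothetical helper for illustration only.
    """
    unknown = []
    for line in status_output.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(1) == "?":
            unknown.append(match.group(3))
    return unknown


sample = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID  Rack
UN  10.208.8.123   40.9 KB    256     68.1%  fa02838d-c39b-4d44-90db-f21a359deb12  rack1
?N  10.208.8.63    40.85 KB   256     63.0%  90e71b90-9b41-4482-9521-71ba479c964e  rack1
?N  10.208.35.225  40.93 KB   256     68.9%  e2fe818d-5d6c-47f9-8015-4580254cb91f  rack1
"""

print(unknown_status_nodes(sample))
```

Run against the seed's output above, this should list 10.208.8.63 and 10.208.35.225.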


The other two nodes can't even see anything but themselves:

{noformat}
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.208.35.225  40.93 KB   256     100.0%            e2fe818d-5d6c-47f9-8015-4580254cb91f  rack1

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.208.8.63  40.85 KB   256     100.0%            90e71b90-9b41-4482-9521-71ba479c964e  rack1
{noformat}

Even though both are fully connected to the seed:

{noformat}
tcp        0      0 10.208.8.123:57215      10.208.8.63:7000        ESTABLISHED 16517/java
tcp        0      0 10.208.8.123:7000       10.208.8.63:37973       ESTABLISHED 16517/java
tcp        0      0 10.208.8.123:7000       10.208.35.225:41926     ESTABLISHED 16517/java
tcp        0      0 10.208.8.123:59308      10.208.35.225:7000      ESTABLISHED 16517/java
{noformat}
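
The "fully connected" check above can also be done mechanically: each peer should have one ESTABLISHED connection in each direction on the storage port (7000 here). A rough sketch, assuming netstat lines in the column order shown above; gossip_links is a made-up name for illustration, not an existing tool.

```python
def gossip_links(netstat_output, port=7000):
    """Map each peer IP to the directions ('in', 'out') that have an
    ESTABLISHED TCP connection involving the given port.

    Hypothetical helper for illustration only.
    """
    links = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local remote state [pid/program]
        if len(fields) < 6 or fields[0] != "tcp" or fields[5] != "ESTABLISHED":
            continue
        local_ip, local_port = fields[3].rsplit(":", 1)
        remote_ip, remote_port = fields[4].rsplit(":", 1)
        if remote_port == str(port):
            # We connected to the peer's storage port.
            links.setdefault(remote_ip, set()).add("out")
        elif local_port == str(port):
            # The peer connected to our storage port.
            links.setdefault(remote_ip, set()).add("in")
    return links


netstat_sample = """\
tcp 0 0 10.208.8.123:57215 10.208.8.63:7000 ESTABLISHED 16517/java
tcp 0 0 10.208.8.123:7000 10.208.8.63:37973 ESTABLISHED 16517/java
tcp 0 0 10.208.8.123:7000 10.208.35.225:41926 ESTABLISHED 16517/java
tcp 0 0 10.208.8.123:59308 10.208.35.225:7000 ESTABLISHED 16517/java
"""

for peer, directions in sorted(gossip_links(netstat_sample).items()):
    print(peer, sorted(directions))
```

A peer that shows both 'in' and 'out' has a connection open each way, which is the case for both nodes on the seed here.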

I'm not sure what's going on here yet.

> Gossip does not heal after a temporary partition at startup
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-6590
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Vijay
>             Fix For: 2.0.6
>
>         Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch, 
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background.  If a node is partitioned on startup when 
> the echo command is sent, but then the partition heals, the halves of the 
> partition will never mark each other up despite being able to communicate.  
> This stems from CASSANDRA-3533.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
