[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2017-12-05 Thread Yap Sok Ann (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278454#comment-16278454
 ] 

Yap Sok Ann commented on CASSANDRA-6590:


We just saw this in a cluster running 2.1.19. Maybe the behavior as described 
is not fixed yet?

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Brandon Williams
>Assignee: Vijay
> Fix For: 2.0.11
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch, 
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background.  If a node is partitioned on startup when 
> the echo command is sent, but then the partition heals, the halves of the 
> partition will never mark each other up despite being able to communicate.  
> This stems from CASSANDRA-3533.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2014-02-11 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898166#comment-13898166
 ] 

Brandon Williams commented on CASSANDRA-6590:
-

Hmm, so I went to test v4 and this time the ring weirdness came back, so 
perhaps it's just intermittent, but here's what it looks like.  There are three 
nodes, 10.208.8.123, 10.208.35.225, and 10.208.8.63.  123 is the seed, all 
nodes were started at the same time, and 63 is blocked from 225.  The log from 
123 looks normal:

{noformat}
 INFO 19:04:26,900 Handshaking version with /10.208.8.63
 INFO 19:04:26,913 Node /10.208.8.63 is now part of the cluster
 INFO 19:04:26,923 Handshaking version with /10.208.8.63
 INFO 19:04:26,963 Node bw-1/10.208.8.123 state jump to normal
 INFO 19:04:27,004 Startup completed! Now serving reads.
 INFO 19:04:27,076 Waiting for gossip to settle before accepting client requests...
 INFO 19:04:27,091 Handshaking version with /10.208.35.225
 INFO 19:04:27,097 Compacted 4 sstables to [/var/lib/cassandra/data/system/local/system-local-jb-5,].  5,846 bytes to 5,684 (~97% of original) in 250ms = 0.021683MB/s.  4 total partitions merged to 1.  Partition merge counts were {4:1, }
 INFO 19:04:27,100 Node /10.208.35.225 is now part of the cluster
 INFO 19:04:27,102 Handshaking version with /10.208.35.225
 INFO 19:04:35,190 Starting listening for CQL clients on bw-1/10.208.8.123:9042...
 INFO 19:04:35,252 Using TFramedTransport with a max frame size of 15728640 bytes.
 INFO 19:04:35,253 Binding thrift service to bw-1/10.208.8.123:9160
 INFO 19:04:35,262 Using synchronous/threadpool thrift server on bw-1 : 9160
 INFO 19:04:35,262 Listening for thrift clients...
{noformat}

And it can see the other nodes in status, but can't tell their state:

{noformat}
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.208.8.123   40.9 KB   256     68.1%             fa02838d-c39b-4d44-90db-f21a359deb12  rack1
?N  10.208.8.63    40.85 KB  256     63.0%             90e71b90-9b41-4482-9521-71ba479c964e  rack1
?N  10.208.35.225  40.93 KB  256     68.9%             e2fe818d-5d6c-47f9-8015-4580254cb91f  rack1
{noformat}


The other two nodes can't even see anything but themselves:

{noformat}
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.208.35.225  40.93 KB  256     100.0%            e2fe818d-5d6c-47f9-8015-4580254cb91f  rack1

Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.208.8.63  40.85 KB  256     100.0%            90e71b90-9b41-4482-9521-71ba479c964e  rack1
{noformat}

Even though both are fully connected to the seed:

{noformat}
tcp        0      0 10.208.8.123:57215  10.208.8.63:7000     ESTABLISHED 16517/java
tcp        0      0 10.208.8.123:7000   10.208.8.63:37973    ESTABLISHED 16517/java
tcp        0      0 10.208.8.123:7000   10.208.35.225:41926  ESTABLISHED 16517/java
tcp        0      0 10.208.8.123:59308  10.208.35.225:7000   ESTABLISHED 16517/java
{noformat}

I'm not sure what's going on here yet.

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Brandon Williams
>Assignee: Vijay
> Fix For: 2.0.6
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch, 
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background.  If a node is partitioned on startup when 
> the echo command is sent, but then the partition heals, the halves of the 
> partition will never mark each other up despite being able to communicate.  
> This stems from CASSANDRA-3533.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2014-02-08 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895835#comment-13895835
 ] 

Vijay commented on CASSANDRA-6590:
--

Hi Brandon, 
I was not able to reproduce the above issue. Below is the log after a network 
partition:
{code}
 INFO [GossipTasks:1] 2014-02-09 05:29:10,259 Gossiper.java (line 862) InetAddress /17.198.227.155 is now DOWN
 INFO [HANDSHAKE-/17.198.227.155] 2014-02-09 05:29:18,023 OutboundTcpConnection.java (line 386) Handshaking version with /17.198.227.155
 INFO [RequestResponseStage:33] 2014-02-09 05:29:18,038 Gossiper.java (line 848) InetAddress /17.198.227.155 is now UP
{code}

{quote}
I think we'll need a separate yaml option
{quote}
Done

{quote}
I'm not sure why the block in handleMajorStateChange moved
{quote}
It moved because the message was wrong: UP doesn't happen until the echo 
completes. Anyway, I reverted that.

Rebased at https://github.com/Vijay2win/cassandra/tree/6590-v4






[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2014-02-07 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894980#comment-13894980
 ] 

Brandon Williams commented on CASSANDRA-6590:
-

bq. the if (!localState.isAlive()) check is problematic, because while it got 
rid of the repeated UP messages, it also seemed to introduce a race situation

Never mind that part, it was something else.

I'm still seeing repeated messages when healing the partition though:

{noformat}
 INFO 20:16:56,176 Handshaking version with /10.208.8.63
 INFO 20:16:56,186 InetAddress /10.208.8.63 is now UP
 INFO 20:16:56,187 InetAddress /10.208.8.63 is now UP
 INFO 20:16:56,187 InetAddress /10.208.8.63 is now UP
 INFO 20:16:56,190 InetAddress /10.208.8.63 is now UP
 INFO 20:16:56,190 InetAddress /10.208.8.63 is now UP
 INFO 20:16:56,190 InetAddress /10.208.8.63 is now UP
 INFO 20:16:56,191 InetAddress /10.208.8.63 is now UP
 INFO 20:16:56,191 InetAddress /10.208.8.63 is now UP
 INFO 20:16:56,191 InetAddress /10.208.8.63 is now UP
 INFO 20:16:56,193 InetAddress /10.208.8.63 is now UP
{noformat}

What I mentioned before about the block in handleMajorStateChange and the yaml 
option still applies.



[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2014-02-05 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892699#comment-13892699
 ] 

Brandon Williams commented on CASSANDRA-6590:
-

I'm not sure why the block in handleMajorStateChange moved, but because the 
endpoint state is added before that, the check for it will never see null, so 
it always says the node restarted (and we should keep the 'UP' message there to 
keep it easy to look for) even though it's the first time the node has been 
seen.

I think the if (!localState.isAlive()) check is problematic, because while it 
got rid of the repeated UP messages, it also seemed to introduce a race where 
sometimes some nodes would end up in a cluster by themselves.  I briefly tried 
making Echo verbs droppable in CASSANDRA-6661 instead, but that didn't help, so 
I'm not sure why we're seemingly building these requests up, or if something 
else is making realMarkAlive fire so much.
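
To make the two symptoms concrete (an echo lost during the partition that is 
never retried, and many queued echo replies each logging UP), here is a minimal 
sketch of an echo-gated mark-alive step. It is not the Gossiper code and not 
the attached patch; the class EchoGate and all of its members are hypothetical 
names invented for illustration. It assumes the desired behaviour is one 
outstanding echo per endpoint, a retry after a lost echo, and a single UP line 
per transition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of an echo-gated "mark alive" step; not Cassandra's Gossiper.
final class EchoGate {
    private final Map<String, Boolean> alive = new HashMap<>(); // endpoint -> alive flag
    private final Set<String> inFlight = new HashSet<>();       // endpoints with a pending echo
    final List<String> log = new ArrayList<>();                 // stands in for the server log
    int echoesSent = 0;                                         // counts simulated echo sends

    // Called on every gossip round that hears about the endpoint.
    void markAlive(String endpoint) {
        if (alive.getOrDefault(endpoint, false)) return; // already up, nothing to do
        if (!inFlight.add(endpoint)) return;             // an echo is already pending
        echoesSent++;                                    // simulate sending the echo
    }

    // Echo reply arrived: mark the endpoint up and log only on a real transition.
    void onEchoResponse(String endpoint) {
        inFlight.remove(endpoint);
        Boolean wasAlive = alive.put(endpoint, true);
        if (!Boolean.TRUE.equals(wasAlive)) {
            log.add("InetAddress " + endpoint + " is now UP");
        }
    }

    // Echo lost or timed out: forget it so a later gossip round can retry.
    void onEchoFailure(String endpoint) {
        inFlight.remove(endpoint);
    }
}
```

In this model, repeated markAlive calls during a partition send one echo 
rather than a backlog, and clearing the in-flight entry on failure is what lets 
the node be marked up once the partition heals.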

Finally, I think we'll need a separate yaml option, since removing things in a 
minor is kind of mean to upgraders who don't catch it and their server won't 
start.
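
As a rough illustration of the upgrade concern, the sketch below assumes a 
settings loader that rejects keys it does not recognise. StrictSettings, 
KNOWN_KEYS, and the key names are hypothetical and are not taken from 
cassandra.yaml handling code; the point is only that a key removed from the 
known set makes an older, unedited file fail validation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical strict settings check; key names are illustrative only.
final class StrictSettings {
    // Keys this (hypothetical) release still understands.
    static final Set<String> KNOWN_KEYS =
            new HashSet<>(Arrays.asList("cluster_name", "listen_address"));

    // Throws if the file carries a key the release no longer knows about.
    static void validate(Map<String, String> fromFile) {
        for (String key : fromFile.keySet()) {
            if (!KNOWN_KEYS.contains(key)) {
                throw new IllegalArgumentException("Unknown property: " + key);
            }
        }
    }
}
```

Adding a new, separate option avoids this failure mode, since files written 
for the previous minor release still contain only known keys.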





[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2014-02-03 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889826#comment-13889826
 ] 

Vijay commented on CASSANDRA-6590:
--

Sorry, I was sending a different message during startup. Fixed and pushed to 
https://github.com/Vijay2win/cassandra/tree/6590-v3. Thanks!





[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2014-01-29 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885489#comment-13885489
 ] 

Brandon Williams commented on CASSANDRA-6590:
-

Hmm, this didn't actually change any of the logging for me.

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Brandon Williams
>Assignee: Vijay
> Fix For: 2.0.5
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch, 
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background.  If a node is partitioned on startup when 
> the echo command is sent, but then the partition heals, the halves of the 
> partition will never mark each other up despite being able to communicate.  
> This stems from CASSANDRA-3533.





[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2014-01-24 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881339#comment-13881339
 ] 

Brandon Williams commented on CASSANDRA-6590:
-

Nevermind, the conflict was trivial.  Patch works, but causes a flap during 
initial startup, and then repeats the UP message when the partition heals:

{noformat}
 INFO 19:21:10,451 Node cassandra-3/10.179.111.137 state jump to normal
 INFO 19:21:10,472 Startup completed! Now serving reads.
 INFO 19:21:10,475 waiting for gossip to settle before accepting client requests...
 INFO 19:21:10,660 Handshaking version with cassandra-1/10.179.65.102
 INFO 19:21:11,633 Node /10.179.64.227 is now part of the cluster
 INFO 19:21:11,635 InetAddress /10.179.64.227 is now DOWN
 INFO 19:21:11,706 Node /10.179.65.102 is now part of the cluster
 INFO 19:21:11,707 Handshaking version with cassandra-1/10.179.65.102
 INFO 19:21:11,743 InetAddress /10.179.65.102 is now UP
 INFO 19:21:12,639 InetAddress /10.179.65.102 is now DOWN
 INFO 19:21:12,644 Handshaking version with cassandra-1/10.179.65.102
 INFO 19:21:12,648 InetAddress /10.179.65.102 is now UP
 INFO 19:21:18,476 gossip settled after 0 extra polls; proceeding
 INFO 19:21:18,589 Starting listening for CQL clients on cassandra-3/10.179.111.137:9042...
 INFO 19:21:18,657 Using TFramedTransport with a max frame size of 15728640 bytes.
 INFO 19:21:18,660 Binding thrift service to cassandra-3/10.179.111.137:9160
 INFO 19:21:18,672 Using synchronous/threadpool thrift server on cassandra-3 : 9160
 INFO 19:21:18,673 Listening for thrift clients...
 INFO 19:22:02,853 Handshaking version with /10.179.64.227
 INFO 19:22:03,844 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,845 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,846 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,844 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,859 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,860 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,860 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,859 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,859 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,861 InetAddress /10.179.64.227 is now UP
 INFO 19:22:03,860 InetAddress /10.179.64.227 is now UP
{noformat}



> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Brandon Williams
>Assignee: Vijay
> Fix For: 2.0.5
>
> Attachments: 0001-CASSANDRA-6590.patch, 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background.  If a node is partitioned on startup when 
> the echo command is sent, but then the partition heals, the halves of the 
> partition will never mark each other up despite being able to communicate.  
> This stems from CASSANDRA-3533.





[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2014-01-24 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881321#comment-13881321
 ] 

Brandon Williams commented on CASSANDRA-6590:
-

Not sure what version this is against, but it needs a rebase for 2.0.



[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

2014-01-19 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876166#comment-13876166
 ] 

Brandon Williams commented on CASSANDRA-6590:
-

bq. Nit: do_firewall_check is true by default in the yaml but is false in 
config.

I did that on purpose, so someone who isn't diffing the yamls between minors 
doesn't run into any problems with it, and presumably no firewall problems are 
going to materialize during a rolling restart.
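
A small sketch of that split default, assuming the setting is resolved from a 
parsed key/value view of the yaml: the in-code default is false, so an older 
file without the key keeps the old behaviour, while a freshly shipped file sets 
it to true explicitly. The class FirewallCheckDefault and its method are 
hypothetical; only the option name do_firewall_check comes from this thread.

```java
import java.util.Map;

// Hypothetical resolver for the do_firewall_check setting discussed above.
final class FirewallCheckDefault {
    // In-code default, used when the yaml does not mention the key.
    static final boolean CODE_DEFAULT = false;

    // Returns the yaml value when present, else the in-code default.
    static boolean resolve(Map<String, String> yaml) {
        String raw = yaml.get("do_firewall_check");
        return raw == null ? CODE_DEFAULT : Boolean.parseBoolean(raw);
    }
}
```

With this shape, an upgrader who never diffs the yaml sees no change, and only 
installs that pick up the new yaml opt in to the check.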

I'll take a look at the patch when I have more time.
