[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278454#comment-16278454 ]

Yap Sok Ann commented on CASSANDRA-6590:

We just saw this in a cluster running 2.1.19. Maybe the behavior as described is not fixed yet?

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.11
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch,
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
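The failure mode in the issue description can be illustrated with a toy model. This is only a sketch, not Cassandra's Gossiper: `ToyNode`, `on_gossip`, and the `retry_echo` switch are hypothetical names, and the model assumes that the echo is attempted once when a peer is first seen and that a peer is marked up only when the echo reply arrives.

```python
class ToyNode:
    """Toy model of echo-gated liveness; hypothetical, not Cassandra code."""

    def __init__(self, retry_echo):
        self.retry_echo = retry_echo  # hypothetical switch used only in this sketch
        self.alive = {}               # peer -> True once an echo reply was received
        self.echo_sent = set()        # peers for which an echo was already attempted

    def on_gossip(self, peer, reachable):
        """Called whenever gossip state for `peer` arrives by any route."""
        if self.alive.get(peer):
            return  # already marked up, nothing to do
        self.alive.setdefault(peer, False)
        if peer in self.echo_sent and not self.retry_echo:
            return  # the single echo attempt was lost and is never repeated
        self.echo_sent.add(peer)
        if reachable:  # the echo reply only comes back over a working link
            self.alive[peer] = True


def simulate(retry_echo):
    node = ToyNode(retry_echo)
    node.on_gossip("10.208.8.63", reachable=False)  # partitioned at startup
    node.on_gossip("10.208.8.63", reachable=True)   # partition has healed
    return node.alive["10.208.8.63"]
```

Under these assumptions, `simulate(False)` leaves the peer down even after the link recovers, while `simulate(True)` marks it up on the next gossip round.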
[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898166#comment-13898166 ]

Brandon Williams commented on CASSANDRA-6590:

Hmm, so I went to test v4 and this time the ring weirdness came back, so perhaps it's just intermittent, but here's what it looks like. There are three nodes, 10.208.8.123, 10.208.35.225, and 10.208.8.63. 123 is the seed, all nodes were started at the same time, and 63 is blocked from 225. The log from 123 looks normal:
{noformat}
INFO 19:04:26,900 Handshaking version with /10.208.8.63
INFO 19:04:26,913 Node /10.208.8.63 is now part of the cluster
INFO 19:04:26,923 Handshaking version with /10.208.8.63
INFO 19:04:26,963 Node bw-1/10.208.8.123 state jump to normal
INFO 19:04:27,004 Startup completed! Now serving reads.
INFO 19:04:27,076 Waiting for gossip to settle before accepting client requests...
INFO 19:04:27,091 Handshaking version with /10.208.35.225
INFO 19:04:27,097 Compacted 4 sstables to [/var/lib/cassandra/data/system/local/system-local-jb-5,]. 5,846 bytes to 5,684 (~97% of original) in 250ms = 0.021683MB/s. 4 total partitions merged to 1. Partition merge counts were {4:1, }
INFO 19:04:27,100 Node /10.208.35.225 is now part of the cluster
INFO 19:04:27,102 Handshaking version with /10.208.35.225
INFO 19:04:35,190 Starting listening for CQL clients on bw-1/10.208.8.123:9042...
INFO 19:04:35,252 Using TFramedTransport with a max frame size of 15728640 bytes.
INFO 19:04:35,253 Binding thrift service to bw-1/10.208.8.123:9160
INFO 19:04:35,262 Using synchronous/threadpool thrift server on bw-1 : 9160
INFO 19:04:35,262 Listening for thrift clients...
{noformat}
And it can see the other nodes in status, but can't tell their state:
{noformat}
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.208.8.123   40.9 KB   256     68.1%             fa02838d-c39b-4d44-90db-f21a359deb12  rack1
?N  10.208.8.63    40.85 KB  256     63.0%             90e71b90-9b41-4482-9521-71ba479c964e  rack1
?N  10.208.35.225  40.93 KB  256     68.9%             e2fe818d-5d6c-47f9-8015-4580254cb91f  rack1
{noformat}
The other two nodes can't even see anything but themselves:
{noformat}
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.208.35.225  40.93 KB  256     100.0%            e2fe818d-5d6c-47f9-8015-4580254cb91f  rack1

Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.208.8.63    40.85 KB  256     100.0%            90e71b90-9b41-4482-9521-71ba479c964e  rack1
{noformat}
Even though both are fully connected to the seed:
{noformat}
tcp  0  0  10.208.8.123:57215  10.208.8.63:7000     ESTABLISHED  16517/java
tcp  0  0  10.208.8.123:7000   10.208.8.63:37973    ESTABLISHED  16517/java
tcp  0  0  10.208.8.123:7000   10.208.35.225:41926  ESTABLISHED  16517/java
tcp  0  0  10.208.8.123:59308  10.208.35.225:7000   ESTABLISHED  16517/java
{noformat}
I'm not sure what's going on here yet.

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.6
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch,
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895835#comment-13895835 ]

Vijay commented on CASSANDRA-6590:

Hi Brandon,

I was not able to reproduce the above issue (below is the log after the network partition):
{code}
INFO [GossipTasks:1] 2014-02-09 05:29:10,259 Gossiper.java (line 862) InetAddress /17.198.227.155 is now DOWN
INFO [HANDSHAKE-/17.198.227.155] 2014-02-09 05:29:18,023 OutboundTcpConnection.java (line 386) Handshaking version with /17.198.227.155
INFO [RequestResponseStage:33] 2014-02-09 05:29:18,038 Gossiper.java (line 848) InetAddress /17.198.227.155 is now UP
{code}
{quote}
I think we'll need a separate yaml option
{quote}
Done.
{quote}
I'm not sure why the block in handleMajorStateChange moved
{quote}
I moved it because the message was wrong: UP doesn't happen until the echo completes. Anyway, I reverted that.

Rebased @ https://github.com/Vijay2win/cassandra/tree/6590-v4

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.6
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch,
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894980#comment-13894980 ]

Brandon Williams commented on CASSANDRA-6590:

bq. the if (!localState.isAlive()) check is problematic, because while it got rid of the repeated UP messages, it also seemed to introduce a race situation

Nevermind that part, it was something else. I'm still seeing repeated messages when healing the partition though:
{noformat}
INFO 20:16:56,176 Handshaking version with /10.208.8.63
INFO 20:16:56,186 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,187 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,187 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,190 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,190 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,190 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,191 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,191 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,191 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,193 InetAddress /10.208.8.63 is now UP
{noformat}
What I mentioned before about the block in handleMajorStateChange and the yaml option still applies.

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.6
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch,
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
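For anyone trying to quantify the repeated UP messages in their own logs, a small script along these lines may help. It is only a sketch: it assumes the log lines look like the excerpt in this comment, and `count_up_messages` is a hypothetical helper, not part of any Cassandra tooling.

```python
import re
from collections import Counter

# Matches lines such as "INFO 20:16:56,186 InetAddress /10.208.8.63 is now UP"
UP_LINE = re.compile(r"InetAddress (\S+) is now UP")


def count_up_messages(log_text):
    """Return how many 'is now UP' lines were logged for each address."""
    return Counter(UP_LINE.findall(log_text))


sample = """\
INFO 20:16:56,176 Handshaking version with /10.208.8.63
INFO 20:16:56,186 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,187 InetAddress /10.208.8.63 is now UP
INFO 20:16:56,187 InetAddress /10.208.8.63 is now UP
"""
counts = count_up_messages(sample)
for address, n in counts.items():
    print(address, n)
```

An address with a count well above one within a single heal event would correspond to the repetition shown in the excerpt.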
[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892699#comment-13892699 ]

Brandon Williams commented on CASSANDRA-6590:

I'm not sure why the block in handleMajorStateChange moved, but because the endpoint state is added before that, the check for it will never be null, so it always says the node restarted (and we should keep the 'UP' message there to keep it easy to look for) even though it's the first time it's been seen.

I think the if (!localState.isAlive()) check is problematic, because while it got rid of the repeated UP messages, it also seemed to introduce a race situation where sometimes some nodes would end up in a cluster by themselves. I briefly tried making Echo verbs droppable in CASSANDRA-6661 instead, but that didn't help, so I'm not sure why we're seemingly building these requests up, or if something else is making realMarkAlive fire so much.

Finally, I think we'll need a separate yaml option, since removing things in a minor is kind of mean to upgraders who don't catch it and their server won't start.

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.6
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch,
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
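The ordering problem described in the first paragraph of this comment can be shown in isolation. The following is a simplified illustration, not the patch or Cassandra's actual handleMajorStateChange: both function names and the returned strings are hypothetical, and a plain dict stands in for the endpoint state map.

```python
def handle_major_state_change_buggy(states, endpoint, new_state):
    """Stores the state first, so the 'seen before?' check can never fail."""
    states[endpoint] = new_state
    if states.get(endpoint) is not None:  # always true after the store above
        return "restarted"
    return "new"


def handle_major_state_change_fixed(states, endpoint, new_state):
    """Checks for an existing entry before storing the new state."""
    previously_known = endpoint in states
    states[endpoint] = new_state
    return "restarted" if previously_known else "new"


first_seen_buggy = handle_major_state_change_buggy({}, "/10.208.8.63", {"gen": 1})
first_seen_fixed = handle_major_state_change_fixed({}, "/10.208.8.63", {"gen": 1})
```

With an empty map, the first variant reports a restart for an endpoint that has never been seen, which matches the symptom described; the second reports it as new and only reports a restart when an entry already existed.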
[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889826#comment-13889826 ]

Vijay commented on CASSANDRA-6590:

Sorry was shooting a different message during the startup, fixed and pushed to https://github.com/Vijay2win/cassandra/tree/6590-v3. Thanks!

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.6
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch,
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885489#comment-13885489 ]

Brandon Williams commented on CASSANDRA-6590:

Hmm, this didn't actually change any of the logging for me.

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.5
>
> Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch,
> 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881339#comment-13881339 ]

Brandon Williams commented on CASSANDRA-6590:

Nevermind, the conflict was trivial. Patch works, but causes a flap during initial startup, and then repeats the UP message when the partition heals:
{noformat}
INFO 19:21:10,451 Node cassandra-3/10.179.111.137 state jump to normal
INFO 19:21:10,472 Startup completed! Now serving reads.
INFO 19:21:10,475 waiting for gossip to settle before accepting client requests...
INFO 19:21:10,660 Handshaking version with cassandra-1/10.179.65.102
INFO 19:21:11,633 Node /10.179.64.227 is now part of the cluster
INFO 19:21:11,635 InetAddress /10.179.64.227 is now DOWN
INFO 19:21:11,706 Node /10.179.65.102 is now part of the cluster
INFO 19:21:11,707 Handshaking version with cassandra-1/10.179.65.102
INFO 19:21:11,743 InetAddress /10.179.65.102 is now UP
INFO 19:21:12,639 InetAddress /10.179.65.102 is now DOWN
INFO 19:21:12,644 Handshaking version with cassandra-1/10.179.65.102
INFO 19:21:12,648 InetAddress /10.179.65.102 is now UP
INFO 19:21:18,476 gossip settled after 0 extra polls; proceeding
INFO 19:21:18,589 Starting listening for CQL clients on cassandra-3/10.179.111.137:9042...
INFO 19:21:18,657 Using TFramedTransport with a max frame size of 15728640 bytes.
INFO 19:21:18,660 Binding thrift service to cassandra-3/10.179.111.137:9160
INFO 19:21:18,672 Using synchronous/threadpool thrift server on cassandra-3 : 9160
INFO 19:21:18,673 Listening for thrift clients...
INFO 19:22:02,853 Handshaking version with /10.179.64.227
INFO 19:22:03,844 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,845 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,846 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,844 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,859 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,860 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,860 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,859 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,859 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,861 InetAddress /10.179.64.227 is now UP
INFO 19:22:03,860 InetAddress /10.179.64.227 is now UP
{noformat}

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.5
>
> Attachments: 0001-CASSANDRA-6590.patch, 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881321#comment-13881321 ]

Brandon Williams commented on CASSANDRA-6590:

Not sure what version this is against, but it needs a rebase for 2.0.

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.5
>
> Attachments: 0001-CASSANDRA-6590.patch, 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876166#comment-13876166 ]

Brandon Williams commented on CASSANDRA-6590:

bq. Nit: do_firewall_check is true by default in the yaml but is false in config.

I did that on purpose, so someone who isn't diffing the yamls between minors doesn't run into any problems with it, and presumably no firewall problems are going to materialize during a rolling restart.

I'll take a look at the patch when I have more time.

> Gossip does not heal after a temporary partition at startup
> ---
>
> Key: CASSANDRA-6590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Williams
> Assignee: Vijay
> Fix For: 2.0.5
>
> Attachments: 0001-CASSANDRA-6590.patch, 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background. If a node is partitioned on startup when
> the echo command is sent, but then the partition heals, the halves of the
> partition will never mark each other up despite being able to communicate.
> This stems from CASSANDRA-3533.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
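The reasoning behind the differing defaults can be sketched as follows. This is an illustration under assumptions, not how Cassandra actually loads its configuration: the `Config` class and `load_config` helper are hypothetical, and a plain dict stands in for the parsed yaml. Only the option name do_firewall_check and its two defaults come from the comment above.

```python
from dataclasses import dataclass, fields


@dataclass
class Config:
    # In-code default: applies when the key is absent from the user's yaml.
    do_firewall_check: bool = False


def load_config(yaml_values):
    """Build a Config from parsed yaml values, ignoring unknown keys."""
    known = {f.name for f in fields(Config)}
    return Config(**{k: v for k, v in yaml_values.items() if k in known})


shipped_yaml = {"do_firewall_check": True}  # the newly shipped yaml sets it explicitly
old_yaml = {}                               # an upgrader who kept the previous yaml

fresh_install = load_config(shipped_yaml)
upgraded_node = load_config(old_yaml)
```

In this sketch a fresh install gets the check enabled through the shipped yaml, while a node upgraded with its old yaml silently keeps the in-code default of false, so the server still starts and behaves as before.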