Re: postmortem on 2.2.13 scale out difficulties
I posted a bug, CASSANDRA-15155: https://issues.apache.org/jira/browse/CASSANDRA-15155?jql=project%20%3D%20CASSANDRA

It seems VERY similar to https://issues.apache.org/jira/browse/CASSANDRA-6648
Re: postmortem on 2.2.13 scale out difficulties
And once the cluster token map formation is done, it starts bootstrap and we get a ton of these:

WARN [MessagingService-Incoming-/2406:da14:95b:4503:910e:23fd:dafa:9983] 2019-06-12 15:22:04,760 IncomingTcpConnection.java:100 - UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=df425400-c331-11e8-8b96-4b7f4d58af68

And then, after LOTS of those:

INFO [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING: Starting to bootstrap...
INFO [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for Bootstrap
INFO [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap completed! for the tokens [-7314981925085449175, ... bunch of tokens... 5499447097629838103]
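For anyone following along, these are the quick checks we're using to tell a real bootstrap from one of these instant "completions" (just a sketch; nothing here is cluster-specific):

```
# Run against the newly joined node. An empty netstats and a near-zero
# "Load" in status right after "Bootstrap completed!" means nothing
# actually streamed to it.
nodetool netstats
nodetool status
```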
Re: postmortem on 2.2.13 scale out difficulties
One node at a time: yes, that is what we are doing.

We have not tried changing streaming_socket_timeout_in_ms. It is currently 24 hours (```streaming_socket_timeout_in_ms=8640```), which would cover the bootstrap timeframe we have seen before (1-2 hours per node).

Since it joins with no data, it is serving erroneous data. We may try nodetool bootstrap resume and the JVM_OPTS option. The node appears to think it has bootstrapped even though gossipinfo shows the new node has a different schema version.

We had scaled EU and US from 5 --> 25 without incident (one at a time). Since we increased ring_delay_ms it has worked haphazardly, getting us four joins, and since then only failures.

The debug log shows:

DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 - New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 9200286188287490229
DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 - New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 950856676715905899
DEBUG [GossipStage:1] 2019-06-12 15:20:08,563 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
INFO [GossipStage:1] 2019-06-12 15:20:08,563 TokenMetadata.java:464 - Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
INFO [GossipStage:1] 2019-06-12 15:20:08,564 TokenMetadata.java:464 - Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
DEBUG [GossipStage:1] 2019-06-12 15:20:08,565 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
INFO [GossipStage:1] 2019-06-12 15:20:08,565 Gossiper.java:1027 - Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 is now part of the cluster
DEBUG [GossipStage:1] 2019-06-12 15:20:08,587 StorageService.java:1928 - Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 state NORMAL, token [-1028768087263234868, .., 921670352349030554]
DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 - New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token -1028768087263234868
DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 - New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token -1045740236536355596
DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 - New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token -1184422937682103096
DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 - New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token -1201924032068728250

All the nodes appear to be reporting "Not pulling schema because versions match or shouldPullSchemaFrom returned false". That code (MigrationManager.java) makes reference to a "gossip only" node; did we get stuck in that somehow?
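For reference, this is roughly how we're spotting the schema disagreement (a sketch; gossipinfo prints each endpoint's gossiped application state, including its SCHEMA uuid):

```
# More than one distinct SCHEMA uuid here means the new node is out of
# step with the rest of the ring.
nodetool gossipinfo | grep SCHEMA | sort | uniq -c

# describecluster groups nodes by schema version as well
nodetool describecluster
```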
RE: postmortem on 2.2.13 scale out difficulties
Adding one node at a time – is that successful?

Check the value of the streaming_socket_timeout_in_ms parameter in cassandra.yaml and increase it if needed.

Have you tried nodetool bootstrap resume and the JVM option, i.e. JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false" ?
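Concretely, something like the below (a sketch only; the timeout value is just an example, and the JVM_OPTS line would normally go in cassandra-env.sh):

```
# cassandra.yaml on the joining node -- streaming socket timeout in ms
# (the 24h value below is only an example)
#   streaming_socket_timeout_in_ms: 86400000

# resume a partially completed bootstrap instead of wiping and rejoining
nodetool bootstrap resume

# cassandra-env.sh -- relax the consistent range movement check
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"
```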
Re: postmortem on 2.2.13 scale out difficulties
We're getting

DEBUG [GossipStage:1] 2019-06-12 15:20:07,797 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false

multiple times as it contacts the nodes.
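To get a sense of how often that fires per node, something like the following (the log path is an assumption about a stock packaged install; adjust as needed):

```
# Count occurrences of the "not pulling schema" message in the debug log
grep -c 'Not pulling schema because versions match or shouldPullSchemaFrom returned false' \
    /var/log/cassandra/debug.log
```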
Re: postmortem on 2.2.13 scale out difficulties
We were only able to scale out four nodes before failures started occurring, including multiple instances of nodes joining the cluster without streaming.

Sigh.

On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller wrote:

> We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS, IPV6.
>
> We needed to scale out the asia datacenter, which was 5 nodes; europe and us were 25 nodes.
>
> We were running into bootstrapping issues where the new node failed to bootstrap/stream; it failed with
>
> "java.lang.RuntimeException: A node required to move the data consistently is down"
>
> ...even though they were all up based on nodetool status prior to adding the node.
>
> First we increased the phi_convict_threshold to 12, and that did not help.
>
> CASSANDRA-12281 appeared similar to what we had problems with, but I don't think we hit that. Somewhere in there someone wrote:
>
> "For us, the workaround is either deleting the data (then bootstrap again), or increasing the ring_delay_ms. And the larger the cluster is, the longer ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster, it requires ring_delay_ms to be >50seconds. For a 70 nodes cluster, >100seconds. Default is 30seconds."
>
> Given the WAN nature of our DCs, we set ring_delay_ms to 100 seconds and it finally worked.
>
> Side note:
>
> During the rolling restarts for setting phi_convict_threshold we observed quite a lot of status map variance between nodes (we have a program to poll a datacenter's or cluster's view of the gossipinfo and statuses). Given the phi_convict_threshold advice, AWS appears to have variance in its networking; I'm not sure if our difficulties were typical in that regard and/or if our IPV6 and/or globally distributed datacenters were exacerbating factors.
>
> We could not reproduce this in loadtest, although loadtest is only eu and us (but is IPV6).
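For reference, one way to apply the two knobs mentioned above (a sketch; the values are the ones we ended up with, the JVM_OPTS line would typically live in cassandra-env.sh, and ring_delay_ms can also be passed directly on the command line):

```
# cassandra.yaml -- raise the failure detector threshold (we used 12)
#   phi_convict_threshold: 12

# start the joining node with a longer ring delay, in ms (we used 100 seconds)
JVM_OPTS="$JVM_OPTS -Dcassandra.ring_delay_ms=100000"
```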