Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
I posted a bug, CASSANDRA-15155:
https://issues.apache.org/jira/browse/CASSANDRA-15155?jql=project%20%3D%20CASSANDRA

It seems VERY similar to
https://issues.apache.org/jira/browse/CASSANDRA-6648

On Wed, Jun 12, 2019 at 12:14 PM Carl Mueller 
wrote:

> And once the cluster token map formation is done, it starts bootstrap and
> we get a ton of these:
>
> WARN  [MessagingService-Incoming-/2406:da14:95b:4503:910e:23fd:dafa:9983]
> 2019-06-12 15:22:04,760 IncomingTcpConnection.java:100 -
> UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
> cfId=df425400-c331-11e8-8b96-4b7f4d58af68
>
> And then after LOTS of those
>
> INFO  [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING:
> Starting to bootstrap...
> INFO  [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream
> #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for
> Bootstrap
> INFO  [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap
> completed! for the tokens [-7314981925085449175, ... bunch of tokens...
> 5499447097629838103]
>
>
> On Wed, Jun 12, 2019 at 12:07 PM Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> One node at a time: yes that is what we are doing
>>
>> We have not tried changing streaming_socket_timeout_in_ms. It is currently 24
>> hours (```streaming_socket_timeout_in_ms=86400000```), which would already cover
>> the bootstrap timeframe we have seen before (1-2 hours per node).
>>
>> Since it joins with no data, it is serving erroneous data. We may try
>> nodetool bootstrap resume and the JVM_OPTS flag. The node appears to think it
>> has bootstrapped even though gossipinfo shows the new node has a different
>> schema version.
>>
>> We had scaled EU and US from 5 --> 25 without incident (one at a time).
>> Since then, increasing ring_delay_ms worked haphazardly, getting us four
>> joins, and after that, failures.
>>
>> The debug log shows:
>>
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
>> New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 9200286188287490229
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
>> New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 950856676715905899
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,563 MigrationManager.java:96 -
>> Not pulling schema because versions match or shouldPullSchemaFrom returned
>> false
>> INFO  [GossipStage:1] 2019-06-12 15:20:08,563 TokenMetadata.java:464 -
>> Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
>> INFO  [GossipStage:1] 2019-06-12 15:20:08,564 TokenMetadata.java:464 -
>> Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,565 MigrationManager.java:96 -
>> Not pulling schema because versions match or shouldPullSchemaFrom returned
>> false
>> INFO  [GossipStage:1] 2019-06-12 15:20:08,565 Gossiper.java:1027 - Node
>> /2600:1f18:4b4:5903:64af:955e:b65:8d83 is now part of the cluster
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,587 StorageService.java:1928 -
>> Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 state NORMAL, token
>> [-1028768087263234868, .., 921670352349030554]
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
>> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
>> -1028768087263234868
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
>> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
>> -1045740236536355596
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
>> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
>> -1184422937682103096
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
>> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
>> -1201924032068728250
>>
>> All the nodes appear to be reporting "Not pulling schema because versions
>> match or shouldPullSchemaFrom returned false". That code
>> (MigrationManager.java) makes reference to a "gossip only" node; did we get
>> stuck in that somehow?
>>
>> On Wed, Jun 12, 2019 at 11:45 AM ZAIDI, ASAD A  wrote:
>>
>>>
>>>
>>>
>>>
>>> Adding one node at a time – is that successful?
>>>
>>> Check value of streaming_socket_timeout_in_ms parameter in
>>> cassandra.yaml and increase if needed.
>>>
>>> Have you tried Nodetool bootstrap resume & jvm option i.e.
>>> JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"  ?

Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
And once the cluster token map formation is done, it starts bootstrap and
we get a ton of these:

WARN  [MessagingService-Incoming-/2406:da14:95b:4503:910e:23fd:dafa:9983]
2019-06-12 15:22:04,760 IncomingTcpConnection.java:100 -
UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
cfId=df425400-c331-11e8-8b96-4b7f4d58af68

And then after LOTS of those

INFO  [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING:
Starting to bootstrap...
INFO  [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream
#05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for
Bootstrap
INFO  [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap
completed! for the tokens [-7314981925085449175, ... bunch of tokens...
5499447097629838103]
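
For anyone trying to trace that warning, a minimal, hedged sketch of how to map
the orphaned cfId back to a table name, assuming the 2.2-era
system.schema_columnfamilies table and with $NODE_IP as a placeholder:

```
# Look up which table (if any) owns the cfId from the warning; run against an
# established node first, then against the node logging the warning, to compare.
cqlsh $NODE_IP -e "SELECT keyspace_name, columnfamily_name, cf_id FROM system.schema_columnfamilies;" \
  | grep df425400-c331-11e8-8b96-4b7f4d58af68
```

If the cfId resolves on the established nodes but not on the node logging the
warning, that node is working from a stale or partial schema.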


On Wed, Jun 12, 2019 at 12:07 PM Carl Mueller 
wrote:

> One node at a time: yes that is what we are doing
>
> We have not tried changing streaming_socket_timeout_in_ms. It is currently 24
> hours (```streaming_socket_timeout_in_ms=86400000```), which would already cover
> the bootstrap timeframe we have seen before (1-2 hours per node).
>
> Since it joins with no data, it is serving erroneous data. We may try
> nodetool bootstrap resume and the JVM_OPTS flag. The node appears to think it
> has bootstrapped even though gossipinfo shows the new node has a different
> schema version.
>
> We had scaled EU and US from 5 --> 25 without incident (one at a time).
> Since then, increasing ring_delay_ms worked haphazardly, getting us four
> joins, and after that, failures.
>
> The debug log shows:
>
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
> New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 9200286188287490229
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
> New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 950856676715905899
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,563 MigrationManager.java:96 -
> Not pulling schema because versions match or shouldPullSchemaFrom returned
> false
> INFO  [GossipStage:1] 2019-06-12 15:20:08,563 TokenMetadata.java:464 -
> Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
> INFO  [GossipStage:1] 2019-06-12 15:20:08,564 TokenMetadata.java:464 -
> Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,565 MigrationManager.java:96 -
> Not pulling schema because versions match or shouldPullSchemaFrom returned
> false
> INFO  [GossipStage:1] 2019-06-12 15:20:08,565 Gossiper.java:1027 - Node
> /2600:1f18:4b4:5903:64af:955e:b65:8d83 is now part of the cluster
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,587 StorageService.java:1928 -
> Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 state NORMAL, token
> [-1028768087263234868, .., 921670352349030554]
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
> -1028768087263234868
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
> -1045740236536355596
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
> -1184422937682103096
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
> -1201924032068728250
>
> All the nodes appear to be reporting "Not pulling schema because versions
> match or shouldPullSchemaFrom returned false". That code
> (MigrationManager.java) makes reference to a "gossip only" node; did we get
> stuck in that somehow?
>
> On Wed, Jun 12, 2019 at 11:45 AM ZAIDI, ASAD A  wrote:
>
>>
>>
>>
>>
>> Adding one node at a time – is that successful?
>>
>> Check value of streaming_socket_timeout_in_ms parameter in cassandra.yaml
>> and increase if needed.
>>
>> Have you tried Nodetool bootstrap resume & jvm option i.e.
>> JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"  ?
>>
>>
>>
>>
>>
>> *From:* Carl Mueller [mailto:carl.muel...@smartthings.com.INVALID]
>> *Sent:* Wednesday, June 12, 2019 11:35 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: postmortem on 2.2.13 scale out difficulties
>>
>>
>>
>> We only were able to scale out four nodes and then failures started
>> occurring, including multiple instances of nodes joining a cluster without
>> streaming.
>>
>>
>>
>> Sigh.
>>
>>
>>
>> On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller <
>> carl.muel

Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
One node at a time: yes that is what we are doing

We have not tried changing streaming_socket_timeout_in_ms. It is currently 24
hours (```streaming_socket_timeout_in_ms=86400000```), which would already cover
the bootstrap timeframe we have seen before (1-2 hours per node).

Since it joins with no data, it is serving erroneous data. We may try
nodetool bootstrap resume and the JVM_OPTS flag. The node appears to think it
has bootstrapped even though gossipinfo shows the new node has a different
schema version.
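
A hedged sketch of the checks we use for that (standard nodetool commands; the
host placeholders are ours):

```
# Schema version the new node is gossiping vs. what the rest of the DC reports.
nodetool -h $NEW_NODE gossipinfo | grep SCHEMA
nodetool -h $SEED_NODE describecluster   # lists schema versions and which nodes hold each
```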

We had scaled EU and US from 5 --> 25 without incident (one at a time).
Since then, increasing ring_delay_ms worked haphazardly, getting us four
joins, and after that, failures.

The debug log shows:

DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 9200286188287490229
DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 950856676715905899
DEBUG [GossipStage:1] 2019-06-12 15:20:08,563 MigrationManager.java:96 -
Not pulling schema because versions match or shouldPullSchemaFrom returned
false
INFO  [GossipStage:1] 2019-06-12 15:20:08,563 TokenMetadata.java:464 -
Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
INFO  [GossipStage:1] 2019-06-12 15:20:08,564 TokenMetadata.java:464 -
Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
DEBUG [GossipStage:1] 2019-06-12 15:20:08,565 MigrationManager.java:96 -
Not pulling schema because versions match or shouldPullSchemaFrom returned
false
INFO  [GossipStage:1] 2019-06-12 15:20:08,565 Gossiper.java:1027 - Node
/2600:1f18:4b4:5903:64af:955e:b65:8d83 is now part of the cluster
DEBUG [GossipStage:1] 2019-06-12 15:20:08,587 StorageService.java:1928 -
Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 state NORMAL, token
[-1028768087263234868, .., 921670352349030554]
DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
-1028768087263234868
DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
-1045740236536355596
DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
-1184422937682103096
DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
-1201924032068728250

All the nodes appear to be reporting "Not pulling schema because versions
match or shouldPullSchemaFrom returned false". That code
(MigrationManager.java) makes reference to a "gossip only" node; did we get
stuck in that somehow?

On Wed, Jun 12, 2019 at 11:45 AM ZAIDI, ASAD A  wrote:

>
>
>
>
> Adding one node at a time – is that successful?
>
> Check value of streaming_socket_timeout_in_ms parameter in cassandra.yaml
> and increase if needed.
>
> Have you tried Nodetool bootstrap resume & jvm option i.e.
> JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"  ?
>
>
>
>
>
> *From:* Carl Mueller [mailto:carl.muel...@smartthings.com.INVALID]
> *Sent:* Wednesday, June 12, 2019 11:35 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: postmortem on 2.2.13 scale out difficulties
>
>
>
> We only were able to scale out four nodes and then failures started
> occurring, including multiple instances of nodes joining a cluster without
> streaming.
>
>
>
> Sigh.
>
>
>
> On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller 
> wrote:
>
> We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS,
> IPV6
>
> Needed to scale out the asia datacenter, which was 5 nodes, europe and us
> were 25 nodes
>
> We were running into bootstrapping issues where the new node failed to
> bootstrap/stream, it failed with
>
>
>
> "java.lang.RuntimeException: A node required to move the data consistently
> is down"
>
>
>
> ...even though they were all up based on nodetool status prior to adding
> the node.
>
> First we increased the phi_convict_threshold to 12, and that did not help.
>
> CASSANDRA-12281 appeared similar to what we had problems with, but I don't
> think we hit that. Somewhere in there someone wrote
>
>
>
> "For us, the workaround is either deleting the data (then bootstrap
> again), or increasing the ring_delay_ms. And the larger the cluster is, the
> longer ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster,
> it requires ring_delay_ms to be >50seconds. For a 70 nodes cluster,
> >100seconds. Default is 30seconds."
>
> Given the WAN nature of our DCs, we set ring_delay_ms to 100 seconds and
> it finally worked.
>
> side note:
>
> During the rolling restarts for setti

RE: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread ZAIDI, ASAD A


Adding one node at a time – is that successful?
Check value of streaming_socket_timeout_in_ms parameter in cassandra.yaml and 
increase if needed.
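
For example (a hedged sketch; the config path is illustrative, and 86400000 ms
corresponds to 24 hours):

```
grep streaming_socket_timeout_in_ms /etc/cassandra/cassandra.yaml
# streaming_socket_timeout_in_ms: 86400000
```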
Have you tried Nodetool bootstrap resume & jvm option i.e. JVM_OPTS="$JVM_OPTS 
-Dcassandra.consistent.rangemovement=false"  ?
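
A hedged sketch of both suggestions, assuming a 2.2 node and the usual
cassandra-env.sh:

```
# Retry streaming for a bootstrap that started but did not finish.
nodetool bootstrap resume

# Or, in cassandra-env.sh before restarting the joining node, relax the
# consistent range-movement check:
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"
```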


From: Carl Mueller [mailto:carl.muel...@smartthings.com.INVALID]
Sent: Wednesday, June 12, 2019 11:35 AM
To: user@cassandra.apache.org
Subject: Re: postmortem on 2.2.13 scale out difficulties

We only were able to scale out four nodes and then failures started occurring, 
including multiple instances of nodes joining a cluster without streaming.

Sigh.

On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller 
mailto:carl.muel...@smartthings.com>> wrote:
We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS, IPV6

Needed to scale out the asia datacenter, which was 5 nodes, europe and us were 
25 nodes

We were running into bootstrapping issues where the new node failed to 
bootstrap/stream, it failed with

"java.lang.RuntimeException: A node required to move the data consistently is 
down"

...even though they were all up based on nodetool status prior to adding the 
node.

First we increased the phi_convict_threshold to 12, and that did not help.
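
(For reference, a hedged example of that change; the yaml path is illustrative
and 8 is the usual default:)

```
grep phi_convict_threshold /etc/cassandra/cassandra.yaml
# phi_convict_threshold: 12   # raised from the default of 8; needs a rolling restart
```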

CASSANDRA-12281 appeared similar to what we had problems with, but I don't 
think we hit that. Somewhere in there someone wrote

"For us, the workaround is either deleting the data (then bootstrap again), or 
increasing the ring_delay_ms. And the larger the cluster is, the longer 
ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster, it 
requires ring_delay_ms to be >50seconds. For a 70 nodes cluster, >100seconds. 
Default is 30seconds."
Given the WAN nature of our DCs, we set ring_delay_ms to 100 seconds and it
finally worked.
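
(A hedged sketch of how that can be passed; ring_delay_ms is a startup system
property rather than a cassandra.yaml setting, and the cassandra-env.sh location
is illustrative:)

```
# On the joining node, before starting it (default is 30000 ms):
JVM_OPTS="$JVM_OPTS -Dcassandra.ring_delay_ms=100000"
```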

side note:

During the rolling restarts for setting phi_convict_threshold we observed quite
a lot of status map variance between nodes (we have a program to poll all of a
datacenter's or cluster's view of the gossipinfo and statuses). AWS appears to
have variance in networking, based on the phi_convict_threshold advice; I'm not
sure if our difficulties were typical in that regard and/or if our IPV6 and/or
globally distributed datacenters were exacerbating factors.
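
(A hedged sketch of the kind of per-node polling that program does; the host
list and greps are illustrative, not our actual tool:)

```
# Compare each node's view of cluster status and schema with everyone else's.
for h in $(cat dc_hosts.txt); do
  echo "== $h =="
  nodetool -h "$h" status     | grep -E '^[UD][NJLM]'   # that node's view of up/down/joining
  nodetool -h "$h" gossipinfo | grep -E 'STATUS|SCHEMA'
done
```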

We could not reproduce this in loadtest, although loadtest is only eu and us 
(but is IPV6)


Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
We're getting

DEBUG [GossipStage:1] 2019-06-12 15:20:07,797 MigrationManager.java:96 -
Not pulling schema because versions match or shouldPullSchemaFrom returned
false

multiple times, as it contacts the nodes.

On Wed, Jun 12, 2019 at 11:35 AM Carl Mueller 
wrote:

> We only were able to scale out four nodes and then failures started
> occurring, including multiple instances of nodes joining a cluster without
> streaming.
>
> Sigh.
>
> On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller 
> wrote:
>
>> We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS,
>> IPV6
>>
>> Needed to scale out the asia datacenter, which was 5 nodes, europe and us
>> were 25 nodes
>>
>> We were running into bootstrapping issues where the new node failed to
>> bootstrap/stream, it failed with
>>
>> "java.lang.RuntimeException: A node required to move the data
>> consistently is down"
>>
>> ...even though they were all up based on nodetool status prior to adding
>> the node.
>>
>> First we increased the phi_convict_threshold to 12, and that did not
>> help.
>>
>> CASSANDRA-12281 appeared similar to what we had problems with, but I
>> don't think we hit that. Somewhere in there someone wrote
>>
>> "For us, the workaround is either deleting the data (then bootstrap
>> again), or increasing the ring_delay_ms. And the larger the cluster is, the
>> longer ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster,
>> it requires ring_delay_ms to be >50seconds. For a 70 nodes cluster,
>> >100seconds. Default is 30seconds."
>>
>> Given the WAN nature of our DCs, we set ring_delay_ms to 100 seconds and
>> it finally worked.
>>
>> side note:
>>
>> During the rolling restarts for setting phi_convict_threshold we observed
>> quite a lot of status map variance between nodes (we have a program to poll
>> all of a datacenter's or cluster's view of the gossipinfo and statuses). AWS
>> appears to have variance in networking, based on the phi_convict_threshold
>> advice; I'm not sure if our difficulties were typical in that regard and/or
>> if our IPV6 and/or globally distributed datacenters were exacerbating
>> factors.
>>
>> We could not reproduce this in loadtest, although loadtest is only eu and
>> us (but is IPV6)
>>
>


Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
We only were able to scale out four nodes and then failures started
occurring, including multiple instances of nodes joining a cluster without
streaming.

Sigh.

On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller 
wrote:

> We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS,
> IPV6
>
> Needed to scale out the asia datacenter, which was 5 nodes, europe and us
> were 25 nodes
>
> We were running into bootstrapping issues where the new node failed to
> bootstrap/stream, it failed with
>
> "java.lang.RuntimeException: A node required to move the data consistently
> is down"
>
> ...even though they were all up based on nodetool status prior to adding
> the node.
>
> First we increased the phi_convict_threshold to 12, and that did not help.
>
> CASSANDRA-12281 appeared similar to what we had problems with, but I don't
> think we hit that. Somewhere in there someone wrote
>
> "For us, the workaround is either deleting the data (then bootstrap
> again), or increasing the ring_delay_ms. And the larger the cluster is, the
> longer ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster,
> it requires ring_delay_ms to be >50seconds. For a 70 nodes cluster,
> >100seconds. Default is 30seconds."
>
> Given the WAN nature of our DCs, we set ring_delay_ms to 100 seconds and
> it finally worked.
>
> side note:
>
> During the rolling restarts for setting phi_convict_threshold we observed
> quite a lot of status map variance between nodes (we have a program to poll
> all of a datacenter's or cluster's view of the gossipinfo and statuses). AWS
> appears to have variance in networking, based on the phi_convict_threshold
> advice; I'm not sure if our difficulties were typical in that regard and/or
> if our IPV6 and/or globally distributed datacenters were exacerbating
> factors.
>
> We could not reproduce this in loadtest, although loadtest is only eu and
> us (but is IPV6)
>