A little update on progress.

First:
Thank you, Thomas. I checked the code in the patch and briefly skimmed
through the 3.0.6 code. Yup, it should be fixed there.
Thank you, Surbhi. At the moment we don't need authentication, as the
instances are locked down.

Now:
- Unfortunately the start_native_transport trick does not always work: on
some nodes it works, on others it doesn't. What do I mean? I still
experience timeouts and dropped messages during startup.
- I realized that cutting concurrent_compactors to 1 was not really a good
idea; the minimum value should be 2, and I am currently testing 4 (that is
min(n_cores, n_disks)).
- After raising the compactors to 4 I still see some dropped messages for
HINT and MUTATION. This happens during startup, and the reason given is
"for internal timeout". Maybe too many compactors? (See the sketch below
for how I am watching this.)

Thanks!


On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta <surbhi.gupt...@gmail.com>
wrote:

> Another thing to notice is:
>
> system_auth WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '1'}
>
> system_auth has a replication factor of 1, so even a single node being down
> may impact the cluster.
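>
> If you enable authentication later, the usual fix is to raise that RF and
> then repair; a rough sketch (the RF value of 3 is just an example for your
> six-node DC):
>
>   # raise system_auth replication, then repair so the new replicas get the data
>   cqlsh -e "ALTER KEYSPACE system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'};"
>   nodetool repair system_auth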
>
>
>
> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> Hi,
>>
>>
>>
>> I remember something about a client using the native protocol being
>> notified too early that Cassandra is ready, due to the following issue:
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>
>>
>>
>> which looks similar, but the issue above was marked as fixed in 2.2.
>>
>>
>>
>> Thomas
>>
>>
>>
>> *From:* Riccardo Ferrari <ferra...@gmail.com>
>> *Sent:* Mittwoch, 12. September 2018 18:25
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Read timeouts when performing rolling restart
>>
>>
>>
>> Hi Alain,
>>
>>
>>
>> Thank you for chiming in!
>>
>>
>>
>> I was thinking of performing the 'start_native_transport=false' test as
>> well, and indeed the issue does not show up. Starting a node with native
>> transport disabled and letting it cool down leads to no timeout exceptions
>> and no dropped messages, simply a crystal-clean startup. Agreed, it is a
>> workaround.
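>>
>> The sequence I have in mind is roughly the following sketch (the startup
>> flag and nodetool enablebinary are standard; the service name and timing
>> are assumptions for our environment):
>>
>>   # start the node with the CQL transport disabled, either via
>>   # start_native_transport: false in cassandra.yaml or
>>   # -Dcassandra.start_native_transport=false on the JVM command line
>>   sudo service cassandra start      # service name is environment-specific
>>   # wait until hint replay and the compaction backlog have settled
>>   nodetool compactionstats
>>   nodetool tpstats
>>   # then re-open the node to clients
>>   nodetool enablebinary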
>>
>>
>>
>> # About upgrading:
>>
>> Yes, I desperately want to upgrade, despite it being a long and slow task.
>> Just reviewing all the changes from 3.0.6 to 3.0.17 is going to be a huge
>> pain; off the top of your head, is there any breaking change I should
>> absolutely take care of reviewing?
>>
>>
>>
>> # describecluster output: YES they agree on the same schema version
>>
>>
>>
>> # keyspaces:
>>
>> system WITH replication = {'class': 'LocalStrategy'}
>>
>> system_schema WITH replication = {'class': 'LocalStrategy'}
>>
>> system_auth WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '1'}
>>
>> system_distributed WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '3'}
>>
>> system_traces WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '2'}
>>
>>
>>
>> <custom1> WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '3'}
>>
>> <custom2>  WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '3'}
>>
>>
>>
>> # Snitch
>>
>> Ec2Snitch
>>
>>
>>
>> ## About Snitch and replication:
>>
>> - We have the default DC and all nodes are in the same RACK
>>
>> - We are planning to move to GossipingPropertyFileSnitch, configuring
>> cassandra-rackdc.properties accordingly.
>>
>> -- This should be a transparent change, correct?
>>
>>
>>
>> - Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy'
>> with 'us-xxxx' DC and replica counts as before
>>
>> - Then adding a new DC inside the VPC, but this is another story...
>>
>>
>>
>> Any concerns here?
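>>
>> What I have in mind, as a rough sketch (the DC/rack names, file path and
>> keyspace placeholder are assumptions):
>>
>>   # on every node: /etc/cassandra/cassandra-rackdc.properties
>>   #   dc=us-xxxx
>>   #   rack=rr
>>   # once all nodes run GossipingPropertyFileSnitch and agree on DC/rack,
>>   # switch user keyspaces to NetworkTopologyStrategy with the same counts
>>   cqlsh -e "ALTER KEYSPACE <custom1> WITH replication = {'class': 'NetworkTopologyStrategy', 'us-xxxx': 3};"
>>   nodetool repair <custom1>   # run per keyspace after the change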
>>
>>
>>
>> # nodetool status <ks>
>>
>> --  Address         Load       Tokens       Owns (effective)  Host
>> ID                               Rack
>> UN  10.x.x.a  177 GB     256          50.3%
>> d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
>> UN  10.x.x.b    152.46 GB  256          51.8%
>> 7888c077-346b-4e09-96b0-9f6376b8594f  rr
>> UN  10.x.x.c   159.59 GB  256          49.0%
>> 329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
>> UN  10.x.x.d  162.44 GB  256          49.3%
>> 07038c11-d200-46a0-9f6a-6e2465580fb1  rr
>> UN  10.x.x.e    174.9 GB   256          50.5%
>> c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
>> UN  10.x.x.f  194.71 GB  256          49.2%
>> f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr
>>
>>
>>
>> # gossipinfo
>>
>> /10.x.x.a
>>   STATUS:827:NORMAL,-1350078789194251746
>>   LOAD:289986:1.90078037902E11
>>   SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:290040:0.5934718251228333
>>   NET_VERSION:1:10
>>   HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
>>   RPC_READY:868:true
>>   TOKENS:826:<hidden>
>> /10.x.x.b
>>   STATUS:16:NORMAL,-1023229528754013265
>>   LOAD:7113:1.63730480619E11
>>   SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:7274:0.5988024473190308
>>   NET_VERSION:1:10
>>   HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
>>   TOKENS:15:<hidden>
>> /10.x.x.c
>>   STATUS:732:NORMAL,-1117172759238888547
>>   LOAD:245839:1.71409806942E11
>>   SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:245989:0.0
>>   NET_VERSION:1:10
>>   HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
>>   RPC_READY:763:true
>>   TOKENS:731:<hidden>
>> /10.x.x.d
>>   STATUS:14:NORMAL,-1004942496246544417
>>   LOAD:313125:1.74447964917E11
>>   SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:313215:0.25641027092933655
>>   NET_VERSION:1:10
>>   HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
>>   RPC_READY:56:true
>>   TOKENS:13:<hidden>
>> /10.x.x.e
>>   STATUS:520:NORMAL,-1058809960483771749
>>   LOAD:276118:1.87831573032E11
>>   SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:276217:0.32786884903907776
>>   NET_VERSION:1:10
>>   HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
>>   RPC_READY:550:true
>>   TOKENS:519:<hidden>
>> /10.x.x.f
>>   STATUS:1081:NORMAL,-1039671799603495012
>>   LOAD:239114:2.09082017545E11
>>   SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:239180:0.5665722489356995
>>   NET_VERSION:1:10
>>   HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
>>   RPC_READY:1118:true
>>   TOKENS:1080:<hidden>
>>
>>
>>
>> ## About load and tokens:
>>
>> - While load is pretty even, this does not apply to tokens; I guess we have
>> some table with uneven distribution. This should not be the case for
>> high-load tables, as partition keys are built with some 'id + <some time
>> format>'.
>>
>> - I was not able to find any documentation about the numbers printed next
>> to LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around?
>>
>>
>>
>> # Tombstones
>>
>> No ERRORs, only WARNs about a very specific table that we are aware of. It
>> is an append-only table read by Spark from a batch job. (I guess it is a
>> read_repair_chance or DTCS misconfiguration.)
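>>
>> A quick way to check that guess, sketched with placeholder keyspace/table
>> names:
>>
>>   # inspect the table's read repair and compaction settings
>>   cqlsh -e "SELECT read_repair_chance, dclocal_read_repair_chance, compaction FROM system_schema.tables WHERE keyspace_name='<ks>' AND table_name='<table>';"
>>   # for an append-only table read only by batch jobs, dropping read repair
>>   # is a common tuning (the zero values are an assumption, not a recommendation)
>>   cqlsh -e "ALTER TABLE <ks>.<table> WITH read_repair_chance = 0 AND dclocal_read_repair_chance = 0;"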
>>
>>
>>
>> ## Closing note!
>>
>> We are on very old m1.xlarge instances (4 vCPUs, RAID0 stripe over the 4
>> spinning drives), with some changes to cassandra.yaml:
>>
>>
>>
>> - dynamic_snitch: false
>>
>> - concurrent_reads: 48
>>
>> - concurrent_compactors: 1 (was 2)
>>
>> - disk_optimization_strategy: spinning
>>
>>
>>
>> I have some concerns about the number of concurrent_compactors; what do
>> you think?
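>>
>> For completeness, this is how I sanity-check those values on a node (the
>> config path is an assumption):
>>
>>   grep -E 'dynamic_snitch:|concurrent_reads:|concurrent_compactors:|disk_optimization_strategy:' /etc/cassandra/cassandra.yaml
>>   # pending compactions should drain after startup; if they keep growing,
>>   # a single compactor is probably too few
>>   nodetool compactionstats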
>>
>>
>>
>> Thanks!
>>
>>
>>
>
