Hello Riccardo.

Going to a VPC, using GPFS and NTS all sound very reasonable to me. As you
said, that's another story. Good luck with it. Yet the current topology
should also work well, and I am wondering why the query does not find any
other replica available.

About your problem at hand:

It's unclear to me at this point if the nodes are becoming unresponsive. My
main thought on your first email was that you were facing an issue where,
due to the topology or to the client configuration, you were missing
replicas, but I cannot see what's wrong (except perhaps authentication, but
you don't use it).
Now I am thinking it might indeed be due to many nodes getting extremely
busy at the moment of the restart (of any of the nodes), because of this:

> After raising the compactors to 4 I still see some dropped messages for
> HINT and MUTATIONS. This happens during startup. The reason is "for
> internal timeout". Maybe too many compactors?


Some tuning information/hints:

* The number of *concurrent_compactors* should be between 1/4 and 1/2 of
the total number of cores, and generally no more than 8. It should ideally
never be equal to the number of CPU cores, as we want CPU capacity left to
process requests at any moment.
* Another common bottleneck is disk throughput. Compactions running too
fast can hurt as well. I would fix the number of concurrent_compactors as
mentioned above and act on the compaction throughput instead (see the first
example after this list).
* If hints are a problem, or rather, to make sure they are involved in the
issue you see, why not disable hints completely on all nodes and try a
restart? Anything that can be disabled is an optimization. You do not need
hinted handoff if you run a repair later on (or if you operate with strong
consistency and do not perform deletes, for example). You can give this a
try (see the second example after this list):
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L44-L46
* Less brutal: you can instead try slowing down the hint transfer speed:
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L57-L67
* Check for GC induced by the pressure of hint delivery, compactions and
the initial memory load at machine start. Is any GC activity reported in
the logs (GCInspector lines)?
* As you are using AWS, raising phi_convict_threshold to 10-12 could also
help to avoid marking the node down (if that is what happens); it is
included in the second example below.
* Do you see any specific part of the hardware being the bottleneck, or
being under especially heavy use, during a restart? Maybe use
'dstat -D <disk> -lvrn 10' (where <disk> is something like 'xvdq'). I
believe this command shows bytes, not bits, so '50M' is 50 MB, i.e. 400 Mb.
* What hardware are you using?
* Could you also run 'watch -n1 -d "nodetool tpstats"' during the node
restarts and see which thread pools are 'PENDING'? For instance, if the
flush writer (MemtableFlushWriter pool) is pending, the next write to that
table has to wait for the data to be flushed. It can be multiple things,
but having an interactive view of the pending tasks might lead you to the
root cause of the issue.
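
To make the compaction part concrete, here is a rough cassandra.yaml sketch
(example values only, a starting point for 4 vCPU machines with spinning
disks, not something I validated on your cluster):

    concurrent_compactors: 2              # 1/4 to 1/2 of your 4 cores, never all of them
    compaction_throughput_mb_per_sec: 16  # lower it if the disks saturate during restarts

The throughput can also be changed at runtime with
'nodetool setcompactionthroughput 16', which is handy for experimenting
without restarting the node.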
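
For the hints and failure detector part, a similar sketch (again, example
values to adapt; the idea is to first disable hints to confirm they are
involved, then throttle them instead of leaving them off):

    hinted_handoff_enabled: false          # temporary, only to confirm hints are the problem
    # or, to keep hints but slow their delivery down:
    # hinted_handoff_throttle_in_kb: 512   # default is 1024
    # max_hints_delivery_threads: 1        # default is 2
    phi_convict_threshold: 12              # default is 8, 10-12 is common on AWS

Hints can also be toggled at runtime with 'nodetool disablehandoff' /
'nodetool enablehandoff', and the throttle adjusted with
'nodetool sethintedhandoffthrottlekb 512', which makes the test easy to run
and revert.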

C*heers,
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Thu, Sep 13, 2018 at 09:50, Riccardo Ferrari <ferra...@gmail.com>
wrote:

> Hi Shalom,
>
> It happens at almost every restart, either a single node or a rolling one.
> I do agree with you that it is good, at least on my setup, to wait a few
> minutes to let the rebooted node cool down before moving to the next.
> The more I look at it, the more I think it is something coming from hint
> dispatching; maybe I should try something around hint throttling.
>
> Thanks!
>
> On Thu, Sep 13, 2018 at 8:55 AM, shalom sagges <shalomsag...@gmail.com>
> wrote:
>
>> Hi Riccardo,
>>
>> Does this issue occur when performing a single restart or after several
>> restarts during a rolling restart (as mentioned in your original post)?
>> We have a cluster where, when performing a rolling restart, we prefer to
>> wait ~10-15 minutes between each restart because we see an increase in GC
>> for a few minutes.
>> If we keep restarting the nodes quickly one after the other, the
>> applications experience timeouts (probably due to GC and hints).
>>
>> Hope this helps!
>>
>> On Thu, Sep 13, 2018 at 2:20 AM Riccardo Ferrari <ferra...@gmail.com>
>> wrote:
>>
>>> A little update on the progress.
>>>
>>> First:
>>> Thank you Thomas. I checked the code in the patch and briefly skimmed
>>> through the 3.0.6 code. Yup it should be fixed.
>>> Thank you Surbhi. At the moment we don't need authentication as the
>>> instances are locked down.
>>>
>>> Now:
>>> - Unfortunately the start_native_transport trick does not always work.
>>> On some nodes it works, on others it doesn't. What do I mean? I still
>>> experience timeouts and dropped messages during startup.
>>> - I realized that cutting concurrent_compactors to 1 was not really a
>>> good idea; the minimum value should be 2, currently testing 4 (that is
>>> min(n_cores, n_disks)).
>>> - After raising the compactors to 4 I still see some dropped messages
>>> for HINT and MUTATIONS. This happens during startup. The reason is "for
>>> internal timeout". Maybe too many compactors?
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta <surbhi.gupt...@gmail.com>
>>> wrote:
>>>
>>>> Another thing to notice is:
>>>>
>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>> 'replication_factor': '1'}
>>>>
>>>> system_auth has a replication factor of 1 and even if one node is down
>>>> it may impact the system because of the replication factor.
>>>>
>>>>
>>>>
>>>> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
>>>> thomas.steinmau...@dynatrace.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I remember something about a client using the native protocol getting
>>>>> notified too early that Cassandra is ready, due to the following issue:
>>>>>
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>>>>
>>>>>
>>>>>
>>>>> which looks similar, but the above was marked as fixed in 2.2.
>>>>>
>>>>>
>>>>>
>>>>> Thomas
>>>>>
>>>>>
>>>>>
>>>>> *From:* Riccardo Ferrari <ferra...@gmail.com>
>>>>> *Sent:* Mittwoch, 12. September 2018 18:25
>>>>> *To:* user@cassandra.apache.org
>>>>> *Subject:* Re: Read timeouts when performing rolling restart
>>>>>
>>>>>
>>>>>
>>>>> Hi Alain,
>>>>>
>>>>>
>>>>>
>>>>> Thank you for chiming in!
>>>>>
>>>>>
>>>>>
>>>>> I was thinking to perform the 'start_native_transport=false' test as
>>>>> well, and indeed the issue does not show up. Starting a node with
>>>>> native transport disabled and letting it cool down leads to no timeout
>>>>> exceptions and no dropped messages, simply a crystal clean startup.
>>>>> Agreed, it is a workaround.
>>>>>
>>>>>
>>>>>
>>>>> # About upgrading:
>>>>>
>>>>> Yes, I desperately want to upgrade despite it being a long and slow
>>>>> task. Just reviewing all the changes from 3.0.6 to 3.0.17 is going to
>>>>> be a huge pain. Off the top of your head, is there any breaking change
>>>>> I should absolutely take care of reviewing?
>>>>>
>>>>>
>>>>>
>>>>> # describecluster output: YES they agree on the same schema version
>>>>>
>>>>>
>>>>>
>>>>> # keyspaces:
>>>>>
>>>>> system WITH replication = {'class': 'LocalStrategy'}
>>>>>
>>>>> system_schema WITH replication = {'class': 'LocalStrategy'}
>>>>>
>>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '1'}
>>>>>
>>>>> system_distributed WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '3'}
>>>>>
>>>>> system_traces WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '2'}
>>>>>
>>>>>
>>>>>
>>>>> <custom1> WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '3'}
>>>>>
>>>>> <custom2>  WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '3'}
>>>>>
>>>>>
>>>>>
>>>>> # Snitch
>>>>>
>>>>> Ec2Snitch
>>>>>
>>>>>
>>>>>
>>>>> ## About Snitch and replication:
>>>>>
>>>>> - We have the default DC and all nodes are in the same RACK
>>>>>
>>>>> - We are planning to move to GossipingPropertyFileSnitch, configuring
>>>>> cassandra-rackdc.properties accordingly.
>>>>>
>>>>> -- This should be a transparent change, correct?
>>>>>
>>>>>
>>>>>
>>>>> - Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy'
>>>>> with 'us-xxxx' DC and replica counts as before
>>>>>
>>>>> - Then adding a new DC inside the VPC, but this is another story...
>>>>>
>>>>>
>>>>>
>>>>> Any concerns here ?
>>>>>
>>>>>
>>>>>
>>>>> # nodetool status <ks>
>>>>>
>>>>> --  Address         Load       Tokens       Owns (effective)  Host
>>>>> ID                               Rack
>>>>> UN  10.x.x.a  177 GB     256          50.3%
>>>>> d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
>>>>> UN  10.x.x.b    152.46 GB  256          51.8%
>>>>> 7888c077-346b-4e09-96b0-9f6376b8594f  rr
>>>>> UN  10.x.x.c   159.59 GB  256          49.0%
>>>>> 329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
>>>>> UN  10.x.x.d  162.44 GB  256          49.3%
>>>>> 07038c11-d200-46a0-9f6a-6e2465580fb1  rr
>>>>> UN  10.x.x.e    174.9 GB   256          50.5%
>>>>> c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
>>>>> UN  10.x.x.f  194.71 GB  256          49.2%
>>>>> f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr
>>>>>
>>>>>
>>>>>
>>>>> # gossipinfo
>>>>>
>>>>> /10.x.x.a
>>>>>   STATUS:827:NORMAL,-1350078789194251746
>>>>>   LOAD:289986:1.90078037902E11
>>>>>   SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>   DC:6:<some-ec2-dc>
>>>>>   RACK:8:rr
>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>   SEVERITY:290040:0.5934718251228333
>>>>>   NET_VERSION:1:10
>>>>>   HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
>>>>>   RPC_READY:868:true
>>>>>   TOKENS:826:<hidden>
>>>>> /10.x.x.b
>>>>>   STATUS:16:NORMAL,-1023229528754013265
>>>>>   LOAD:7113:1.63730480619E11
>>>>>   SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>   DC:6:<some-ec2-dc>
>>>>>   RACK:8:rr
>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>   SEVERITY:7274:0.5988024473190308
>>>>>   NET_VERSION:1:10
>>>>>   HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
>>>>>   TOKENS:15:<hidden>
>>>>> /10.x.x.c
>>>>>   STATUS:732:NORMAL,-1117172759238888547
>>>>>   LOAD:245839:1.71409806942E11
>>>>>   SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>   DC:6:<some-ec2-dc>
>>>>>   RACK:8:rr
>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>   SEVERITY:245989:0.0
>>>>>   NET_VERSION:1:10
>>>>>   HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
>>>>>   RPC_READY:763:true
>>>>>   TOKENS:731:<hidden>
>>>>> /10.x.x.d
>>>>>   STATUS:14:NORMAL,-1004942496246544417
>>>>>   LOAD:313125:1.74447964917E11
>>>>>   SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>   DC:6:<some-ec2-dc>
>>>>>   RACK:8:rr
>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>   SEVERITY:313215:0.25641027092933655
>>>>>   NET_VERSION:1:10
>>>>>   HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
>>>>>   RPC_READY:56:true
>>>>>   TOKENS:13:<hidden>
>>>>> /10.x.x.e
>>>>>   STATUS:520:NORMAL,-1058809960483771749
>>>>>   LOAD:276118:1.87831573032E11
>>>>>   SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>   DC:6:<some-ec2-dc>
>>>>>   RACK:8:rr
>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>   SEVERITY:276217:0.32786884903907776
>>>>>   NET_VERSION:1:10
>>>>>   HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
>>>>>   RPC_READY:550:true
>>>>>   TOKENS:519:<hidden>
>>>>> /10.x.x.f
>>>>>   STATUS:1081:NORMAL,-1039671799603495012
>>>>>   LOAD:239114:2.09082017545E11
>>>>>   SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>   DC:6:<some-ec2-dc>
>>>>>   RACK:8:rr
>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>   SEVERITY:239180:0.5665722489356995
>>>>>   NET_VERSION:1:10
>>>>>   HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
>>>>>   RPC_READY:1118:true
>>>>>   TOKENS:1080:<hidden>
>>>>>
>>>>>
>>>>>
>>>>> ## About load and tokens:
>>>>>
>>>>> - While load is pretty even, this does not apply to tokens; I guess we
>>>>> have some table with uneven distribution. This should not be the case
>>>>> for high-load tables, as partition keys are built with some 'id + <some
>>>>> time format>'.
>>>>>
>>>>> - I was not able to find any documentation about the numbers printed
>>>>> next to LOAD, SCHEMA, SEVERITY, RPC_READY... Is there any doc around?
>>>>>
>>>>>
>>>>>
>>>>> # Tombstones
>>>>>
>>>>> No ERRORs, only WARNs about a very specific table that we are aware of.
>>>>> It is an append-only table read by Spark from a batch job. (I guess it
>>>>> is a read_repair_chance or DTCS misconfiguration.)
>>>>>
>>>>>
>>>>>
>>>>> ## Closing note!
>>>>>
>>>>> We are on m1.xlarge instances (4 vCPUs, RAID0 stripe over the 4
>>>>> spinning drives), with some changes to cassandra.yaml:
>>>>>
>>>>>
>>>>>
>>>>> - dynamic_snitch: false
>>>>>
>>>>> - concurrent_reads: 48
>>>>>
>>>>> - concurrent_compactors: 1 (was 2)
>>>>>
>>>>> - disk_optimization_strategy: spinning
>>>>>
>>>>>
>>>>>
>>>>> I have some concerns about the number of concurrent_compactors, what
>>>>> do you think?
>>>>>
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>
>>>
>
