Alright, coming back after doing some homework. Thank you, Alain!

# About hardware:
- m1.xlarge: 4 vCPUs, 15 GB RAM
- 4 spinning disks configured in RAID0

# About compactors:
- I've moved them back to 2 concurrent compactors. In general I don't see
more than ~80 pending compactions (during compaction times). This matches
what I see with 'watch nodetool tpstats' too.
- Compaction throughput is 16 MB/s
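
For reference, this is roughly what that looks like in my config (the
cassandra.yaml option names are the stock 3.0 ones; the nodetool call only
changes the throughput at runtime on a single node):

  # cassandra.yaml
  concurrent_compactors: 2
  compaction_throughput_mb_per_sec: 16

  # or, at runtime on one node:
  nodetool setcompactionthroughput 16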

# About hints:
When a node boots up I clearly see a spike in pending compactions (still
around 80-ish). During boot, when it starts receiving hints, the system load
grows to an unsustainable level (30+) and in the logs I get the message
"[HINTS|MUTATION] messages were dropped in the last 5000ms...".
Now, after tuning the compactors I still see some dropped messages (all of
them MUTATION or HINT). On some nodes it's as low as 0, on others as high as
32k. In particular, out of 6 nodes, one drops 32k messages, one 16k, and one
a few hundred, and somehow it's always the same nodes.
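
If hints do turn out to be the culprit, these are the knobs I'm thinking of
experimenting with (cassandra.yaml names from the stock 3.0 file; the halved
throttle is just an example value, not a recommendation):

  # cassandra.yaml
  hinted_handoff_throttle_in_kb: 512   # default is 1024
  max_hints_delivery_threads: 1        # default is 2

  # or at runtime, per node:
  nodetool sethintedhandoffthrottlekb 512
  nodetool disablehandoff              # rule hints out entirely, as suggested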

# About GC:
I have moved all my nodes to CMS: Xms and Xmx at 8G, Xmn at 4G. You already
helped with the JVM tuning. Although G1 was doing pretty well, CMS turned out
to be more consistent: under heavy load G1 can pause for longer than CMS. GC
pauses as seen on a couple of nodes are between 200 and 430 ms.
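
For completeness, these are roughly the JVM flags I ended up with (the CMS
flags are the stock ones; exactly where they live, cassandra-env.sh or
jvm.options, depends on the 3.0.x packaging):

  -Xms8G
  -Xmx8G
  -Xmn4G
  -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC
  -XX:+CMSParallelRemarkEnabled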


# Couple of notes:
I see some nodes with a higher system load (not more data) than others; in
particular, one node, despite holding 170+ GB of data, barely ever sees its
system load reach 2.0.
I recently adopted Reaper (thanks TLP!). Out of my 2 keyspaces, one is
repairing just fine, while the second, which is bigger/older, is simply not
progressing. Maybe this gives a hint on where to look...

Thanks!

On Fri, Sep 14, 2018 at 11:54 AM, Alain RODRIGUEZ <arodr...@gmail.com>
wrote:

> Hello Riccardo.
>
> Going to VPC, use GPFS and NTS all sounds very reasonable to me. As you
> said, that's another story. Good luck with this. Yet the current topology
> should also work well and I am wondering why the query does not find any
> other replica available.
>
> About your problem at hand:
>
> It's unclear to me at this point if the nodes are becoming unresponsive.
> My main thought on your first email was that you were facing some issue
> where, due to the topology or to the client configuration, you were missing
> replicas, but I cannot see what's wrong (if not authentication indeed, but
> you don't use it).
> Then I am thinking it might indeed be due to many nodes getting extremely
> busy at the moment of the restart (of any of the nodes), because of this:
>
> After raising the compactors to 4 I still see some dropped messages for
>> HINT and MUTATIONS. This happens during startup. The reason is "for internal
>> timeout". Maybe too many compactors?
>
>
> Some tuning information/hints:
>
> * The number of *concurrent_compactor* should be between 1/4 and 1/2 of
> the total number of cores and generally no more than 8. It should ideally
> never be equal to the number of CPU cores as we want power available to
> process requests at any moment.
> * Another common bottleneck is the disk throughput. If compactions are
> running too fast, it can harm as well. I would fix the number of
> concurrent_compactors as mentioned above and act on the compaction
> throughput instead
> * If hints are a problem, or rather, to confirm whether they are involved
> in the issue you see, why not disable hints completely on all nodes and
> try a restart? Hinted handoff is an optimization that can be disabled: you
> do not need it if you run a repair later on (or if you operate with strong
> consistency and do not perform deletes, for example). You can give this a
> try: https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L44-L46
> * Less brutally, you can try slowing down the hint transfer speed:
> https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L57-L67
> * Check for GC that would be induced by the pressure put by hints
> delivery, compactions and the first load of the memory on machine start.
> Any GC activity that would be shown in the logs?
> * As you are using AWS, tuning the phi_convict_threshold to 10-12 could
> help as well by not marking the node down (if that's what happens).
> * Do you see any specific part of the hardware being the bottleneck or
> being especially heavily used during a restart? Maybe use:
> 'dstat -D <disk> -lvrn 10' (where <disk> is like 'xvdq'). I believe this
> command shows Bytes, not bits, thus '50M' is 50 MB or 400 Mb.
> * What hardware are you using?
> * Could you also run a 'watch -n1 -d "nodetool tpstats"' during the node
> restarts as well and see which threads are 'PENDING' during the restart. For
> instance, if the flush_writer is pending, the next write to this table has
> to wait for the data to be flushed. It can be multiple things, but having
> an interactive view of the pending requests might lead you to the root
> cause of the issue.
>
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Thu, Sep 13, 2018 at 09:50, Riccardo Ferrari <ferra...@gmail.com>
> wrote:
>
>> Hi Shalom,
>>
>> It happens almost at every restart, either of a single node or a rolling
>> one. I do agree with you that it is good, at least on my setup, to wait a
>> few minutes to let the rebooted node cool down before moving to the next.
>> The more I look at it, the more I think it is something coming from hint
>> dispatching; maybe I should try something around hint throttling.
>>
>> Thanks!
>>
>> On Thu, Sep 13, 2018 at 8:55 AM, shalom sagges <shalomsag...@gmail.com>
>> wrote:
>>
>>> Hi Riccardo,
>>>
>>> Does this issue occur when performing a single restart or after several
>>> restarts during a rolling restart (as mentioned in your original post)?
>>> We have a cluster that when performing a rolling restart, we prefer to
>>> wait ~10-15 minutes between each restart because we see an increase of GC
>>> for a few minutes.
>>> If we keep restarting the nodes quickly one after the other, the
>>> applications experience timeouts (probably due to GC and hints).
>>>
>>> Hope this helps!
>>>
>>> On Thu, Sep 13, 2018 at 2:20 AM Riccardo Ferrari <ferra...@gmail.com>
>>> wrote:
>>>
>>>> A little update on the progress.
>>>>
>>>> First:
>>>> Thank you Thomas. I checked the code in the patch and briefly skimmed
>>>> through the 3.0.6 code. Yup it should be fixed.
>>>> Thank you Surbhi. At the moment we don't need authentication as the
>>>> instances are locked down.
>>>>
>>>> Now:
>>>> - Unfortunately the start_native_transport trick does not always work.
>>>> On some nodes it works, on others it doesn't. What do I mean? I still
>>>> experience timeouts and dropped messages during startup.
>>>> - I realized that cutting concurrent_compactors to 1 was not really a
>>>> good idea; the minimum value should be 2. Currently testing 4 (that is
>>>> min(n_cores, n_disks)).
>>>> - After raising the compactors to 4 I still see some dropped messages
>>>> for HINT and MUTATIONS. This happens during startup. The reason is "for
>>>> internal timeout". Maybe too many compactors?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta <surbhi.gupt...@gmail.com
>>>> > wrote:
>>>>
>>>>> Another thing to notice is :
>>>>>
>>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '1'}
>>>>>
>>>>> system_auth has a replication factor of 1 and even if one node is down
>>>>> it may impact the system because of the replication factor.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
>>>>> thomas.steinmau...@dynatrace.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I remember something about a client using the native protocol getting
>>>>>> notified too early that Cassandra is ready, due to the following issue:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>>>>>
>>>>>>
>>>>>>
>>>>>> which looks similar, but above was marked as fixed in 2.2.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Riccardo Ferrari <ferra...@gmail.com>
>>>>>> *Sent:* Wednesday, September 12, 2018 18:25
>>>>>> *To:* user@cassandra.apache.org
>>>>>> *Subject:* Re: Read timeouts when performing rolling restart
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Alain,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you for chiming in!
>>>>>>
>>>>>>
>>>>>>
>>>>>> I was thinking of performing the 'start_native_transport=false' test
>>>>>> as well, and indeed the issue does not show up. Starting a node with
>>>>>> native transport disabled and letting it cool down leads to no timeout
>>>>>> exceptions and no dropped messages, simply a crystal-clean startup.
>>>>>> Agreed, it is a workaround.
>>>>>>
>>>>>>
>>>>>>
>>>>>> # About upgrading:
>>>>>>
>>>>>> Yes, I desperately want to upgrade, despite it being a long and slow
>>>>>> task. Just reviewing all the changes from 3.0.6 to 3.0.17 is going to
>>>>>> be a huge pain. Off the top of your head, any breaking changes I
>>>>>> should absolutely take care of reviewing?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # describecluster output: YES they agree on the same schema version
>>>>>>
>>>>>>
>>>>>>
>>>>>> # keyspaces:
>>>>>>
>>>>>> system WITH replication = {'class': 'LocalStrategy'}
>>>>>>
>>>>>> system_schema WITH replication = {'class': 'LocalStrategy'}
>>>>>>
>>>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '1'}
>>>>>>
>>>>>> system_distributed WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '3'}
>>>>>>
>>>>>> system_traces WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '2'}
>>>>>>
>>>>>>
>>>>>>
>>>>>> <custom1> WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '3'}
>>>>>>
>>>>>> <custom2>  WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '3'}
>>>>>>
>>>>>>
>>>>>>
>>>>>> # Snitch
>>>>>>
>>>>>> Ec2Snitch
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## About Snitch and replication:
>>>>>>
>>>>>> - We have the default DC and all nodes are in the same RACK
>>>>>>
>>>>>> - We are planning to move to GossipingPropertyFileSnitch, configuring
>>>>>> the cassandra-rackdc.properties accordingly.
>>>>>>
>>>>>> -- This should be a transparent change, correct?
>>>>>>
>>>>>>
>>>>>>
>>>>>> - Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy'
>>>>>> with 'us-xxxx' DC and replica counts as before
>>>>>>
>>>>>> - Then adding a new DC inside the VPC, but this is another story...
>>>>>>
>>>>>>
>>>>>>
>>>>>> Any concerns here ?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # nodetool status <ks>
>>>>>>
>>>>>> --  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
>>>>>> UN  10.x.x.a  177 GB     256     50.3%             d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
>>>>>> UN  10.x.x.b  152.46 GB  256     51.8%             7888c077-346b-4e09-96b0-9f6376b8594f  rr
>>>>>> UN  10.x.x.c  159.59 GB  256     49.0%             329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
>>>>>> UN  10.x.x.d  162.44 GB  256     49.3%             07038c11-d200-46a0-9f6a-6e2465580fb1  rr
>>>>>> UN  10.x.x.e  174.9 GB   256     50.5%             c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
>>>>>> UN  10.x.x.f  194.71 GB  256     49.2%             f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr
>>>>>>
>>>>>>
>>>>>>
>>>>>> # gossipinfo
>>>>>>
>>>>>> /10.x.x.a
>>>>>>   STATUS:827:NORMAL,-1350078789194251746
>>>>>>   LOAD:289986:1.90078037902E11
>>>>>>   SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:290040:0.5934718251228333
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
>>>>>>   RPC_READY:868:true
>>>>>>   TOKENS:826:<hidden>
>>>>>> /10.x.x.b
>>>>>>   STATUS:16:NORMAL,-1023229528754013265
>>>>>>   LOAD:7113:1.63730480619E11
>>>>>>   SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:7274:0.5988024473190308
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
>>>>>>   TOKENS:15:<hidden>
>>>>>> /10.x.x.c
>>>>>>   STATUS:732:NORMAL,-1117172759238888547
>>>>>>   LOAD:245839:1.71409806942E11
>>>>>>   SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:245989:0.0
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
>>>>>>   RPC_READY:763:true
>>>>>>   TOKENS:731:<hidden>
>>>>>> /10.x.x.d
>>>>>>   STATUS:14:NORMAL,-1004942496246544417
>>>>>>   LOAD:313125:1.74447964917E11
>>>>>>   SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:313215:0.25641027092933655
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
>>>>>>   RPC_READY:56:true
>>>>>>   TOKENS:13:<hidden>
>>>>>> /10.x.x.e
>>>>>>   STATUS:520:NORMAL,-1058809960483771749
>>>>>>   LOAD:276118:1.87831573032E11
>>>>>>   SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:276217:0.32786884903907776
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
>>>>>>   RPC_READY:550:true
>>>>>>   TOKENS:519:<hidden>
>>>>>> /10.x.x.f
>>>>>>   STATUS:1081:NORMAL,-1039671799603495012
>>>>>>   LOAD:239114:2.09082017545E11
>>>>>>   SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:239180:0.5665722489356995
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
>>>>>>   RPC_READY:1118:true
>>>>>>   TOKENS:1080:<hidden>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## About load and tokens:
>>>>>>
>>>>>> - While load is pretty even, this does not apply to tokens; I guess we
>>>>>> have some tables with uneven distribution. This should not be the case
>>>>>> for high-load tables as partition keys are built with some
>>>>>> 'id + <some time format>'.
>>>>>>
>>>>>> - I was not able to find any documentation about the numbers printed
>>>>>> next to LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # Tombstones
>>>>>>
>>>>>> No ERRORs, only WARNs about a very specific table that we are aware
>>>>>> of. It is an append-only table read by Spark from a batch job (I guess
>>>>>> it is a read_repair_chance or DTCS misconfiguration).
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## Closing note!
>>>>>>
>>>>>> We are on m1.xlarge only: 4 vCPUs and RAID0 (stripe) across the 4
>>>>>> spinning drives, with some changes to cassandra.yaml:
>>>>>>
>>>>>>
>>>>>>
>>>>>> - dynamic_snitch: false
>>>>>>
>>>>>> - concurrent_reads: 48
>>>>>>
>>>>>> - concurrent_compactors: 1 (was 2)
>>>>>>
>>>>>> - disk_optimization_strategy: spinning
>>>>>>
>>>>>>
>>>>>>
>>>>>> I have some concerns about the number of concurrent_compactors, what
>>>>>> do you think?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
