Hi Ryan,


Thank you for the exhaustive list of params to monitor and for your detailed 
reply. It will definitely help.



Nagaraj,



We are planning to use ManageEngine Application Manager to monitor Cassandra 
Instances.



Scott,



We will try out this tool as well.



Best Regards,

Julian.


---- On Mon, 29 Aug 2016 20:52:16 +0530 Ryan Svihla 
<r...@foundev.pro> wrote ----




Benedict, I really don't want to turn this into a battle about whose opinion is more valid, and I really respect all the good work you've done for Apache Cassandra.



I'll just reiterate that I'm comfortable saying 0.6 is a good starting point, and that it is often not the ideal value once you go through more thorough testing; all of which I said initially, and I still think it is a reasonable statement.



-regards,



Ryan Svihla



On Sat, Aug 27, 2016 at 9:31 AM -0500, "Benedict Elliott Smith" 
<bened...@apache.org> wrote:

 





I did not claim you had no evidence, only that your statement lacked 
justification.  Again, nuance is important.



I was suggesting that blanket statements to the user mailing list, countermanding the defaults without the necessary caveats or 'justification' (explanation, reasoning), are liable to cause confusion about what best practice is. I attempted to provide some of the missing context to minimise this confusion while still largely agreeing with you.



However, you should also bear in mind that you work as a field engineer for DataStax, and as such your sample of installation behaviours will be biased towards those where the defaults have not worked well.


On Saturday, 27 August 2016, Ryan Svihla <r...@foundev.pro> wrote:

 I have been trying to get the docs fixed for this for the past 3 months, and there is already a ticket open for changing the defaults. I don't feel like I've had a small amount of evidence here. All of my observation over 3 years of work in the field suggests compaction keeps coming up as the bottleneck when you push Cassandra ingest.

0.6 as an initial setting has fixed 20+ broken clusters in practice, and it improved overall performance in every case over defaults ranging from 0.33 down to 0.03 (the yaml suggests one flush writer per core; add in the prevalence of hyper-threading and you see a lot of 24+ flush writer systems in the wild, which pushes the default threshold very low).



No disrespect intended, but that default hasn't worked out well at all in my exposure to it, and 0.6 has never been worse than the default yet. Obviously write patterns, heap configuration, memtable size limits and whatnot affect the exact optimal setting, and I've rarely had it end up at 0.6 after a tuning exercise. I never intended that as a blanket recommendation, just a starting one.




_____________________________

From: Benedict Elliott Smith <bened...@apache.org>

Sent: Friday, August 26, 2016 9:40 AM

Subject: Re: Guidelines for configuring Thresholds for Cassandra metrics

To: <user@cassandra.apache.org>





The default when I wrote it was 0.4, but it was found this did not saturate flush writers in JBOD configurations. IIRC it now defaults to 1/(1 + #disks), which is not a terrible default, but obviously comes out much lower if you have many disks.



This smaller value behaves better for peak performance, but in a live system where compaction is king, not saturating flush in return for lower write amplification (from flushing larger memtables) will indeed often be a win.



0.6, however, is probably not the best default unless you have a lot of tables being actively written to, in which case even 0.8 would be fine. With a single main table receiving your writes at a given time, 0.4 is probably an optimal value when making this trade-off against peak performance.
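
For concreteness, a rough Python sketch of the arithmetic behind those numbers (the 1/(1 + #disks) formula is the one quoted above; the example counts and the comparison values are illustrative only, not part of the original discussion):

# Sketch of the memtable_cleanup_threshold arithmetic discussed above.
# The 1/(1 + N) formula is the default quoted in this thread; the example
# counts and the 0.4/0.6/0.8 comparison values are illustrative only.

def default_cleanup_threshold(n: int) -> float:
    """Default memtable_cleanup_threshold for n flush writers (or disks)."""
    return 1.0 / (1.0 + n)

for n in (1, 2, 6, 12, 24):
    print(f"{n:>2} flush writers/disks -> default {default_cleanup_threshold(n):.3f}")

# Prints roughly 0.500, 0.333, 0.143, 0.077, 0.040 - compare with the
# hand-tuned starting points discussed here: 0.4 (one hot table),
# 0.6 (general starting point), 0.8 (many actively written tables).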



Anyway, it's probably better to file a ticket to discuss defaults and documentation than to make a statement like this without justification. I can see where you're coming from, but it's confusing for users to have such blanket guidance that counters the defaults. If the defaults can be improved (which I agree they can), it's probably better to do that, along with better documentation, so the nuance is accounted for.





On Friday, 26 August 2016, Ryan Svihla <r...@foundev.pro> wrote:



Forgot the most important thing: logs.

ERROR: you should investigate.

WARN: you should have a list of known ones; this is use case dependent. Ideally you change configuration accordingly.

* PoolCleaner (slab or native) - a good indication the node is tuned badly if you see a ton of this. Set memtable_cleanup_threshold to 0.6 as an initial attempt to configure this correctly. This is a complex topic to dive into, so that may not be the best number, but it will likely be better than the default; why it's not the default is a bigger conversation.

There are a bunch of other logs I look for that are escaping me at present, but that's a good start.
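
To make that concrete, a minimal Python sketch of scanning the log for those signals (the log path, the level-is-first-token parsing and the "PoolCleaner" substring match are assumptions on my part; message formats differ across Cassandra versions):

# Minimal sketch: scan a Cassandra system.log for the signals above.
# The log path, the level parsing and the "PoolCleaner" substring match
# are assumptions; formats vary by Cassandra version.
from collections import Counter

LOG_PATH = "/var/log/cassandra/system.log"  # assumed default location

counts = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        level = line.split(None, 1)[0] if line.strip() else ""
        if level in ("ERROR", "WARN"):
            counts[level] += 1
        if "PoolCleaner" in line:  # SlabPoolCleaner / NativePoolCleaner threads
            counts["PoolCleaner"] += 1

print(dict(counts))
if counts["PoolCleaner"] > 100:  # "a ton" is subjective; this cutoff is made up
    print("Heavy PoolCleaner activity - consider memtable_cleanup_threshold: 0.6 as a starting point")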



-regards,



Ryan Svihla



On Fri, Aug 26, 2016 at 7:21 AM -0500, "Ryan Svihla" <r...@foundev.pro> 
wrote:



Thomas,



Not all metrics are KPIs; many are only useful when researching a specific issue or after a use-case-specific threshold has been set.



The main "canaries" I monitor are:

* Pending compactions (dependent on the compaction strategy chosen, but 1000 is a sign of severe issues in all cases)

* dropped mutations (more than one I treat as an event to investigate; I believe in allowing operational overhead, and any evidence of load shedding suggests I may not have as much as I thought)

* blocked anything (flush writers, etc.; more than one I investigate)

* system hints (more than 1k I investigate)

* heap usage and GC time vary a lot by use case and collector chosen; I aim for below 65% usage as an average with G1, but this again varies by use case a great deal. Sometimes I just look at the chart and query patterns, and if they don't line up I have to do other, deeper investigations

* read and write latencies exceeding SLA are also use case dependent. For those that have no SLA I tend to push towards a p99 of 100ms for a mid-range SSD-based system and 600ms for a spindle-based system, with CL ONE and assuming a "typical" query pattern (again, query patterns and CL vary a lot here)

* cell count and partition size vary greatly by hardware and GC tuning, but in the absence of all other relevant information I like to keep cell count for a partition below 100k and size below 100MB. I do, however, have many successful use cases running more, and I've had some fail well before that. Hardware and tuning tradeoffs shift this around a lot. A rough script for checking a few of these canaries is sketched below.
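
The rough check sketch mentioned above (Python; the thresholds are the ones from this list, the nodetool parsing is simplified and the exact tpstats layout varies by Cassandra version; system hints and heap/GC checks need JMX and are omitted):

# Rough sketch: check a few of the "canaries" above via nodetool output.
# Thresholds come from the list in this mail; the parsing is simplified and
# the tpstats column layout varies by Cassandra version. System hints and
# heap/GC checks need JMX and are omitted here.
import re
import subprocess

def nodetool(*args):
    # Assumes nodetool is on PATH and targets the local node.
    return subprocess.run(["nodetool", *args],
                          capture_output=True, text=True, check=True).stdout

alerts = []

# Pending compactions: > 1000 is a sign of severe issues in all cases.
m = re.search(r"pending tasks:\s*(\d+)", nodetool("compactionstats"))
if m and int(m.group(1)) > 1000:
    alerts.append(f"pending compactions: {m.group(1)}")

for line in nodetool("tpstats").splitlines():
    cols = line.split()
    # Thread-pool rows: Name Active Pending Completed Blocked AllTimeBlocked
    if len(cols) == 6 and cols[4].isdigit() and int(cols[4]) > 1:
        alerts.append(f"blocked threads: {line.strip()}")
    # Dropped-message rows at the bottom: "<MESSAGE TYPE> <count>"
    if len(cols) == 2 and cols[0].isupper() and cols[1].isdigit() and int(cols[1]) > 1:
        alerts.append(f"dropped {cols[0]}: {cols[1]}")

print("\n".join(alerts) if alerts else "canaries look OK")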

There is unfortunately, as you'll note, a lot of nuance, and the loadout really changes what looks right (down to the model of SSD: I have different expectations for p99s, and if it's a model I haven't used before I'll do some comparative testing).



The reason so much of this is general and vague is my selection bias. I'm brought in when people are complaining about performance or some grand systemic crash because they were monitoring nothing. I have little ability to change hardware initially, so I have to be willing to let the hardware do the best it can and establish the levels where it can no longer keep up with the customer's goals. This may mean that for some use cases 10 pending compactions is an actionable event, while for another customer 100 is. The better approach is to establish a baseline for when these metrics start to indicate a serious issue is occurring in that particular app. Basically, when people notice a problem, what did these numbers look like in the minutes, hours and days prior? That's the way to establish the levels consistently.
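
A minimal sketch of what that baselining can look like in practice (run from cron or your monitoring agent; the file path, interval and the single metric recorded are illustrative only - in reality you would record all of the canaries above):

# Minimal baselining sketch: append a canary number to a CSV on a schedule so
# that when someone reports a problem you can look back at the minutes, hours
# and days prior. Path and the single metric recorded are illustrative only.
import csv
import re
import subprocess
import time

def pending_compactions() -> int:
    out = subprocess.run(["nodetool", "compactionstats"],
                         capture_output=True, text=True, check=True).stdout
    m = re.search(r"pending tasks:\s*(\d+)", out)
    return int(m.group(1)) if m else 0

with open("/var/tmp/cassandra_baseline.csv", "a", newline="") as f:
    csv.writer(f).writerow([int(time.time()), pending_compactions()])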



Regards,



Ryan Svihla



On Fri, Aug 26, 2016 at 4:48 AM -0500, "Thomas Julian" 
<thomasjul...@zoho.com> wrote:



Hello,



I am working on setting up a monitoring tool to monitor Cassandra instances. Are there any wikis which specify the optimum value for each Cassandra KPI?

For instance, I am not sure:

What value of "Memtable Columns Count" can be considered "normal".

What value of the same has to be considered "critical".


I know threshold numbers for a few params; for instance, anything more than zero for timeouts or pending tasks should be considered unusual. Also, I am aware that most of the statistics' threshold numbers vary with the hardware specification and Cassandra environment setup. But what I request here is a general guideline for configuring thresholds for all the metrics.



If this has already been covered, please point me to that resource. If anyone has collected these things out of their own interest, please share.



Any help is appreciated.



Best Regards,

Julian.


