Re: Does percentile metrics follow the rules of summations?

Gil Tene Wed, 21 Dec 2016 09:24:33 -0800

The right way to deal with percentiles (especially when it comes to 
latency) is to assume nothing more than what it says on the label.

The right way to read "99%'ile latency of a" is "1 in 100 occurrences 
of 'a' took longer than this. And we have no idea how long". That is the 
only information captured by that metric. It can be used to roughly deduce 
"what is the likelihood that a will take longer than that?". But deducing 
other stuff from it usually simply doesn't work.
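
As a small illustration of "we have no idea how long", here is a sketch 
with two made-up sets of 1000 latencies (in ms). The values and the 
nearest-rank percentile helper are assumptions chosen for the example, not 
measurements from any real system. Both sets report the same 99%'ile, yet 
their worst cases differ by orders of magnitude.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Two hypothetical latency sets (ms). Only the slowest 1% differs.
mild = [1.0] * 980 + [5.0] * 10 + [6.0] * 10
wild = [1.0] * 980 + [5.0] * 10 + [30_000.0] * 10

for name, samples in (("mild", mild), ("wild", wild)):
    p99 = percentile(samples, 99)
    slower = sum(1 for value in samples if value > p99)
    print(f"{name}: p99={p99} ms, samples slower than p99={slower}, max={max(samples)} ms")
```

In both sets 10 of the 1000 samples are slower than the reported 99%'ile, 
and nothing in that single number says whether they took 6 ms or 30 
seconds.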

Specifically, things for which projections don't work include:
(A) the likelihoods of higher or lower percentiles of the same metric a
(B) the likelihood of similar values in neighboring metrics (b, c, or d)
(C) the likelihood of a certain percentile of a composite operation (a + b + 
c + d in your example) including the same percentile of a

The reasons for A usually have to do with the sad fact that latency 
distributions are usually strongly multi-modal, and tend to not exhibit any 
form of normal distribution. A given percentile means what it means and 
nothing more, and projecting from one percentile measurement to another 
(unmeasured but extrapolated) is usually a silly act of wishful thinking. 
No amount of wishing that the "shape" of latency distribution was roughly 
known (and hopefully something close to a normal bell curve) will make it 
so. Not even close.
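
A minimal sketch of why (A) fails, assuming a made-up bimodal latency 
profile: 99.5% of operations take about 1 ms and 0.5% hit a stall of about 
200 ms. The mode sizes and values are hypothetical. The code fits a normal 
curve to the measured mean and standard deviation and compares the 
percentiles that curve predicts against the ones actually observed.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


rng = random.Random(42)

# Hypothetical bimodal latencies (ms): a fast mode and a rare slow mode.
samples = [
    rng.gauss(200.0, 20.0) if rng.random() < 0.005 else rng.gauss(1.0, 0.1)
    for _ in range(100_000)
]

measured_p99 = percentile(samples, 99)
measured_p999 = percentile(samples, 99.9)

# Wishful thinking: pretend the data is a bell curve and read percentiles off the fit.
fit = NormalDist(mean(samples), stdev(samples))
guessed_p99 = fit.inv_cdf(0.99)
guessed_p999 = fit.inv_cdf(0.999)

print(f"p99:   measured {measured_p99:8.2f} ms, bell-curve guess {guessed_p99:8.2f} ms")
print(f"p99.9: measured {measured_p999:8.2f} ms, bell-curve guess {guessed_p999:8.2f} ms")
```

With these assumptions the measured 99%'ile sits in the fast mode and the 
measured 99.9%'ile sits in the slow mode, so the bell-curve guess is far 
too high for one and far too low for the other.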

The reasons for B should be obvious.

The reasons for C usually have to do with the fact that the things that 
shape latency distributions in multiple related metrics (e.g. a, b, c, d) 
often exhibit correlation or anti-correlation.

A common cause for high correlations in higher percentiles is that things 
being measured may be commonly impacted by infrastructure or system 
resource artifacts that dominate the causes for their higher latencies. 
E.g. if a, b, and c are running on the same system and that system 
experiences some sort of momentary "glitch" (e.g. a periodic internal 
bookkeeping operation), their higher percentiles may be highly correlated. 
The same happens when momentary concentrations and spikes in arrival rates 
cause higher latencies due to queue buildups, and when the cause of the 
longer latency is the complexity or size of the specific operation.
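
A rough simulation sketch of the effect on a + b + c + d. Everything here 
is an assumption made for illustration: the 0.6% glitch probability, the 
100 ms glitch penalty, and the uniform 0.8 to 1.2 ms base latency. Each 
component has the same individual latency distribution in both runs. The 
only difference is whether a glitch hits all four components of a request 
at once (shared system) or each component independently.

```python
import math
import random


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def simulate(shared_glitch, n=100_000, glitch_prob=0.006, seed=1):
    """Return (sum of per-component p99s, p99 of the total) in ms."""
    rng = random.Random(seed)
    per_component = {name: [] for name in "abcd"}
    totals = []
    for _ in range(n):
        system_glitch = rng.random() < glitch_prob
        total = 0.0
        for name in "abcd":
            # Shared: one glitch slows every component. Independent: each rolls its own dice.
            hit = system_glitch if shared_glitch else rng.random() < glitch_prob
            latency = rng.uniform(0.8, 1.2) + (100.0 if hit else 0.0)
            per_component[name].append(latency)
            total += latency
        totals.append(total)
    sum_of_p99s = sum(percentile(values, 99) for values in per_component.values())
    return sum_of_p99s, percentile(totals, 99)


shared_sum, shared_total = simulate(shared_glitch=True)
indep_sum, indep_total = simulate(shared_glitch=False)

print(f"shared glitch:      sum of p99s = {shared_sum:7.2f} ms, p99 of total = {shared_total:7.2f} ms")
print(f"independent glitch: sum of p99s = {indep_sum:7.2f} ms, p99 of total = {indep_total:7.2f} ms")
```

Under these assumptions the per-component 99%'iles look the same in both 
runs (under 5 ms when summed), while the 99%'ile of the total is a few ms 
in the shared case and over 100 ms in the independent case. The 
per-component numbers alone cannot tell the two apart.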

Anti-correlation is often seen when the occurrence of a higher latency in 
one component makes a higher latency in another component in the same 
sequence less likely than it normally would be. The causes for 
anti-correlation can vary widely, but one common example I see is when the 
things performing a, b, c, d utilize some cached state services, and high 
latencies are dominated by "misses" in those caches. In systems that work 
and behave like that, it is common to see one of the steps effectively 
"constructively prefetch" state for the others, making the likelihood of a 
high-percentile-causing "miss" in the cache on "a" be much higher than a 
similar miss in b, c, or d. This "constructive pre-fetching" effect occurs 
naturally with all sorts of caches, from memcache to disk and networked 
storage system caches to OS file caches to CPU caches.  
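
A toy model of that pre-fetching effect, as a sketch only. The 3% cold 
probability, the 0.3% residual miss probability, and the 100 ms miss 
penalty are hypothetical numbers. In this model 'a' pays the miss whenever 
the state is cold and thereby warms the cache, so b, c, and d can only miss 
on requests where 'a' did not.

```python
import math
import random

MISS_PENALTY_MS = 100.0  # hypothetical cost of a cache miss


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run(n=100_000, cold_prob=0.03, residual_miss_prob=0.003, seed=7):
    rng = random.Random(seed)
    latencies = {name: [] for name in "abcd"}
    totals = []
    both_a_and_b_missed = 0
    for _ in range(n):
        cold = rng.random() < cold_prob
        misses = {"a": cold}
        for name in "bcd":
            # If 'a' just missed, it warmed the cache for the later steps.
            misses[name] = (not cold) and rng.random() < residual_miss_prob
        total = 0.0
        for name in "abcd":
            value = rng.uniform(0.8, 1.2) + (MISS_PENALTY_MS if misses[name] else 0.0)
            latencies[name].append(value)
            total += value
        totals.append(total)
        if misses["a"] and misses["b"]:
            both_a_and_b_missed += 1
    p99 = {name: percentile(values, 99) for name, values in latencies.items()}
    return p99, percentile(totals, 99), both_a_and_b_missed


p99, p99_total, both_missed = run()

for name in "abcd":
    print(f"p99({name}) = {p99[name]:7.2f} ms")
print(f"p99(a + b + c + d) = {p99_total:7.2f} ms")
print(f"4 * p99(a)         = {4 * p99['a']:7.2f} ms")
print(f"requests where both a and b missed: {both_missed}")
```

Here the 99%'ile of 'a' is dominated by misses while the 99%'iles of b, c, 
and d are not, so projecting a's number onto its neighbors (4 * p99(a)) 
overstates the 99%'ile of the total by roughly a factor of four.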

On Wednesday, December 21, 2016 at 2:53:45 AM UTC-8, Gaurav Abbi wrote:
>
> Hi,
> We are collecting certain metrics using (*Graphite + Grafana*) and use them 
> as a tool to monitor system health and performance. 
>
> For one of the latency metrics, we get the total time as well as the 
> latencies for all the sub-components it is composed of.
>
> We display 99th percentile for all the values. However, if we sum up the 
> 99th percentiles for latencies of sub-components, they do not equate to the 
> 99th percentile of the total time.
>
> Essentially it comes down to whether percentiles can follow summation rules. 
> i.e.
>
> if 
> *a + b + c + d = s*
>
> then,
> *p99(a) + p99(b) + p99(c) + p99(d) = p99(s) ?*
>
> Will this hold?
>

