>From what I can tell, Histograms don't seem to be used extensively in the 
>Kafka server (only in RequestChannel.scala) and I'm not sure we need them for 
>per-client metrics. Topic metrics currently use meters. Migrating graphing and 
>alerting will be quite a significant effort for all users of Kafka. Do the 
>potential benefits of the new metrics package outweigh this one time 
>migration? In the long run it seems nice to have a unified metrics package 
>across clients and server. If we were starting out from scratch without any 
>existing deployments, what decision would we take?

I suppose the relative effort of supporting quotas is a useful data point in this 
discussion. We need to throttle based on the current byte rate, which would be 
a "Meter" in codahale terms. The Meter implementation uses 1, 5 and 15 minute 
exponentially weighted moving averages, and it does not immediately reflect the 
most recent samples: the EWMA class has a scheduled task that runs every 5 
seconds and only folds the new data into the rate at that point. In that 
particular case I think the new library is superior, since it is more 
responsive. If we do choose to remain with Yammer on the server, here are a few 
ideas on how to support quotas with relatively little effort.
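
To make the responsiveness point concrete, here is a stripped-down sketch of a 
codahale-style EWMA rate. The constants and names are illustrative rather than the 
actual library code, but it shows why events recorded between ticks only move the 
reported rate after the next 5 second tick:

    // Illustrative EWMA rate, loosely modeled on the codahale approach (not the actual implementation).
    import java.util.concurrent.atomic.LongAdder;

    public class SimpleEwmaRate {
        private static final double TICK_SECONDS = 5.0;
        private final double alpha;                          // smoothing factor derived from the window length
        private final LongAdder uncounted = new LongAdder();
        private volatile double ratePerSec = 0.0;
        private volatile boolean initialized = false;

        public SimpleEwmaRate(double windowSeconds) {
            this.alpha = 1.0 - Math.exp(-TICK_SECONDS / windowSeconds);
        }

        // Called on every event; cheap, but does not change the reported rate.
        public void mark(long n) {
            uncounted.add(n);
        }

        // Called by a scheduled task every 5 seconds; only here does new data move the rate.
        public void tick() {
            double instantRate = uncounted.sumThenReset() / TICK_SECONDS;
            if (initialized) {
                ratePerSec += alpha * (instantRate - ratePerSec);
            } else {
                ratePerSec = instantRate;
                initialized = true;
            }
        }

        public double rate() {
            return ratePerSec;
        }
    }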

- We could have a new type of Meter called "QuotaMeter" that wraps the 
existing meter code and follows the same pattern that the Sensor does in the 
new metrics library. This QuotaMeter would be configured with a Quota and could 
maintain a finer grained rate than 1 minute (10 seconds? configurable?). Any time 
we call "mark()", it updates the underlying rates and throws a 
QuotaViolationException if required. This class could either extend Meter or be a 
separate implementation of the Metric type that every metric implements. A rough 
sketch of this idea follows below, after the second point.

- We could also consider implementing these quotas with the new metrics package 
and have them co-exist with the existing metrics. This leads to two metrics 
packages being used on the server, but both are already pulled in as dependencies 
anyway. Using the new package only for the metrics we want to quota on may not be 
a bad place to start (a sketch of that is below as well).
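
Here is a very rough sketch of the QuotaMeter idea from the first point. This is 
plain Java with a hand-rolled fixed window rather than the real yammer Meter 
internals, and the window length, quota value and QuotaViolationException are all 
assumptions for illustration:

    // Rough sketch only: a meter-like wrapper that tracks a short, fixed-window rate and
    // enforces a quota. The quota value, window length and QuotaViolationException are
    // illustrative, not existing Kafka or yammer classes.
    public class QuotaMeter {
        public static class QuotaViolationException extends RuntimeException {
            public QuotaViolationException(String msg) { super(msg); }
        }

        private final double quotaPerSec;   // allowed rate, e.g. bytes per second
        private final long windowMs;        // finer grained than 1 minute, e.g. 10 seconds
        private long windowStartMs;
        private long countInWindow;

        public QuotaMeter(double quotaPerSec, long windowMs) {
            this.quotaPerSec = quotaPerSec;
            this.windowMs = windowMs;
            this.windowStartMs = System.currentTimeMillis();
        }

        public synchronized void mark(long n) {
            long now = System.currentTimeMillis();
            if (now - windowStartMs >= windowMs) {   // roll over into a new window
                windowStartMs = now;
                countInWindow = 0;
            }
            countInWindow += n;
            // A real implementation would also update the wrapped yammer Meter for reporting.
            double allowedInWindow = quotaPerSec * (windowMs / 1000.0);
            if (countInWindow > allowedInWindow)
                throw new QuotaViolationException("Recorded " + countInWindow +
                    " in the current window, quota allows " + allowedInWindow);
        }
    }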
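
And here is roughly what the quota check looks like if we use the new metrics 
package for the quota'd metrics (the second point). The class and method names are 
from org.apache.kafka.common.metrics as I remember them, so treat the exact 
signatures as approximate:

    // Rough sketch of a per-client byte-rate quota using the new metrics package.
    // Names are from org.apache.kafka.common.metrics as recalled; verify against the actual API.
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.*;
    import org.apache.kafka.common.metrics.stats.Rate;

    public class QuotaExample {
        public static void main(String[] args) {
            Metrics metrics = new Metrics();
            // Sensor whose config carries an upper-bound quota of 1 MB/s.
            MetricConfig config = new MetricConfig().quota(Quota.upperBound(1024 * 1024));
            Sensor bytesIn = metrics.sensor("client-1.bytes-in", config);
            bytesIn.add(new MetricName("byte-rate", "client-quota-metrics"), new Rate());

            try {
                bytesIn.record(4096);   // record bytes received; checks the measured rate against the quota
            } catch (QuotaViolationException e) {
                // throttle or reject the request here
            }
        }
    }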

Thanks,
Aditya
________________________________________
From: Jay Kreps [jay.kr...@gmail.com]
Sent: Wednesday, March 25, 2015 11:08 PM
To: dev@kafka.apache.org
Subject: Re: Metrics package discussion

Here was my understanding of the issue last time.

The yammer metrics use a random sample of requests to estimate the
histogram. This allocates a fairly large array of longs (their values are
longs rather than floats). A reasonable sample might be 8k entries, which
would give about 64KB per histogram. There are bounds on accuracy, but they
are only probabilistic. I.e. if you aim for 99% of estimates being within 5 ms,
you will exceed that error 1% of the time. This is okay, but if you alert on
these metrics, being wrong 1% of the time is a lot when you are computing
stats every second, continuously, on many metrics (i.e. 1 in 100 estimates
will be outside your bound). This array is also copied in full every time you
check the metric, which is the other cause of the memory pressure.
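
For anyone who hasn't looked at it, here is a stripped-down sketch of what such a
sampling histogram does. This is only the shape of the approach, not the yammer
implementation:

    // Stripped-down reservoir-sampling histogram, to show where the memory goes.
    // This is only an illustration of the approach, not the yammer implementation.
    import java.util.Arrays;
    import java.util.concurrent.ThreadLocalRandom;

    public class SamplingHistogram {
        private final long[] reservoir = new long[8192];   // ~64KB of longs per histogram
        private long count = 0;

        public synchronized void update(long value) {
            count++;
            if (count <= reservoir.length) {
                reservoir[(int) count - 1] = value;
            } else {
                // Classic reservoir sampling: keep each new value with probability size/count.
                long idx = ThreadLocalRandom.current().nextLong(count);
                if (idx < reservoir.length)
                    reservoir[(int) idx] = value;
            }
        }

        // Every read copies the full sample, which adds to the memory churn.
        public synchronized long percentile(double p) {
            int n = (int) Math.min(count, reservoir.length);
            if (n == 0) return 0;
            long[] copy = Arrays.copyOf(reservoir, n);
            Arrays.sort(copy);
            return copy[(int) Math.min(n - 1, Math.floor(p * n))];
        }
    }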

The better approach to histograms is to calculate bucket boundaries and
record arbitrarily many values in those buckets. A simple bucketing
approach for latency would be 0, 5ms, 10ms, 15ms, etc., and you just count
how many fall in each bucket. Your precision is deterministically bounded
by the bucket boundaries, so if you had 5ms buckets you would never have
more than 5ms loss of precision. By using non-uniform bucket sizes you can
make this work even better (e.g. give ~1ms precision for latencies in the
1ms range, but give only 1 second precision for latencies in the 30 second
range). That is what is implemented in that metrics package.
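
A minimal sketch of the bucketed approach, using uniform 5ms buckets for
simplicity (the real implementations use non-uniform bucket widths):

    // Minimal fixed-width bucketed histogram: deterministic precision, one counter per bucket.
    // Uniform 5ms buckets for simplicity; real implementations use non-uniform widths.
    import java.util.concurrent.atomic.AtomicLongArray;

    public class BucketedLatencyHistogram {
        private static final long BUCKET_WIDTH_MS = 5;
        private final AtomicLongArray buckets;   // counts per bucket, last bucket is the overflow

        public BucketedLatencyHistogram(long maxMs) {
            this.buckets = new AtomicLongArray((int) (maxMs / BUCKET_WIDTH_MS) + 1);
        }

        public void record(long latencyMs) {
            int idx = (int) Math.min(latencyMs / BUCKET_WIDTH_MS, buckets.length() - 1);
            buckets.incrementAndGet(idx);
        }

        // Walk the buckets until we pass the requested quantile; the answer is at most 5ms off.
        public long percentile(double p) {
            long total = 0;
            for (int i = 0; i < buckets.length(); i++)
                total += buckets.get(i);
            long target = (long) Math.ceil(p * total);
            long seen = 0;
            for (int i = 0; i < buckets.length(); i++) {
                seen += buckets.get(i);
                if (seen >= target)
                    return (i + 1) * BUCKET_WIDTH_MS;   // upper edge of the bucket
            }
            return buckets.length() * BUCKET_WIDTH_MS;
        }
    }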

I think this bucketing approach is popular now. There is a whole "HDR
histogram" library that gives lots of different bucketing methods and
implements dynamic resizing so you don't have to specify an upper bound.
 https://github.com/HdrHistogram/HdrHistogram
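
Typical usage looks something like the following. This is from memory of the
HdrHistogram API, so double check the exact constructor arguments:

    // HdrHistogram usage sketch (API from memory; verify against the library's docs).
    import org.HdrHistogram.Histogram;

    public class HdrExample {
        public static void main(String[] args) {
            // Track values up to 1 hour in microseconds, with 3 significant digits of precision.
            Histogram latencies = new Histogram(3_600_000_000L, 3);

            latencies.recordValue(1_250);     // 1.25 ms, in microseconds
            latencies.recordValue(42_000);    // 42 ms

            long p99 = latencies.getValueAtPercentile(99.0);
            System.out.println("p99 latency (us): " + p99);
        }
    }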

Whether this matters depends entirely on whether you want histograms broken
down at the client, topic, partition, or broker level or just want overall
metrics. If we just want per-server aggregates for histograms then I think the
memory usage is not a huge issue. If you want a histogram per topic or client
or partition and have 10k of these, then that is where you start talking about
something like 1GB of memory with the yammer package, which is what we hit
last time. Getting percentiles at the client level is nice, and percentiles
are definitely better than averages, but I'm not sure it is required.
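
For what it's worth, the back-of-the-envelope arithmetic behind that figure, with
rough numbers (exact values depend on the sample size and how often the metrics
are polled):

    // Back-of-the-envelope memory estimate for per-entity sampling histograms (rough numbers).
    public class HistogramMemoryEstimate {
        public static void main(String[] args) {
            long sampleEntries = 8 * 1024;                       // reservoir size per histogram
            long bytesPerEntry = 8;                              // long values
            long perHistogram = sampleEntries * bytesPerEntry;   // ~64KB
            long histograms = 10_000;                            // e.g. per client/topic/partition breakdowns
            long resident = perHistogram * histograms;           // ~625MB just for the reservoirs
            // Each read copies the full sample, so frequent polling roughly doubles the transient footprint.
            long withSnapshots = resident * 2;                   // on the order of 1GB
            System.out.printf("per histogram: %dKB, resident: %dMB, with snapshot copies: ~%dMB%n",
                    perHistogram / 1024, resident / (1024 * 1024), withSnapshots / (1024 * 1024));
        }
    }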

-Jay

On Wed, Mar 25, 2015 at 9:43 PM, Neha Narkhede <n...@confluent.io> wrote:

> Aditya,
>
> If we are doing a deep dive, one of the things to investigate would be
> memory/GC performance. IIRC, when I was looking into codahale at LinkedIn,
> I remember it having quite a few memory management and GC issues while
> using histograms. In comparison, histograms in the new metrics package
> aren't very well tested.
>
> Thanks,
> Neha
>
> On Wed, Mar 25, 2015 at 8:25 AM, Aditya Auradkar <
> aaurad...@linkedin.com.invalid> wrote:
>
> > Hey everyone,
> >
> > Picking up this discussion after yesterday's KIP hangout. For anyone who
> > did not join the meeting, we have 2 different metrics packages being used
> > by the clients (custom package) and the server (codahale). We are
> > discussing whether to migrate the server to the new package.
> >
> > What information do we need in order to make a decision?
> >
> > Some pros of the new package:
> > - Using the most recent information by combining data from previous and
> > current samples. I'm not sure how codahale does this so I'll investigate.
> > - We can quota on anything we measure. This is pretty cool IMO. I'll
> > investigate the feasibility of adding this feature in codahale.
> > - Hierarchical metrics. For example: we can define a sensor for overall
> > bytes-in/bytes-out and also per-client. Updating the client sensor will
> > cause the global byte rate sensor to get modified too.
> >
> > What are some of the issues with codahale? One previous discussion
> > mentions high memory usage, but I don't have any experience with it
> > myself.
> >
> > Thanks,
> > Aditya
> >
>
> --
> Thanks,
> Neha
>
