Otis,

The jira for moving the broker to the new metrics is KAFKA-1930.

We didn't try to do the conversion in 0.8.2 because (1) the new metrics are
missing reporters for popular systems like Graphite and Ganglia; (2) the
histogram support in the new metrics is a bit different and we were not
sure if it's good enough for our usage. We will need to have an answer to
both before we can migrate to the new metrics. So, the migration may not
happen in 0.8.3.

One of the reasons that we want to move to the new metrics is that as we
are reusing more and more code from the java client, we will be pulling in
metrics in the new format. In order to keep the metrics consistent, it's
probably better to just bite the bullet and migrate all Coda Hale metrics
to the new one.

Thanks,

Jun

On Tue, Apr 21, 2015 at 9:29 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> I'm veeeeeery late to this thread.  I'm with Gwen about metrics being the
> public API (but often not treated as such, sadly).  I don't know the
> details of internal issues around how metrics are implemented but, for
> selfish reasons, would hate to see MBeans change - we spent weeks
> contributing more than a dozen iterations of patches for changing the old
> Kafka 0.8.1.x metrics to what they are now in 0.8.2.  I wish somebody had
> mentioned these (known?) issues then - since metrics were so drastically
> changed then, we could have done it right immediately.  Also, when you
> change MBean names and structure you force everyone to rewrite their MBean
> parsers (not your problem, but still something to be aware of).
>
> If metrics are going to be changing, would it be possible to enumerate the
> changes somewhere?
>
> Finally, I tried finding a JIRA issue for changing metrics, so I can watch
> it, but couldn't find it here:
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%200.8.3%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>
> Am I looking in the wrong place?
> Is there an issue for the changes discussed in this thread?
> Is the decision to do it in 0.8.3 final?
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Mar 31, 2015 at 12:43 PM, Steven Wu <stevenz...@gmail.com> wrote:
>
> > > My main concern is that if we don't do the migration in 0.8.3, we will be
> > left
> > with some metrics in YM format and some others in KM format (as we start
> > sharing client code on the broker). This is probably a worse situation to
> > be in.
> >
> > +1. I am not sure how our Servo adaptor will work if there are two
> > formats for metrics, unless there is an easy way to check the format
> > (YM/KM).
> >
> >
> > On Tue, Mar 31, 2015 at 9:40 AM, Jun Rao <j...@confluent.io> wrote:
> >
> > > (2) The metrics are clearly part of the client API and we are not
> > changing
> > > that (at least for the new client). Arguably, the metrics are also part
> > of
> > > the broker side API. However, since they affect fewer parties (mostly
> > just
> > > the Kafka admins), it may be easier to make those changes.
> > >
> > > My main concern is that if we don't do the migration in 0.8.3, we will be
> > left
> > > with some metrics in YM format and some others in KM format (as we
> start
> > > sharing client code on the broker). This is probably a worse situation
> to
> > > be in.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Mar 31, 2015 at 9:26 AM, Gwen Shapira <gshap...@cloudera.com>
> > > wrote:
> > >
> > > > (2) I believe we agreed that our metrics are a public API. I believe
> > > > we also agree we don't break API in minor releases. So, it seems
> > > > obvious to me that we can't make breaking changes to metrics in minor
> > > > releases. I'm not convinced "we did it in the past" is a good reason
> > > > to do it again.
> > > >
> > > > Is there a strong reason to do it in a 0.8.3 time-frame?
> > > >
> > > > On Tue, Mar 31, 2015 at 7:59 AM, Jun Rao <j...@confluent.io> wrote:
> > > > > (2) Not sure why we can't do this in 0.8.3. We changed the metrics
> > > names
> > > > in
> > > > > 0.8.2 already. Given that we need to share code between the client and
> > the
> > > > > core, and we need to keep the metrics consistent on the broker, it
> > > seems
> > > > > that we have no choice but to migrate to KM. If so, it seems that
> the
> > > > > sooner that we do this, the better. It is important to give people
> an
> > > > easy
> > > > > path for migration. However, it may not be easy to keep the mbean
> > names
> > > > > exactly the same. For example, YM has hardcoded attributes (e.g.
> > > > > 1-min-rate, 5-min-rate, 15-min-rate, etc for rates) that are not
> > > > available
> > > > > in KM.
> > > > >
> > > > > One benefit out of this migration is that one can get the metrics
> in
> > > the
> > > > > client and the broker in the same way.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Mar 30, 2015 at 9:26 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
> > > > >
> > > > >> (1) It will be interesting to see what others use for monitoring
> > > > >> integration, to see what is already covered with existing JMX
> > > > >> integrations and what needs special support.
> > > > >>
> > > > >> (2) I think the migration story is more important - this is a
> > > > >> non-compatible change, right? So we can't do it in 0.8.3
> timeframe,
> > it
> > > > >> has to be in 0.9? And we need to figure out how will users
> migrate -
> > > > >> do we just tell everyone "please reconfigure all your monitors
> from
> > > > >> scratch - don't worry, it is worth it?"
> > > > >> I know you keep saying we did it before and our users are used to
> > it,
> > > > >> but I think there are a lot more users now, and some of them have
> > > > >> different compatibility expectations. We probably need to find:
> > > > >> * A least painful way to migrate - can we keep the names of at
> least
> > > > >> most of the metrics intact?
> > > > >> * Good explanation of what users gain from this painful migration
> > > > >> (i.e. more accurate statistics due to gazillion histograms)
> > > > >>
> > > > >> On Mon, Mar 30, 2015 at 6:29 PM, Jun Rao <j...@confluent.io> wrote:
> > > > >> > If we are committed to migrating the broker side metrics to KM
> for
> > > the
> > > > >> next
> > > > >> > release, we will need to (1) have a story on supporting common
> > > > reporters
> > > > >> > (as listed in KAFKA-1930), and (2) see if the current histogram
> > > > support
> > > > >> is
> > > > >> > good enough for measuring things like request time.
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Jun
> > > > >> >
> > > > >> > On Mon, Mar 30, 2015 at 3:03 PM, Aditya Auradkar <
> > > > >> > aaurad...@linkedin.com.invalid> wrote:
> > > > >> >
> > > > >> >> If we do plan to use the network code in the client, I think that
> is
> > a
> > > > good
> > > > >> >> reason in favor of migration. It will be unnecessary to have
> > > metrics
> > > > >> from
> > > > >> >> multiple libraries coexist since our users will have to start
> > > > monitoring
> > > > >> >> these new metrics anyway.
> > > > >> >>
> > > > >> >> I also agree with Jay that in multi-tenant clusters people care
> > > about
> > > > >> >> detailed statistics for their own application over global
> > numbers.
> > > > >> >>
> > > > >> >> Based on the arguments so far, I'm +1 for migrating to KM.
> > > > >> >>
> > > > >> >> Thanks,
> > > > >> >> Aditya
> > > > >> >>
> > > > >> >> ________________________________________
> > > > >> >> From: Jun Rao [j...@confluent.io]
> > > > >> >> Sent: Sunday, March 29, 2015 9:44 AM
> > > > >> >> To: dev@kafka.apache.org
> > > > >> >> Subject: Re: Metrics package discussion
> > > > >> >>
> > > > >> >> There is another thing to consider. We plan to reuse the client
> > > > >> components
> > > > >> >> on the server side over time. For example, as part of the
> > security
> > > > >> work, we
> > > > >> >> are looking into replacing the server side network code with
> the
> > > > client
> > > > >> >> network code (KAFKA-1928). However, the client network already
> > has
> > > > >> metrics
> > > > >> >> based on KM.
> > > > >> >>
> > > > >> >> Thanks,
> > > > >> >>
> > > > >> >> Jun
> > > > >> >>
> > > > >> >> > On Sat, Mar 28, 2015 at 1:34 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > > > >> >>
> > > > >> >> > I think Joel's summary is good.
> > > > >> >> >
> > > > >> >> > I'll add a few more points:
> > > > >> >> >
> > > > >> >> > As discussed, memory matters a lot if we want to be able to
> > > > >> >> > give percentiles at the client or topic level, in which case we
> > > > >> >> > will have thousands of them.
> > > > >> >> > If we just do histograms at the global level then it is not a
> > > > concern.
> > > > >> >> The
> > > > >> >> > argument for doing histograms at the client and topic level
> is
> > > that
> > > > >> >> > averages are often very misleading, especially for latency
> > > > >> information or
> > > > >> >> > other asymmetric distributions. Most people who care about
> this
> > > > kind
> > > > >> of
> > > > >> >> > thing would say the same. If you are a user of a multi-tenant
> > > > cluster
> > > > >> >> then
> > > > >> >> > you probably care a lot more about stats for your application
> > or
> > > > your
> > > > >> >> topic
> > > > >> >> > rather than the global, so it could be nice to have
> histograms
> > > for
> > > > >> >> these. I
> > > > >> >> > don't feel super strongly about this.
> > > > >> >> >
> > > > >> >> > The ExponentiallyDecayingSample is internally
> > > > >> >> > a ConcurrentSkipListMap<Double, Long>. This seems to have an
> > > > overhead
> > > > >> of
> > > > >> >> > about 64 bytes per entry. So a 1000 element sample is 64KB.
> For
> > > > global
> > > > >> >> > metrics this is fine, but for granular metrics not workable.
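> > > > >> >> >
> > > > >> >> > Back of the envelope, using the numbers above:
> > > > >> >> >
> > > > >> >> >   1,000 entries/sample * 64 bytes = ~64 KB per histogram
> > > > >> >> >   10,000 histograms * 64 KB = ~640 MB of samples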
> > > > >> >> >
> > > > >> >> > Two other issues I'm not sure about:
> > > > >> >> >
> > > > >> >> > 1. Is there a way to get metric descriptions into the coda
> hale
> > > JMX
> > > > >> >> output?
> > > > >> >> > One of the nicest practical things about the new
> client
> > > > >> metrics is
> > > > >> >> > that if you look at them in jconsole each metric has an
> > > associated
> > > > >> >> > description that explains what it means. I think this is a
> nice
> > > > >> usability
> > > > >> >> > thing--it is really hard to know what to make of the current
> > > > metrics
> > > > >> >> > without this kind of documentation and keeping separate docs
> > > > >> up-to-date
> > > > >> >> is
> > > > >> >> > really hard and even if you do it most people won't find it.
> > > > >> >> >
> > > > >> >> > 2. I'm not clear if the sample decay in the histogram is
> > actually
> > > > the
> > > > >> >> same
> > > > >> >> > as for the other stats. It seems like it isn't but this would
> > > make
> > > > >> >> > interpretation quite difficult. In other words, if I have N
> > > > >> >> > metrics including some Histograms, some Meters, etc., are all
> > > > >> >> > these measurements taken over the same time window? I actually
> > > > >> >> > think they are not; it looks like there are different sampling
> > > > >> >> > methodologies across metric types. So this means if
> > > > >> >> > you have a dashboard that plots these things side by side the
> > > > >> measurement
> > > > >> >> > at a given point in time is not actually comparable across
> > > multiple
> > > > >> >> stats.
> > > > >> >> > Am I confused about this?
> > > > >> >> >
> > > > >> >> > -Jay
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > On Fri, Mar 27, 2015 at 6:27 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > > >> >> >
> > > > >> >> > > For the samples: it will be at least double that estimate I
> > > think
> > > > >> >> > > since the long array contains (eight byte) references to
> the
> > > > actual
> > > > >> >> > > longs, each of which also has some object overhead.
> > > > >> >> > >
> > > > >> >> > > Re: testing: actually, it looks like YM metrics does allow
> > you
> > > to
> > > > >> >> > > drop in your own clock:
> > > > >> >> > >
> > > > >> >> > > https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Clock.java
> > > > >> >> > >
> > > > >> >> > > https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Meter.java#L36
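> > > > >> >> > >
> > > > >> >> > > A minimal sketch, assuming metrics-core 3.x, of what dropping
> > > > >> >> > > in a fixed clock for a test could look like:
> > > > >> >> > >
> > > > >> >> > >   // Deterministic clock so rate/decay math is reproducible.
> > > > >> >> > >   Clock fixed = new Clock() {
> > > > >> >> > >       @Override
> > > > >> >> > >       public long getTick() { return 0L; } // nanoseconds
> > > > >> >> > >   };
> > > > >> >> > >   Meter meter = new Meter(fixed);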
> > > > >> >> > >
> > > > >> >> > > Not sure if it was mentioned in this (or some recent)
> thread
> > > but
> > > > a
> > > > >> >> > > major motivation in the kafka-common metrics (KM) was
> > absorbing
> > > > API
> > > > >> >> > > changes and even mbean naming conventions. E.g., in the
> > > early
> > > > >> >> > > stages of 0.8 we picked up YM metrics 3.x but collided with
> > > > client
> > > > >> >> > > apps at LinkedIn which were still on 2.x. We ended up
> > changing
> > > > our
> > > > >> >> > > code to use 2.x in the end. Having our own metrics package
> > > makes
> > > > us
> > > > >> >> > > less vulnerable to these kinds of changes. The multiple
> > version
> > > > >> >> > > collision problem is obviously less of an issue with the
> > broker
> > > > but
> > > > >> we
> > > > >> >> > > are still exposed to possible metric changes in YM metrics.
> > > > >> >> > >
> > > > >> >> > > I'm wondering if we need to weigh too much toward the
> memory
> > > > >> overheads
> > > > >> >> > > of histograms in making a decision here simply because I
> > don't
> > > > think
> > > > >> >> > > we have found them to be an extreme necessity for
> > > > >> >> > > per-clientid/per-partition metrics and they are more
> critical
> > > for
> > > > >> >> > > aggregate (global) metrics.
> > > > >> >> > >
> > > > >> >> > > So it seems the main benefits of switching to KM metrics
> are:
> > > > >> >> > > - Less exposure to YM metrics changes
> > > > >> >> > > - More control over the actual implementation. E.g., there
> is
> > > > >> >> > >   considerable research on implementing
> > > > approximate-but-good-enough
> > > > >> >> > >   histograms/percentiles that we can try out
> > > > >> >> > > - Differences (improvements) from YM metrics such as:
> > > > >> >> > >   - hierarchical sensors
> > > > >> >> > >   - integrated with quota enforcement
> > > > >> >> > >   - mbeans can logically group attributes computed from
> > > different
> > > > >> >> > >     sensors. So there is logical grouping (as opposed to a
> > > > separate
> > > > >> >> > >     mbean per sensor as is the case in YM metrics).
> > > > >> >> > >
> > > > >> >> > > The main disadvantages:
> > > > >> >> > > - Everyone's graphs and alerts will break and need to be
> > > updated
> > > > >> >> > > - Histogram support needs to be tested more/improved
> > > > >> >> > >
> > > > >> >> > > The first disadvantage is a big one but we aren't exactly
> > > immune
> > > > to
> > > > >> >> > > that if we stick with YM.
> > > > >> >> > >
> > > > >> >> > > BTW with KM metrics we should also provide reporters
> > (graphite,
> > > > >> >> > > ganglia) but we probably need to do this anyway since the
> new
> > > > >> clients
> > > > >> >> > > are on KM metrics.
> > > > >> >> > >
> > > > >> >> > > Thanks,
> > > > >> >> > >
> > > > >> >> > > Joel
> > > > >> >> > >
> > > > >> >> > > On Fri, Mar 27, 2015 at 06:48:48PM +0000, Aditya Auradkar wrote:
> > > > >> >> > > > Adding to what Jay said.
> > > > >> >> > > >
> > > > >> >> > > > The library maintains 1k samples by default. The
> > > UniformSample
> > > > >> has a
> > > > >> >> > > long array so about 8k overhead per histogram. The
> > > > >> >> > > ExponentiallyDecayingSample (which is what we use) has a 16
> > > byte
> > > > >> >> overhead
> > > > >> >> > > per stored sample, so about 16k per histogram. So 10k
> > > histograms
> > > > >> (worst
> > > > >> >> > > case? metrics per partition and client) is about 160MB of
> > > memory
> > > > in
> > > > >> the
> > > > >> >> > > broker.
> > > > >> >> > > >
> > > > >> >> > > > Copying is also a problem. For percentiles on
> > > > >> >> > > > HistogramMBean, the implementation copies the entire array.
> > > > >> >> > > > E.g., if we called get50Percentile() and get75Percentile(),
> > > > >> >> > > > the entire array would get copied twice, which is pretty bad
> > > > >> >> > > > if we called each metric on every MBean.
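> > > > >> >> > > >
> > > > >> >> > > > As a small illustration (assuming YM's Sampling/Snapshot
> > > > >> >> > > > API), taking one snapshot and reading every quantile from it
> > > > >> >> > > > would copy the sample array only once:
> > > > >> >> > > >
> > > > >> >> > > >   // One copy of the sample array, reused for all quantiles.
> > > > >> >> > > >   Snapshot snapshot = histogram.getSnapshot();
> > > > >> >> > > >   double p50 = snapshot.getMedian();
> > > > >> >> > > >   double p75 = snapshot.get75thPercentile();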
> > > > >> >> > > >
> > > > >> >> > > > Another point Joel mentioned is that codahale metrics are
> > > > harder
> > > > >> to
> > > > >> >> > > write tests against because we cannot pass in a Clock.
> > > > >> >> > > >
> > > > >> >> > > > IMO, if a library is preventing us from adding all the
> > > metrics
> > > > >> that
> > > > >> >> we
> > > > >> >> > > want to add and we have a viable alternative, we should
> > replace
> > > > it.
> > > > >> It
> > > > >> >> > > might be short term pain but in the long run we will have
> > more
> > > > >> useful
> > > > >> >> > > graphs.
> > > > >> >> > > > What do people think? I can start a vote thread on this
> > once
> > > we
> > > > >> have
> > > > >> >> a
> > > > >> >> > > couple more opinions.
> > > > >> >> > > >
> > > > >> >> > > > Thanks,
> > > > >> >> > > > Aditya
> > > > >> >> > > > ________________________________________
> > > > >> >> > > > From: Jay Kreps [jay.kr...@gmail.com]
> > > > >> >> > > > Sent: Thursday, March 26, 2015 2:29 PM
> > > > >> >> > > > To: dev@kafka.apache.org
> > > > >> >> > > > Subject: Re: Metrics package discussion
> > > > >> >> > > >
> > > > >> >> > > > Yeah that is a good summary.
> > > > >> >> > > >
> > > > >> >> > > > The reason we don't use histograms heavily in the server
> is
> > > > >> because
> > > > >> >> of
> > > > >> >> > > the
> > > > >> >> > > > memory issues. We originally did use histograms for
> > > everything,
> > > > >> then
> > > > >> >> we
> > > > >> >> > > ran
> > > > >> >> > > > into all these issues, and ripped them out. Whether they
> > are
> > > > >> really
> > > > >> >> > > useful
> > > > >> >> > > > or not, I don't know. Averages can be pretty misleading
> so
> > it
> > > > can
> > > > >> be
> > > > >> >> > nice
> > > > >> >> > > > but I don't know that it is critical.
> > > > >> >> > > >
> > > > >> >> > > > -Jay
> > > > >> >> > > >
> > > > >> >> > > > On Thu, Mar 26, 2015 at 1:58 PM, Aditya Auradkar <
> > > > >> >> > > > aaurad...@linkedin.com.invalid> wrote:
> > > > >> >> > > >
> > > > >> >> > > > > From what I can tell, Histograms don't seem to be used
> > > > >> extensively
> > > > >> >> in
> > > > >> >> > > the
> > > > >> >> > > > > Kafka server (only in RequestChannel.scala) and I'm not
> > > sure
> > > > we
> > > > >> >> need
> > > > >> >> > > them
> > > > >> >> > > > > for per-client metrics. Topic metrics use meters
> > currently.
> > > > >> >> > Migrating
> > > > >> >> > > > > graphing and alerting will be quite a significant effort
> for
> > > all
> > > > >> users
> > > > >> >> > of
> > > > >> >> > > > > Kafka. Do the potential benefits of the new metrics
> > package
> > > > >> >> outweigh
> > > > >> >> > > this
> > > > >> >> > > > > one time migration? In the long run it seems nice to
> > have a
> > > > >> unified
> > > > >> >> > > metrics
> > > > >> >> > > > > package across clients and server. If we were starting
> > out
> > > > from
> > > > >> >> > scratch
> > > > >> >> > > > > without any existing deployments, what decision would
> we
> > > > take?
> > > > >> >> > > > >
> > > > >> >> > > > > I suppose the relative effort in supporting quotas is a useful
> > > data
> > > > >> point
> > > > >> >> in
> > > > >> >> > > this
> > > > >> >> > > > > discussion. We need to throttle based on the current
> byte
> > > > rate
> > > > >> >> which
> > > > >> >> > > should
> > > > >> >> > > > > be a "Meter" in codahale terms. The Meter
> implementation
> > > > uses a
> > > > >> 1,
> > > > >> >> 5
> > > > >> >> > > and 15
> > > > >> >> > > > > minute exponentially weighted moving average. The library
> > also
> > > > does
> > > > >> not
> > > > >> >> > > use the
> > > > >> >> > > > > most recent samples of data for Metered metrics. For
> > > > calculating
> > > > >> >> > > rates, the
> > > > >> >> > > > > EWMA class has a scheduled task that runs every 5
> seconds
> > > and
> > > > >> >> adjusts
> > > > >> >> > > the
> > > > >> >> > > > > rate using the new data accordingly. In that particular
> > > > case, I
> > > > >> >> think
> > > > >> >> > > the
> > > > >> >> > > > > new library is superior since it is more responsive.
> If
> > we
> > > > do
> > > > >> >> choose
> > > > >> >> > > to
> > > > >> >> > > > > remain with Yammer on the server, here are a few ideas
> on
> > > > how to
> > > > >> >> > > support
> > > > >> >> > > > > quotas with relatively less effort.
> > > > >> >> > > > >
> > > > >> >> > > > > - We could have a new type of Meter called "QuotaMeter"
> > > that
> > > > can
> > > > >> >> wrap
> > > > >> >> > > the
> > > > >> >> > > > > existing meter code that follows the same pattern that
> > the
> > > > >> Sensor
> > > > >> >> > does
> > > > >> >> > > in
> > > > >> >> > > > > the new metrics library. This QuotaMeter needs to be
> > > > configured
> > > > >> >> with
> > > > >> >> > a
> > > > >> >> > > > > Quota and it can have a finer grained rate than 1
> minute
> > > (10
> > > > >> >> seconds?
> > > > >> >> > > > > configurable?). Anytime we call "mark()", it updates the
> > > > >> underlying
> > > > >> >> > > rates
> > > > >> >> > > > > and throws a QuotaViolationException if required. This
> > class
> > > > can
> > > > >> >> > either
> > > > >> >> > > > > extend Meter or be a separate implementation of the
> > Metric
> > > > >> >> superclass
> > > > >> >> > > that
> > > > >> >> > > > > every metric implements (see the sketch after this list).
> > > > >> >> > > > >
> > > > >> >> > > > > - We can also consider implementing these quotas with
> the
> > > new
> > > > >> >> metrics
> > > > >> >> > > > > package and have these co-exist with the existing
> > metrics.
> > > > This
> > > > >> >> leads
> > > > >> >> > > to 2
> > > > >> >> > > > > metric packages being used on the server, but they are
> > both
> > > > >> pulled
> > > > >> >> in
> > > > >> >> > > as
> > > > >> >> > > > > dependencies anyway. Using this for metrics we can
> quota
> > on
> > > > may
> > > > >> not
> > > > >> >> > be
> > > > >> >> > > a
> > > > >> >> > > > > bad place to start.
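> > > > >> >> > > > >
> > > > >> >> > > > > To make the first idea concrete, here is a rough sketch.
> > > > >> >> > > > > QuotaMeter and QuotaViolationException are hypothetical
> > > > >> >> > > > > names, and this wraps a Yammer Meter rather than extending
> > > > >> >> > > > > it:
> > > > >> >> > > > >
> > > > >> >> > > > >   // delegate is a com.yammer.metrics.core.Meter.
> > > > >> >> > > > >   class QuotaMeter {
> > > > >> >> > > > >       private final Meter delegate;
> > > > >> >> > > > >       private final double bound; // e.g. allowed bytes/sec
> > > > >> >> > > > >
> > > > >> >> > > > >       QuotaMeter(Meter delegate, double bound) {
> > > > >> >> > > > >           this.delegate = delegate;
> > > > >> >> > > > >           this.bound = bound;
> > > > >> >> > > > >       }
> > > > >> >> > > > >
> > > > >> >> > > > >       void mark(long n) {
> > > > >> >> > > > >           delegate.mark(n);
> > > > >> >> > > > >           // Hypothetical exception type for quota breaches.
> > > > >> >> > > > >           if (delegate.oneMinuteRate() > bound)
> > > > >> >> > > > >               throw new QuotaViolationException();
> > > > >> >> > > > >       }
> > > > >> >> > > > >   }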
> > > > >> >> > > > >
> > > > >> >> > > > > Thanks,
> > > > >> >> > > > > Aditya
> > > > >> >> > > > > ________________________________________
> > > > >> >> > > > > From: Jay Kreps [jay.kr...@gmail.com]
> > > > >> >> > > > > Sent: Wednesday, March 25, 2015 11:08 PM
> > > > >> >> > > > > To: dev@kafka.apache.org
> > > > >> >> > > > > Subject: Re: Metrics package discussion
> > > > >> >> > > > >
> > > > >> >> > > > > Here was my understanding of the issue last time.
> > > > >> >> > > > >
> > > > >> >> > > > > The yammer metrics use a random sample of requests to
> > > > estimate
> > > > >> the
> > > > >> >> > > > > histogram. This allocates a fairly large array of longs
> > > > (their
> > > > >> >> values
> > > > >> >> > > are
> > > > >> >> > > > > longs rather than floats). A reasonable sample might be
> > 8k
> > > > >> entries
> > > > >> >> > > which
> > > > >> >> > > > > would give about 64KB per histogram. There are bounds
> on
> > > > >> accuracy,
> > > > >> >> > but
> > > > >> >> > > they
> > > > >> >> > > > > are only probabilistic. I.e. if you try to get 99% < 5
> ms
> > > of
> > > > >> >> > > inaccuracy,
> > > > >> >> > > > > you will 1% of the time get more than this. This is okay,
> > > > >> >> > > > > but if you try to alert on it, you realize that being wrong
> > > > >> >> > > > > 1% of the time is a lot if you are computing stats every
> > > > >> >> > > > > second continuously on many metrics (i.e. 1 in 100
> > > > >> >> > > > > estimates will be outside your bound). This array is
> > > > copied
> > > > >> in
> > > > >> >> > full
> > > > >> >> > > > > every time you check the metric which is the other
> cause
> > of
> > > > the
> > > > >> >> > memory
> > > > >> >> > > > > pressure.
> > > > >> >> > > > >
> > > > >> >> > > > > The better approach to histograms is to calculate
> buckets
> > > > >> >> boundaries
> > > > >> >> > > and
> > > > >> >> > > > > record arbitrarily many values in those buckets. A
> simple
> > > > >> bucketing
> > > > >> >> > > > > approach for latency would be 0, 5ms, 10ms, 15ms, etc,
> > and
> > > > you
> > > > >> just
> > > > >> >> > > count
> > > > >> >> > > > > how many fall in each bucket. Your precision is
> > > > >> deterministically
> > > > >> >> > > bounded
> > > > >> >> > > > > by the bucket boundaries, so if you had 5ms buckets you
> > > would
> > > > >> never
> > > > >> >> > > have
> > > > >> >> > > > > more than 5ms loss of precision. By using non-uniform
> > > bucket
> > > > >> sizes
> > > > >> >> > you
> > > > >> >> > > can
> > > > >> >> > > > > make this work even better (e.g. give ~1ms precision
> for
> > > > >> latencies
> > > > >> >> in
> > > > >> >> > > the
> > > > >> >> > > > > 1ms range, but give only 1 second precision for
> latencies
> > > in
> > > > >> the 30
> > > > >> >> > > second
> > > > >> >> > > > > range). That is what is implemented in that metrics
> > > package.
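> > > > >> >> > > > >
> > > > >> >> > > > > A minimal fixed-width version of the idea, just to show
> > > > >> >> > > > > the mechanics (5ms buckets, capped at 1 second; the real
> > > > >> >> > > > > package uses non-uniform buckets):
> > > > >> >> > > > >
> > > > >> >> > > > >   // counts[i] holds samples in [i*5ms, (i+1)*5ms).
> > > > >> >> > > > >   long[] counts = new long[200];
> > > > >> >> > > > >
> > > > >> >> > > > >   void record(long latencyMs) {
> > > > >> >> > > > >       int bucket = (int) Math.min(latencyMs / 5, counts.length - 1);
> > > > >> >> > > > >       counts[bucket]++;
> > > > >> >> > > > >   }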
> > > > >> >> > > > >
> > > > >> >> > > > > I think this bucketing approach is popular now. There
> is
> > a
> > > > whole
> > > > >> >> "HDR
> > > > >> >> > > > > histogram" library that gives lots of different
> bucketing
> > > > >> methods
> > > > >> >> and
> > > > >> >> > > > > implements dynamic resizing so you don't have to
> specify
> > an
> > > > >> upper
> > > > >> >> > > bound.
> > > > >> >> > > > >  https://github.com/HdrHistogram/HdrHistogram
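> > > > >> >> > > > >
> > > > >> >> > > > > For reference, recording with HdrHistogram looks roughly
> > > > >> >> > > > > like this (values up to one hour in microseconds, with 3
> > > > >> >> > > > > significant digits of precision):
> > > > >> >> > > > >
> > > > >> >> > > > >   Histogram h = new Histogram(3600000000L, 3);
> > > > >> >> > > > >   h.recordValue(1250); // one latency sample, in micros
> > > > >> >> > > > >   long p99 = h.getValueAtPercentile(99.0);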
> > > > >> >> > > > >
> > > > >> >> > > > > Whether this matters depends entirely on whether you want
> > > histograms
> > > > >> broken
> > > > >> >> > > down at
> > > > >> >> > > > > the client, topic, partition, or broker level or just
> > want
> > > > >> overall
> > > > >> >> > > metrics.
> > > > >> >> > > > > If we just want per sever aggregates for histograms
> then
> > I
> > > > think
> > > > >> >> the
> > > > >> >> > > memory
> > > > >> >> > > > > usage is not a huge issue. If you want a histogram per
> > > topic
> > > > or
> > > > >> >> > client
> > > > >> >> > > or
> > > > >> >> > > > > partition and have 10k of these then that is where you
> > > start
> > > > >> >> talking
> > > > >> >> > > like
> > > > >> >> > > > > 1GB of memory with the yammer package, which is what we
> > hit
> > > > last
> > > > >> >> > time.
> > > > >> >> > > > > Getting percentiles on the client level is nice,
> > > percentiles
> > > > are
> > > > >> >> > > definitely
> > > > >> >> > > > > better than averages, but I'm not sure it is required.
> > > > >> >> > > > >
> > > > >> >> > > > > -Jay
> > > > >> >> > > > >
> > > > >> >> > > > > On Wed, Mar 25, 2015 at 9:43 PM, Neha Narkhede <n...@confluent.io> wrote:
> > > > >> >> > > > >
> > > > >> >> > > > > > Aditya,
> > > > >> >> > > > > >
> > > > >> >> > > > > > If we are doing a deep dive, one of the things to
> > > > investigate
> > > > >> >> would
> > > > >> >> > > be
> > > > >> >> > > > > > memory/GC performance. IIRC, when I was looking into
> > > > codahale
> > > > >> at
> > > > >> >> > > > > LinkedIn,
> > > > >> >> > > > > > I remember it having quite a few memory management
> and
> > GC
> > > > >> issues
> > > > >> >> > > while
> > > > >> >> > > > > > using histograms. In comparison, histograms in the
> new
> > > > metrics
> > > > >> >> > > package
> > > > >> >> > > > > > aren't very well tested.
> > > > >> >> > > > > >
> > > > >> >> > > > > > Thanks,
> > > > >> >> > > > > > Neha
> > > > >> >> > > > > >
> > > > >> >> > > > > > On Wed, Mar 25, 2015 at 8:25 AM, Aditya Auradkar <
> > > > >> >> > > > > > aaurad...@linkedin.com.invalid> wrote:
> > > > >> >> > > > > >
> > > > >> >> > > > > > > Hey everyone,
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > Picking up this discussion after yesterday's KIP
> > > hangout.
> > > > For
> > > > >> >> > > anyone who
> > > > >> >> > > > > > > did not join the meeting, we have 2 different
> metrics
> > > > >> packages
> > > > >> >> > > being
> > > > >> >> > > > > used
> > > > >> >> > > > > > > by the clients (custom package) and the server
> > > > (codahale).
> > > > >> We
> > > > >> >> are
> > > > >> >> > > > > > > discussing whether to migrate the server to the new
> > > > package.
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > What information do we need in order to make a
> > > decision?
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > Some pros of the new package:
> > > > >> >> > > > > > > - Using the most recent information by combining
> data
> > > > from
> > > > >> >> > > previous and
> > > > >> >> > > > > > > current samples. I'm not sure how codahale does
> this
> > so
> > > > I'll
> > > > >> >> > > > > investigate.
> > > > >> >> > > > > > > - We can quota on anything we measure. This is
> pretty
> > > > cool
> > > > >> IMO.
> > > > >> >> > > I'll
> > > > >> >> > > > > > > investigate the feasibility of adding this feature
> in
> > > > >> codahale.
> > > > >> >> > > > > > > - Hierarchical metrics. For example: we can define
> a
> > > > sensor
> > > > >> for
> > > > >> >> > > overall
> > > > >> >> > > > > > > bytes-in/bytes-out and also per-client. Updating
> the
> > > > client
> > > > >> >> > sensor
> > > > >> >> > > will
> > > > >> >> > > > > > > cause the global byte rate sensor to get modified
> > too.
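> > > > >> >> > > > > > >
> > > > >> >> > > > > > > A rough sketch of that hierarchy with the
> > > > >> >> > > > > > > org.apache.kafka.common.metrics API (names here are
> > > > >> >> > > > > > > illustrative, not actual broker metric names):
> > > > >> >> > > > > > >
> > > > >> >> > > > > > >   Metrics metrics = new Metrics();
> > > > >> >> > > > > > >   Sensor allBytesIn = metrics.sensor("bytes-in");
> > > > >> >> > > > > > >   allBytesIn.add(new MetricName("byte-rate", "server"), new Rate());
> > > > >> >> > > > > > >   // Child sensor: recording here also updates the parent.
> > > > >> >> > > > > > >   Sensor clientBytesIn = metrics.sensor("client1-bytes-in", allBytesIn);
> > > > >> >> > > > > > >   clientBytesIn.add(new MetricName("client1-byte-rate", "clients"), new Rate());
> > > > >> >> > > > > > >   clientBytesIn.record(1024.0); // bytes received from client1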
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > What are some of the issues with codahale? One
> > previous
> > > > >> >> > discussion
> > > > >> >> > > > > > > mentions high memory usage but I don't have any
> > > > experience
> > > > >> with
> > > > >> >> > it
> > > > >> >> > > > > > myself.
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > Thanks,
> > > > >> >> > > > > > > Aditya
> > > > >> >> > > > > >
> > > > >> >> > > > > > --
> > > > >> >> > > > > > Thanks,
> > > > >> >> > > > > > Neha
> > > > >> >> > > > > >