Re: Metrics package discussion

2015-04-24 Thread Jun Rao
> > > > > If so, it seems that the sooner that we do this, the better. It is
> > > > > important to give people an easy path for migration. However, it may
> > > > > not be easy to keep the mbean names exactly the same. For example, YM
> > > > > has hardcoded attributes (e.g. 1-min-rate, 5-min-rate, 15-min-rate,
> > > > > etc for rates) that are not available in KM.
> > > > >
> > > > > One benefit out of this migration is that one can get the metrics in
> > > > > the client and the broker in the same way.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Mar 30, 2015 at 9:26 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
> > > > >
> > > > >> (1) It will be interesting to see what others use for monitoring
> > > > >> integration, to see what is already covered with existing JMX
> > > > >> integrations and what needs special support.
> > > > >>
> > > > >> (2) I think the migration story is more important - this is a
> > > > >> non-compatible change, right? So we can't do it in 0.8.3 timeframe, it
> > > > >> has to be in 0.9? And we need to figure out how will users migrate -
> > > > >> do we just tell everyone "please reconfigure all your monitors from
> > > > >> scratch - don't worry, it is worth it?"
> > > > >> I know you keep saying we did it before and our users are used to it,
> > > > >> but I think there are a lot more users now, and some of them have
> > > > >> different compatibility expectations. We probably need to find:
> > > > >> * A least painful way to migrate - can we keep the names of at least
> > > > >> most of the metrics intact?
> > > > >> * Good explanation of what users gain from this painful migration
> > > > >> (i.e. more accurate statistics due to gazillion histograms)
> > > > >>
> > > > >> On Mon, Mar 30, 2015 at 6:29 PM, Jun Rao wrote:
> > > > >> > If we are committed to migrating the broker side metrics to KM for
> > > > >> > the next release, we will need to (1) have a story on supporting
> > > > >> > common reporters (as listed in KAFKA-1930), and (2) see if the
> > > > >> > current histogram support is good enough for measuring things like
> > > > >> > request time.
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Jun
> > > > >> >
> > > > >> > On Mon, Mar 30, 2015 at 3:03 PM, Aditya Auradkar
> > > > >> > <aaurad...@linkedin.com.invalid> wrote:
> > > > >> >
> > > > >> >> If we do plan to use the network code in client, I think that is a
> > > > >> >> good reason in favor of migration. It will be unnecessary to have
> > > > >> >> metrics from multiple libraries coexist since our users will have
> > > > >> >> to start monitoring these new metrics anyway.
> > > > >> >>
> > > > >> >> I also agree with Jay that in multi-tenant clusters people care
> > > > >> >> about detailed statistics for their own application over global
> > > > >> >> numbers.
> > > > >> >>
> > > > >> >> Based on the arguments so far, I'm +1 for migrating to KM.
> > > > >> >>
> > > > >> >> Thanks,
> > > > >> >> Aditya
> > > > >> >>
> > > > >> >> From: Jun Rao [j...@confluent.io]
> > > > >> >> Sent: Sunday, March 29, 2015 9:44 AM
> > > > >> >> To: dev@kafka.apache.org

Re: Metrics package discussion

2015-04-21 Thread Otis Gospodnetic
> > > >> it has to be in 0.9? And we need to figure out how will users migrate -
> > > >> do we just tell everyone "please reconfigure all your monitors from
> > > >> scratch - don't worry, it is worth it?"
> > > >> I know you keep saying we did it before and our users are used to it,
> > > >> but I think there are a lot more users now, and some of them have
> > > >> different compatibility expectations. We probably need to find:
> > > >> * A least painful way to migrate - can we keep the names of at least
> > > >> most of the metrics intact?
> > > >> * Good explanation of what users gain from this painful migration
> > > >> (i.e. more accurate statistics due to gazillion histograms)
> > > >>
> > > >> On Mon, Mar 30, 2015 at 6:29 PM, Jun Rao wrote:
> > > >> > If we are committed to migrating the broker side metrics to KM for
> > > >> > the next release, we will need to (1) have a story on supporting
> > > >> > common reporters (as listed in KAFKA-1930), and (2) see if the
> > > >> > current histogram support is good enough for measuring things like
> > > >> > request time.
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jun
> > > >> >
> > > >> > On Mon, Mar 30, 2015 at 3:03 PM, Aditya Auradkar
> > > >> > <aaurad...@linkedin.com.invalid> wrote:
> > > >> >
> > > >> >> If we do plan to use the network code in client, I think that is a
> > > >> >> good reason in favor of migration. It will be unnecessary to have
> > > >> >> metrics from multiple libraries coexist since our users will have
> > > >> >> to start monitoring these new metrics anyway.
> > > >> >>
> > > >> >> I also agree with Jay that in multi-tenant clusters people care
> > > >> >> about detailed statistics for their own application over global
> > > >> >> numbers.
> > > >> >>
> > > >> >> Based on the arguments so far, I'm +1 for migrating to KM.
> > > >> >>
> > > >> >> Thanks,
> > > >> >> Aditya
> > > >> >>
> > > >> >> From: Jun Rao [j...@confluent.io]
> > > >> >> Sent: Sunday, March 29, 2015 9:44 AM
> > > >> >> To: dev@kafka.apache.org
> > > >> >> Subject: Re: Metrics package discussion
> > > >> >>
> > > >> >> There is another thing to consider. We plan to reuse the client
> > > >> >> components on the server side over time. For example, as part of
> > > >> >> the security work, we are looking into replacing the server side
> > > >> >> network code with the client network code (KAFKA-1928). However,
> > > >> >> the client network already has metrics based on KM.
> > > >> >>
> > > >> >> Thanks,
> > > >> >>
> > > >> >> Jun
> > > >> >>
> > > >> >> On Sat, Mar 28, 2015 at 1:34 PM, Jay Kreps wrote:
> > > >> >>
> > > >> >> > I think Joel's summary is good.
> > > >> >> >
> > > >> >> > I'll add a few more points:
> > > >> >> >
> > > >> >> > As discussed memory matters a lot if we want to be able to give
> > > >> >> > percentiles at the client or topic level, in which case we will
> > > >> >> > have thousands of them. If we just do histograms at the global
> > > >> >> > level then it is not a concern. The argument for doing histograms
> > > >> >> > at the client and topic level is that average

Re: Metrics package discussion

2015-03-31 Thread Steven Wu
> My main concern is that if we don't do the migration in 0.8.3, we will be
> left with some metrics in YM format and some others in KM format (as we
> start sharing client code on the broker). This is probably a worse
> situation to be in.

+1. I am not sure how our servo adaptor will work if there are two formats
for metrics, unless there is an easy way to check the format (YM/KM).


On Tue, Mar 31, 2015 at 9:40 AM, Jun Rao  wrote:

> (2) The metrics are clearly part of the client API and we are not changing
> that (at least for the new client). Arguably, the metrics are also part of
> the broker side API. However, since they affect fewer parties (mostly just
> the Kafka admins), it may be easier to make those changes.
>
> My main concern is that if we don't do the migration in 0.8.3, we will be left
> with some metrics in YM format and some others in KM format (as we start
> sharing client code on the broker). This is probably a worse situation to
> be in.
>
> Thanks,
>
> Jun
>
> On Tue, Mar 31, 2015 at 9:26 AM, Gwen Shapira 
> wrote:
>
> > (2) I believe we agreed that our metrics are a public API. I believe
> > we also agree we don't break API in minor releases. So, it seems
> > obvious to me that we can't make breaking changes to metrics in minor
> > releases. I'm not convinced "we did it in the past" is a good reason
> > to do it again.
> >
> > Is there a strong reason to do it in a 0.8.3 time-frame?
> >
> > On Tue, Mar 31, 2015 at 7:59 AM, Jun Rao  wrote:
> > > (2) Not sure why we can't do this in 0.8.3. We changed the metrics
> names
> > in
> > > 0.8.2 already. Given that we need to share code btw the client and the
> > > core, and we need to keep the metrics consistent on the broker, it
> seems
> > > that we have no choice but to migrate to KM. If so, it seems that the
> > > sooner that we do this, the better. It is important to give people an
> > easy
> > > path for migration. However, it may not be easy to keep the mbean names
> > > exactly the same. For example, YM has hardcoded attributes (e.g.
> > > 1-min-rate, 5-min-rate, 15-min-rate, etc for rates) that are not
> > available
> > > in KM.
> > >
> > > One benefit out of this migration is that one can get the metrics in
> the
> > > client and the broker in the same way.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Mar 30, 2015 at 9:26 PM, Gwen Shapira 
> > wrote:
> > >
> > >> (1) It will be interesting to see what others use for monitoring
> > >> integration, to see what is already covered with existing JMX
> > >> integrations and what needs special support.
> > >>
> > >> (2) I think the migration story is more important - this is a
> > >> non-compatible change, right? So we can't do it in 0.8.3 timeframe, it
> > >> has to be in 0.9? And we need to figure out how will users migrate -
> > >> do we just tell everyone "please reconfigure all your monitors from
> > >> scratch - don't worry, it is worth it?"
> > >> I know you keep saying we did it before and our users are used to it,
> > >> but I think there are a lot more users now, and some of them have
> > >> different compatibility expectations. We probably need to find:
> > >> * A least painful way to migrate - can we keep the names of at least
> > >> most of the metrics intact?
> > >> * Good explanation of what users gain from this painful migration
> > >> (i.e. more accurate statistics due to gazillion histograms)
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Mar 30, 2015 at 6:29 PM, Jun Rao  wrote:
> > >> > If we are committed to migrating the broker side metrics to KM for
> the
> > >> next
> > >> > release, we will need to (1) have a story on supporting common
> > reporters
> > >> > (as listed in KAFKA-1930), and (2) see if the current histogram
> > support
> > >> is
> > >> > good enough for measuring things like request time.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jun
> > >> >
> > >> > On Mon, Mar 30, 2015 at 3:03 PM, Aditya Auradkar <
> > >> > aaurad...@linkedin.com.invalid> wrote:
> > >> >
> > >> >> If we do plan to use the network code in client, I think that is a
> > good
> > >

Re: Metrics package discussion

2015-03-31 Thread Jun Rao
(2) The metrics are clearly part of the client API and we are not changing
that (at least for the new client). Arguably, the metrics are also part of
the broker side API. However, since they affect fewer parties (mostly just
the Kafka admins), it may be easier to make those changes.

My main concern is that if we don't do the migration in 0.8.3, we will be left
with some metrics in YM format and some others in KM format (as we start
sharing client code on the broker). This is probably a worse situation to
be in.

Thanks,

Jun

On Tue, Mar 31, 2015 at 9:26 AM, Gwen Shapira  wrote:

> (2) I believe we agreed that our metrics are a public API. I believe
> we also agree we don't break API in minor releases. So, it seems
> obvious to me that we can't make breaking changes to metrics in minor
> releases. I'm not convinced "we did it in the past" is a good reason
> to do it again.
>
> Is there a strong reason to do it in a 0.8.3 time-frame?
>
> On Tue, Mar 31, 2015 at 7:59 AM, Jun Rao  wrote:
> > (2) Not sure why we can't do this in 0.8.3. We changed the metrics names
> in
> > 0.8.2 already. Given that we need to share code btw the client and the
> > core, and we need to keep the metrics consistent on the broker, it seems
> > that we have no choice but to migrate to KM. If so, it seems that the
> > sooner that we do this, the better. It is important to give people an
> easy
> > path for migration. However, it may not be easy to keep the mbean names
> > exactly the same. For example, YM has hardcoded attributes (e.g.
> > 1-min-rate, 5-min-rate, 15-min-rate, etc for rates) that are not
> available
> > in KM.
> >
> > One benefit out of this migration is that one can get the metrics in the
> > client and the broker in the same way.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Mar 30, 2015 at 9:26 PM, Gwen Shapira 
> wrote:
> >
> >> (1) It will be interesting to see what others use for monitoring
> >> integration, to see what is already covered with existing JMX
> >> integrations and what needs special support.
> >>
> >> (2) I think the migration story is more important - this is a
> >> non-compatible change, right? So we can't do it in 0.8.3 timeframe, it
> >> has to be in 0.9? And we need to figure out how will users migrate -
> >> do we just tell everyone "please reconfigure all your monitors from
> >> scratch - don't worry, it is worth it?"
> >> I know you keep saying we did it before and our users are used to it,
> >> but I think there are a lot more users now, and some of them have
> >> different compatibility expectations. We probably need to find:
> >> * A least painful way to migrate - can we keep the names of at least
> >> most of the metrics intact?
> >> * Good explanation of what users gain from this painful migration
> >> (i.e. more accurate statistics due to gazillion histograms)
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Mar 30, 2015 at 6:29 PM, Jun Rao  wrote:
> >> > If we are committed to migrating the broker side metrics to KM for the
> >> next
> >> > release, we will need to (1) have a story on supporting common
> reporters
> >> > (as listed in KAFKA-1930), and (2) see if the current histogram
> support
> >> is
> >> > good enough for measuring things like request time.
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> > On Mon, Mar 30, 2015 at 3:03 PM, Aditya Auradkar <
> >> > aaurad...@linkedin.com.invalid> wrote:
> >> >
> >> >> If we do plan to use the network code in client, I think that is a
> good
> >> >> reason in favor of migration. It will be unnecessary to have metrics
> >> from
> >> >> multiple libraries coexist since our users will have to start
> monitoring
> >> >> these new metrics anyway.
> >> >>
> >> >> I also agree with Jay that in multi-tenant clusters people care about
> >> >> detailed statistics for their own application over global numbers.
> >> >>
> >> >> Based on the arguments so far, I'm +1 for migrating to KM.
> >> >>
> >> >> Thanks,
> >> >> Aditya
> >> >>
> >> >> 
> >> >> From: Jun Rao [j...@confluent.io]
> >> >> Sent: Sunday, March 29, 2015 9:44 AM
> >> >> To: dev@kafka.apache.org

Re: Metrics package discussion

2015-03-31 Thread Gwen Shapira
(2) I believe we agreed that our metrics are a public API. I believe
we also agree we don't break API in minor releases. So, it seems
obvious to me that we can't make breaking changes to metrics in minor
releases. I'm not convinced "we did it in the past" is a good reason
to do it again.

Is there a strong reason to do it in a 0.8.3 time-frame?

On Tue, Mar 31, 2015 at 7:59 AM, Jun Rao  wrote:
> (2) Not sure why we can't do this in 0.8.3. We changed the metrics names in
> 0.8.2 already. Given that we need to share code btw the client and the
> core, and we need to keep the metrics consistent on the broker, it seems
> that we have no choice but to migrate to KM. If so, it seems that the
> sooner that we do this, the better. It is important to give people an easy
> path for migration. However, it may not be easy to keep the mbean names
> exactly the same. For example, YM has hardcoded attributes (e.g.
> 1-min-rate, 5-min-rate, 15-min-rate, etc for rates) that are not available
> in KM.
>
> One benefit out of this migration is that one can get the metrics in the
> client and the broker in the same way.
>
> Thanks,
>
> Jun
>
> On Mon, Mar 30, 2015 at 9:26 PM, Gwen Shapira  wrote:
>
>> (1) It will be interesting to see what others use for monitoring
>> integration, to see what is already covered with existing JMX
>> integrations and what needs special support.
>>
>> (2) I think the migration story is more important - this is a
>> non-compatible change, right? So we can't do it in 0.8.3 timeframe, it
>> has to be in 0.9? And we need to figure out how will users migrate -
>> do we just tell everyone "please reconfigure all your monitors from
>> scratch - don't worry, it is worth it?"
>> I know you keep saying we did it before and our users are used to it,
>> but I think there are a lot more users now, and some of them have
>> different compatibility expectations. We probably need to find:
>> * A least painful way to migrate - can we keep the names of at least
>> most of the metrics intact?
>> * Good explanation of what users gain from this painful migration
>> (i.e. more accurate statistics due to gazillion histograms)
>>
>>
>>
>>
>>
>>
>> On Mon, Mar 30, 2015 at 6:29 PM, Jun Rao  wrote:
>> > If we are committed to migrating the broker side metrics to KM for the
>> next
>> > release, we will need to (1) have a story on supporting common reporters
>> > (as listed in KAFKA-1930), and (2) see if the current histogram support
>> is
>> > good enough for measuring things like request time.
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Mon, Mar 30, 2015 at 3:03 PM, Aditya Auradkar <
>> > aaurad...@linkedin.com.invalid> wrote:
>> >
>> >> If we do plan to use the network code in client, I think that is a good
>> >> reason in favor of migration. It will be unnecessary to have metrics
>> from
>> >> multiple libraries coexist since our users will have to start monitoring
>> >> these new metrics anyway.
>> >>
>> >> I also agree with Jay that in multi-tenant clusters people care about
>> >> detailed statistics for their own application over global numbers.
>> >>
>> >> Based on the arguments so far, I'm +1 for migrating to KM.
>> >>
>> >> Thanks,
>> >> Aditya
>> >>
>> >> 
>> >> From: Jun Rao [j...@confluent.io]
>> >> Sent: Sunday, March 29, 2015 9:44 AM
>> >> To: dev@kafka.apache.org
>> >> Subject: Re: Metrics package discussion
>> >>
>> >> There is another thing to consider. We plan to reuse the client
>> components
>> >> on the server side over time. For example, as part of the security
>> work, we
>> >> are looking into replacing the server side network code with the client
>> >> network code (KAFKA-1928). However, the client network already has
>> metrics
>> >> based on KM.
>> >>
>> >> Thanks,
>> >>
>> >> Jun
>> >>
>> >> On Sat, Mar 28, 2015 at 1:34 PM, Jay Kreps  wrote:
>> >>
>> >> > I think Joel's summary is good.
>> >> >
>> >> > I'll add a few more points:
>> >> >
> >> >> > As discussed memory matters a lot if we want to be able to give
>> >> percentiles
>> >> > at the client or topic level, in which case we will have

Re: Metrics package discussion

2015-03-31 Thread Jun Rao
(2) Not sure why we can't do this in 0.8.3. We changed the metrics names in
0.8.2 already. Given that we need to share code btw the client and the
core, and we need to keep the metrics consistent on the broker, it seems
that we have no choice but to migrate to KM. If so, it seems that the
sooner that we do this, the better. It is important to give people an easy
path for migration. However, it may not be easy to keep the mbean names
exactly the same. For example, YM has hardcoded attributes (e.g.
1-min-rate, 5-min-rate, 15-min-rate, etc for rates) that are not available
in KM.

One benefit out of this migration is that one can get the metrics in the
client and the broker in the same way.

Thanks,

Jun
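
[For concreteness, a minimal sketch of the difference Jun describes: a Yammer
Meter always exposes the fixed 1/5/15-minute rate attributes, while in the
new-client metrics package a rate is a single named attribute on a sensor.
This assumes the Yammer 2.x API and org.apache.kafka.common.metrics; the
metric and group names below are made up for the example.]

import java.util.Collections;
import java.util.concurrent.TimeUnit;
import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.Meter;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

public class RateExample {
    public static void main(String[] args) {
        // YM: a Meter always carries the hardcoded 1/5/15-minute EWMA
        // attributes (1-min-rate, 5-min-rate, 15-min-rate) on its mbean.
        Meter ymMeter = Metrics.newMeter(RateExample.class, "MessagesInPerSec",
                "messages", TimeUnit.SECONDS);
        ymMeter.mark();
        System.out.println(ymMeter.oneMinuteRate());
        System.out.println(ymMeter.fifteenMinuteRate());

        // KM: a rate is one named attribute on a sensor; the measurement
        // window comes from the sampled-stat config rather than fixed
        // 1/5/15-minute EWMAs, and the same code path is shared with the clients.
        org.apache.kafka.common.metrics.Metrics kmMetrics =
                new org.apache.kafka.common.metrics.Metrics();
        Sensor sensor = kmMetrics.sensor("messages-in");
        sensor.add(new MetricName("messages-in-rate", "example-metrics",
                        "Messages in per second",
                        Collections.<String, String>emptyMap()),
                new Rate());
        sensor.record(1.0);
    }
}

(The fully qualified class name is only there to avoid the clash between the
two Metrics classes.)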

On Mon, Mar 30, 2015 at 9:26 PM, Gwen Shapira  wrote:

> (1) It will be interesting to see what others use for monitoring
> integration, to see what is already covered with existing JMX
> integrations and what needs special support.
>
> (2) I think the migration story is more important - this is a
> non-compatible change, right? So we can't do it in 0.8.3 timeframe, it
> has to be in 0.9? And we need to figure out how will users migrate -
> do we just tell everyone "please reconfigure all your monitors from
> scratch - don't worry, it is worth it?"
> I know you keep saying we did it before and our users are used to it,
> but I think there are a lot more users now, and some of them have
> different compatibility expectations. We probably need to find:
> * A least painful way to migrate - can we keep the names of at least
> most of the metrics intact?
> * Good explanation of what users gain from this painful migration
> (i.e. more accurate statistics due to gazillion histograms)
>
>
>
>
>
>
> On Mon, Mar 30, 2015 at 6:29 PM, Jun Rao  wrote:
> > If we are committed to migrating the broker side metrics to KM for the
> next
> > release, we will need to (1) have a story on supporting common reporters
> > (as listed in KAFKA-1930), and (2) see if the current histogram support
> is
> > good enough for measuring things like request time.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Mar 30, 2015 at 3:03 PM, Aditya Auradkar <
> > aaurad...@linkedin.com.invalid> wrote:
> >
> >> If we do plan to use the network code in client, I think that is a good
> >> reason in favor of migration. It will be unnecessary to have metrics
> from
> >> multiple libraries coexist since our users will have to start monitoring
> >> these new metrics anyway.
> >>
> >> I also agree with Jay that in multi-tenant clusters people care about
> >> detailed statistics for their own application over global numbers.
> >>
> >> Based on the arguments so far, I'm +1 for migrating to KM.
> >>
> >> Thanks,
> >> Aditya
> >>
> >> 
> >> From: Jun Rao [j...@confluent.io]
> >> Sent: Sunday, March 29, 2015 9:44 AM
> >> To: dev@kafka.apache.org
> >> Subject: Re: Metrics package discussion
> >>
> >> There is another thing to consider. We plan to reuse the client
> components
> >> on the server side over time. For example, as part of the security
> work, we
> >> are looking into replacing the server side network code with the client
> >> network code (KAFKA-1928). However, the client network already has
> metrics
> >> based on KM.
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Sat, Mar 28, 2015 at 1:34 PM, Jay Kreps  wrote:
> >>
> >> > I think Joel's summary is good.
> >> >
> >> > I'll add a few more points:
> >> >
> > >> > As discussed memory matters a lot if we want to be able to give
> >> percentiles
> >> > at the client or topic level, in which case we will have thousands of
> >> them.
> >> > If we just do histograms at the global level then it is not a concern.
> >> The
> >> > argument for doing histograms at the client and topic level is that
> >> > averages are often very misleading, especially for latency
> information or
> >> > other asymmetric distributions. Most people who care about this kind
> of
> >> > thing would say the same. If you are a user of a multi-tenant cluster
> >> then
> >> > you probably care a lot more about stats for your application or your
> >> topic
> >> > rather than the global, so it could be nice to have histograms for
> >> these. I
> >> > don't feel super strongly about this.
> >&

Re: Metrics package discussion

2015-03-30 Thread Gwen Shapira
(1) It will be interesting to see what others use for monitoring
integration, to see what is already covered with existing JMX
integrations and what needs special support.

(2) I think the migration story is more important - this is a
non-compatible change, right? So we can't do it in 0.8.3 timeframe, it
has to be in 0.9? And we need to figure out how will users migrate -
do we just tell everyone "please reconfigure all your monitors from
scratch - don't worry, it is worth it?"
I know you keep saying we did it before and our users are used to it,
but I think there are a lot more users now, and some of them have
different compatibility expectations. We probably need to find:
* A least painful way to migrate - can we keep the names of at least
most of the metrics intact?
* Good explanation of what users gain from this painful migration
(i.e. more accurate statistics due to gazillion histograms)






On Mon, Mar 30, 2015 at 6:29 PM, Jun Rao  wrote:
> If we are committed to migrating the broker side metrics to KM for the next
> release, we will need to (1) have a story on supporting common reporters
> (as listed in KAFKA-1930), and (2) see if the current histogram support is
> good enough for measuring things like request time.
>
> Thanks,
>
> Jun
>
> On Mon, Mar 30, 2015 at 3:03 PM, Aditya Auradkar <
> aaurad...@linkedin.com.invalid> wrote:
>
>> If we do plan to use the network code in client, I think that is a good
>> reason in favor of migration. It will be unnecessary to have metrics from
>> multiple libraries coexist since our users will have to start monitoring
>> these new metrics anyway.
>>
>> I also agree with Jay that in multi-tenant clusters people care about
>> detailed statistics for their own application over global numbers.
>>
>> Based on the arguments so far, I'm +1 for migrating to KM.
>>
>> Thanks,
>> Aditya
>>
>> ____________________
>> From: Jun Rao [j...@confluent.io]
>> Sent: Sunday, March 29, 2015 9:44 AM
>> To: dev@kafka.apache.org
>> Subject: Re: Metrics package discussion
>>
>> There is another thing to consider. We plan to reuse the client components
>> on the server side over time. For example, as part of the security work, we
>> are looking into replacing the server side network code with the client
>> network code (KAFKA-1928). However, the client network already has metrics
>> based on KM.
>>
>> Thanks,
>>
>> Jun
>>
>> On Sat, Mar 28, 2015 at 1:34 PM, Jay Kreps  wrote:
>>
>> > I think Joel's summary is good.
>> >
>> > I'll add a few more points:
>> >
> >> > As discussed memory matters a lot if we want to be able to give
>> percentiles
>> > at the client or topic level, in which case we will have thousands of
>> them.
>> > If we just do histograms at the global level then it is not a concern.
>> The
>> > argument for doing histograms at the client and topic level is that
>> > averages are often very misleading, especially for latency information or
>> > other asymmetric distributions. Most people who care about this kind of
>> > thing would say the same. If you are a user of a multi-tenant cluster
>> then
>> > you probably care a lot more about stats for your application or your
>> topic
>> > rather than the global, so it could be nice to have histograms for
>> these. I
>> > don't feel super strongly about this.
>> >
>> > The ExponentiallyDecayingSample is internally
>> > a ConcurrentSkipListMap. This seems to have an overhead of
>> > about 64 bytes per entry. So a 1000 element sample is 64KB. For global
>> > metrics this is fine, but for granular metrics not workable.
>> >
>> > Two other issues I'm not sure about:
>> >
>> > 1. Is there a way to get metric descriptions into the coda hale JMX
>> output?
>> > One of the really nicest practical things about the new client metrics is
>> > that if you look at them in jconsole each metric has an associated
>> > description that explains what it means. I think this is a nice usability
>> > thing--it is really hard to know what to make of the current metrics
>> > without this kind of documentation and keeping separate docs up-to-date
>> is
>> > really hard and even if you do it most people won't find it.
>> >
>> > 2. I'm not clear if the sample decay in the histogram is actually the
>> same
>> > as for the other stats. It seems like it isn't but this would make
>> > interpretation 

Re: Metrics package discussion

2015-03-30 Thread Jun Rao
If we are committed to migrating the broker side metrics to KM for the next
release, we will need to (1) have a story on supporting common reporters
(as listed in KAFKA-1930), and (2) see if the current histogram support is
good enough for measuring things like request time.

Thanks,

Jun

On Mon, Mar 30, 2015 at 3:03 PM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> If we do plan to use the network code in client, I think that is a good
> reason in favor of migration. It will be unnecessary to have metrics from
> multiple libraries coexist since our users will have to start monitoring
> these new metrics anyway.
>
> I also agree with Jay that in multi-tenant clusters people care about
> detailed statistics for their own application over global numbers.
>
> Based on the arguments so far, I'm +1 for migrating to KM.
>
> Thanks,
> Aditya
>
> 
> From: Jun Rao [j...@confluent.io]
> Sent: Sunday, March 29, 2015 9:44 AM
> To: dev@kafka.apache.org
> Subject: Re: Metrics package discussion
>
> There is another thing to consider. We plan to reuse the client components
> on the server side over time. For example, as part of the security work, we
> are looking into replacing the server side network code with the client
> network code (KAFKA-1928). However, the client network already has metrics
> based on KM.
>
> Thanks,
>
> Jun
>
> On Sat, Mar 28, 2015 at 1:34 PM, Jay Kreps  wrote:
>
> > I think Joel's summary is good.
> >
> > I'll add a few more points:
> >
> > As discussed memory matters a lot if we want to be able to give
> percentiles
> > at the client or topic level, in which case we will have thousands of
> them.
> > If we just do histograms at the global level then it is not a concern.
> The
> > argument for doing histograms at the client and topic level is that
> > averages are often very misleading, especially for latency information or
> > other asymmetric distributions. Most people who care about this kind of
> > thing would say the same. If you are a user of a multi-tenant cluster
> then
> > you probably care a lot more about stats for your application or your
> topic
> > rather than the global, so it could be nice to have histograms for
> these. I
> > don't feel super strongly about this.
> >
> > The ExponentiallyDecayingSample is internally
> > a ConcurrentSkipListMap. This seems to have an overhead of
> > about 64 bytes per entry. So a 1000 element sample is 64KB. For global
> > metrics this is fine, but for granular metrics not workable.
> >
> > Two other issues I'm not sure about:
> >
> > 1. Is there a way to get metric descriptions into the coda hale JMX
> output?
> > One of the really nicest practical things about the new client metrics is
> > that if you look at them in jconsole each metric has an associated
> > description that explains what it means. I think this is a nice usability
> > thing--it is really hard to know what to make of the current metrics
> > without this kind of documentation and keeping separate docs up-to-date
> is
> > really hard and even if you do it most people won't find it.
> >
> > 2. I'm not clear if the sample decay in the histogram is actually the
> same
> > as for the other stats. It seems like it isn't but this would make
> > interpretation quite difficult. In other words if I have N metrics
> > including some Histograms some Meters, etc are all these measurements all
> > taken over the same time window? I actually think they are not, it looks
> > like there are different sampling methodologies across. So this means if
> > you have a dashboard that plots these things side by side the measurement
> > at a given point in time is not actually comparable across multiple
> stats.
> > Am I confused about this?
> >
> > -Jay
> >
> >
> > On Fri, Mar 27, 2015 at 6:27 PM, Joel Koshy  wrote:
> >
> > > For the samples: it will be at least double that estimate I think
> > > since the long array contains (eight byte) references to the actual
> > > longs, each of which also have some object overhead.
> > >
> > > Re: testing: actually, it looks like YM metrics does allow you to
> > > drop in your own clock:
> > >
> > >
> >
> https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Clock.java
> > >
> > >
> >
> https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Meter.java#L36
> > >
> > > 

RE: Metrics package discussion

2015-03-30 Thread Aditya Auradkar
If we do plan to use the network code in client, I think that is a good reason 
in favor of migration. It will be unnecessary to have metrics from multiple 
libraries coexist since our users will have to start monitoring these new 
metrics anyway.

I also agree with Jay that in multi-tenant clusters people care about detailed 
statistics for their own application over global numbers. 

Based on the arguments so far, I'm +1 for migrating to KM.

Thanks,
Aditya


From: Jun Rao [j...@confluent.io]
Sent: Sunday, March 29, 2015 9:44 AM
To: dev@kafka.apache.org
Subject: Re: Metrics package discussion

There is another thing to consider. We plan to reuse the client components
on the server side over time. For example, as part of the security work, we
are looking into replacing the server side network code with the client
network code (KAFKA-1928). However, the client network already has metrics
based on KM.

Thanks,

Jun

On Sat, Mar 28, 2015 at 1:34 PM, Jay Kreps  wrote:

> I think Joel's summary is good.
>
> I'll add a few more points:
>
> As discussed memory matters a lot if we want to be able to give percentiles
> at the client or topic level, in which case we will have thousands of them.
> If we just do histograms at the global level then it is not a concern. The
> argument for doing histograms at the client and topic level is that
> averages are often very misleading, especially for latency information or
> other asymmetric distributions. Most people who care about this kind of
> thing would say the same. If you are a user of a multi-tenant cluster then
> you probably care a lot more about stats for your application or your topic
> rather than the global, so it could be nice to have histograms for these. I
> don't feel super strongly about this.
>
> The ExponentiallyDecayingSample is internally
> a ConcurrentSkipListMap. This seems to have an overhead of
> about 64 bytes per entry. So a 1000 element sample is 64KB. For global
> metrics this is fine, but for granular metrics not workable.
>
> Two other issues I'm not sure about:
>
> 1. Is there a way to get metric descriptions into the coda hale JMX output?
> One of the really nicest practical things about the new client metrics is
> that if you look at them in jconsole each metric has an associated
> description that explains what it means. I think this is a nice usability
> thing--it is really hard to know what to make of the current metrics
> without this kind of documentation and keeping separate docs up-to-date is
> really hard and even if you do it most people won't find it.
>
> 2. I'm not clear if the sample decay in the histogram is actually the same
> as for the other stats. It seems like it isn't but this would make
> interpretation quite difficult. In other words if I have N metrics
> including some Histograms some Meters, etc are all these measurements all
> taken over the same time window? I actually think they are not, it looks
> like there are different sampling methodologies across. So this means if
> you have a dashboard that plots these things side by side the measurement
> at a given point in time is not actually comparable across multiple stats.
> Am I confused about this?
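
[On Jay's first question above: the new-client metrics attach a description
string to each MetricName, which is what shows up next to the value in
jconsole. A minimal sketch assuming the org.apache.kafka.common.metrics API;
the names are illustrative, not the broker's actual metrics.]

import java.util.Collections;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.JmxReporter;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;

public class MetricDescriptionExample {
    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        metrics.addReporter(new JmxReporter("kafka.example"));
        Sensor latency = metrics.sensor("request-latency");
        latency.add(new MetricName("request-latency-avg", "example-metrics",
                        "The average request latency in ms",  // visible in jconsole
                        Collections.<String, String>emptyMap()),
                new Avg());
        latency.record(12.0);
    }
}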
>
> -Jay
>
>
> On Fri, Mar 27, 2015 at 6:27 PM, Joel Koshy  wrote:
>
> > For the samples: it will be at least double that estimate I think
> > since the long array contains (eight byte) references to the actual
> > longs, each of which also have some object overhead.
> >
> > Re: testing: actually, it looks like YM metrics does allow you to
> > drop in your own clock:
> >
> >
> https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Clock.java
> >
> >
> https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Meter.java#L36
> >
> > Not sure if it was mentioned in this (or some recent) thread but a
> > major motivation in the kafka-common metrics (KM) was absorbing API
> > changes and even mbean naming conventions. For e.g., in the early
> > stages of 0.8 we picked up YM metrics 3.x but collided with client
> > apps at LinkedIn which were still on 2.x. We ended up changing our
> > code to use 2.x in the end. Having our own metrics package makes us
> > less vulnerable to these kinds of changes. The multiple version
> > collision problem is obviously less of an issue with the broker but we
> > are still exposed to possible metric changes in YM metrics.
> >
> > I'm wondering if we need to weigh too much toward the memory overheads
> > of histograms in making a decision here simply because I don't think
> > we have found

Re: Metrics package discussion

2015-03-29 Thread Jun Rao
> > mbeans can logically group attributes computed from different
> > sensors. So there is logical grouping (as opposed to a separate
> > mbean per sensor as is the case in YM metrics).
> >
> > The main disadvantages:
> > - Everyone's graphs and alerts will break and need to be updated
> > - Histogram support needs to be tested more/improved
> >
> > The first disadvantage is a big one but we aren't exactly immune to
> > that if we stick with YM.
> >
> > BTW with KM metrics we should also provide reporters (graphite,
> > ganglia) but we probably need to do this anyway since the new clients
> > are on KM metrics.
> >
> > Thanks,
> >
> > Joel
> >
> > On Fri, Mar 27, 2015 at 06:48:48PM +, Aditya Auradkar wrote:
> > > Adding to what Jay said.
> > >
> > > The library maintains 1k samples by default. The UniformSample has a
> > long array so about 8k overhead per histogram. The
> > ExponentiallyDecayingSample (which is what we use) has a 16 byte overhead
> > per stored sample, so about 16k per histogram. So 10k histograms (worst
> > case? metrics per partition and client) is about 160MB of memory in the
> > broker.
> > >
> > > Copying is also a problem. For  percentiles on HistogramMBean, the
> > implementation does a copy of the entire array. For e.g., if we called
> > get50Percentile() and get75Percentile(), the entire array would get
> copied
> > twice which is pretty bad if we called each metric on every MBean.
> > >
> > > Another point Joel mentioned is that codahale metrics are harder to
> > write tests against because we cannot pass in a Clock.
> > >
> > > IMO, if a library is preventing us from adding all the metrics that we
> > want to add and we have a viable alternative, we should replace it. It
> > might be short term pain but in the long run we will have more useful
> > graphs.
> > > What do people think? I can start a vote thread on this once we have a
> > couple more opinions.
> > >
> > > Thanks,
> > > Aditya
> > > 
> > > From: Jay Kreps [jay.kr...@gmail.com]
> > > Sent: Thursday, March 26, 2015 2:29 PM
> > > To: dev@kafka.apache.org
> > > Subject: Re: Metrics package discussion
> > >
> > > Yeah that is a good summary.
> > >
> > > The reason we don't use histograms heavily in the server is because of
> > the
> > > memory issues. We originally did use histograms for everything, then we
> > ran
> > > into all these issues, and ripped them out. Whether they are really
> > useful
> > > or not, I don't know. Averages can be pretty misleading so it can be
> nice
> > > but I don't know that it is critical.
> > >
> > > -Jay
> > >
> > > On Thu, Mar 26, 2015 at 1:58 PM, Aditya Auradkar <
> > > aaurad...@linkedin.com.invalid> wrote:
> > >
> > > > From what I can tell, Histograms don't seem to be used extensively in
> > the
> > > > Kafka server (only in RequestChannel.scala) and I'm not sure we need
> > them
> > > > for per-client metrics. Topic metrics use meters currently.
> Migrating
> > > > graphing, alerting will be quite a significant effort for all users
> of
> > > > Kafka. Do the potential benefits of the new metrics package outweigh
> > this
> > > > one time migration? In the long run it seems nice to have a unified
> > metrics
> > > > package across clients and server. If we were starting out from
> scratch
> > > > without any existing deployments, what decision would we take?
> > > >
> > > > I suppose the relative effort in supporting is a useful data point in
> > this
> > > > discussion. We need to throttle based on the current byte rate which
> > should
> > > > be a "Meter" in codahale terms. The Meter implementation uses a 1, 5
> > and 15
> > > > minute exponential window moving average. The library also does not
> > use the
> > > > most recent samples of data for Metered metrics. For calculating
> > rates, the
> > > > EWMA class has a scheduled task that runs every 5 seconds and adjusts
> > the
> > > > rate using the new data accordingly. In that particular case, I think
> > the
> > > > new library is superior since it is more responsive.  If we do choose
> > to
> > > > remain with Yammer on the server, here are a few 

Re: Metrics package discussion

2015-03-28 Thread Jay Kreps
> >
> > The library maintains 1k samples by default. The UniformSample has a
> long array so about 8k overhead per histogram. The
> ExponentiallyDecayingSample (which is what we use) has a 16 byte overhead
> per stored sample, so about 16k per histogram. So 10k histograms (worst
> case? metrics per partition and client) is about 160MB of memory in the
> broker.
> >
> > Copying is also a problem. For  percentiles on HistogramMBean, the
> implementation does a copy of the entire array. For e.g., if we called
> get50Percentile() and get75Percentile(), the entire array would get copied
> twice which is pretty bad if we called each metric on every MBean.
> >
> > Another point Joel mentioned is that codahale metrics are harder to
> write tests against because we cannot pass in a Clock.
> >
> > IMO, if a library is preventing us from adding all the metrics that we
> want to add and we have a viable alternative, we should replace it. It
> might be short term pain but in the long run we will have more useful
> graphs.
> > What do people think? I can start a vote thread on this once we have a
> couple more opinions.
> >
> > Thanks,
> > Aditya
> > 
> > From: Jay Kreps [jay.kr...@gmail.com]
> > Sent: Thursday, March 26, 2015 2:29 PM
> > To: dev@kafka.apache.org
> > Subject: Re: Metrics package discussion
> >
> > Yeah that is a good summary.
> >
> > The reason we don't use histograms heavily in the server is because of
> the
> > memory issues. We originally did use histograms for everything, then we
> ran
> > into all these issues, and ripped them out. Whether they are really
> useful
> > or not, I don't know. Averages can be pretty misleading so it can be nice
> > but I don't know that it is critical.
> >
> > -Jay
> >
> > On Thu, Mar 26, 2015 at 1:58 PM, Aditya Auradkar <
> > aaurad...@linkedin.com.invalid> wrote:
> >
> > > From what I can tell, Histograms don't seem to be used extensively in
> the
> > > Kafka server (only in RequestChannel.scala) and I'm not sure we need
> them
> > > for per-client metrics. Topic metrics use meters currently.  Migrating
> > > graphing, alerting will be quite a significant effort for all users of
> > > Kafka. Do the potential benefits of the new metrics package outweigh
> this
> > > one time migration? In the long run it seems nice to have a unified
> metrics
> > > package across clients and server. If we were starting out from scratch
> > > without any existing deployments, what decision would we take?
> > >
> > > I suppose the relative effort in supporting is a useful data point in
> this
> > > discussion. We need to throttle based on the current byte rate which
> should
> > > be a "Meter" in codahale terms. The Meter implementation uses a 1, 5
> and 15
> > > minute exponential window moving average. The library also does not
> use the
> > > most recent samples of data for Metered metrics. For calculating
> rates, the
> > > EWMA class has a scheduled task that runs every 5 seconds and adjusts
> the
> > > rate using the new data accordingly. In that particular case, I think
> the
> > > new library is superior since it is more responsive.  If we do choose
> to
> > > remain with Yammer on the server, here are a few ideas on how to
> support
> > > quotas with relatively less effort.
> > >
> > > - We could have a new type of Meter called "QuotaMeter" that can wrap
> the
> > > existing meter code that follows the same pattern that the Sensor does
> in
> > > the new metrics library. This QuotaMeter needs to be configured with a
> > > Quota and it can have a finer grained rate than 1 minute (10 seconds?
> > > configurable?). Anytime we call "mark()", it update the underlying
> rates
> > > and throw a QuotaViolationException if required. This class can either
> > > extend Meter or be a separate implementation of the Metric superclass
> that
> > > every metric implements.
> > >
> > > - We can also consider implementing these quotas with the new metrics
> > > package and have these co-exist with the existing metrics. This leads
> to 2
> > > metric packages being used on the server, but they are both pulled in
> as
> > > dependencies anyway. Using this for metrics we can quota on may not be
> a
> > > bad place to start.
> > >
> > > Thanks,
> > > Aditya
> > > 

Re: Metrics package discussion

2015-03-27 Thread Jun Rao
A few more comments.

Currently, we control the YM jmx metric name by using a different
constructor of MetricName. So, the jmx names should remain unchanged when
upgrading YM. We probably should check if that constructor still exists in
the new version of YM.
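
[For reference, a sketch of the constructor in question (Yammer 2.x API); the
group/type/name/scope values below are only illustrative, not the broker's
actual naming:]

import java.util.concurrent.TimeUnit;
import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.MetricName;

public class ExplicitMBeanName {
    public static void main(String[] args) {
        // The fifth argument pins the exact JMX ObjectName, so the published
        // mbean name stays stable even if the library's default naming changes.
        MetricName name = new MetricName(
                "kafka.server",                 // group
                "BrokerTopicMetrics",           // type
                "MessagesInPerSec",             // name
                "topic.my-topic",               // scope
                "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=my-topic");
        Metrics.newMeter(name, "messages", TimeUnit.SECONDS);
    }
}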

The current histogram implementation in KM requires the user to know the
range of the value and pick a bucketing scheme, which makes it a bit
inconvenient to use. Ideally, we would probably want a default bucketing
scheme that gives reasonable precision for any value range.
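
[The inconvenience Jun mentions is visible in the API itself: the caller has
to bound the value range and pick a bucketing scheme up front. A minimal
sketch with the new-client classes; the size, range and percentile choices
are illustrative:]

import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Percentile;
import org.apache.kafka.common.metrics.stats.Percentiles;
import org.apache.kafka.common.metrics.stats.Percentiles.BucketSizing;

public class KmHistogramExample {
    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        Sensor requestTime = metrics.sensor("request-time");
        // The caller must guess the max value (30s here) and choose a
        // CONSTANT or LINEAR bucket layout before any data is recorded.
        requestTime.add(new Percentiles(
                4000,                  // bytes of bucket storage
                0.0, 30000.0,          // expected min/max of the measured value
                BucketSizing.LINEAR,
                new Percentile(new MetricName("request-time-p50", "example-metrics"), 50.0),
                new Percentile(new MetricName("request-time-p99", "example-metrics"), 99.0)));
        requestTime.record(42.0);
    }
}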

Thanks,

Jun


On Fri, Mar 27, 2015 at 6:27 PM, Joel Koshy  wrote:

> For the samples: it will be at least double that estimate I think
> since the long array contains (eight byte) references to the actual
> longs, each of which also have some object overhead.
>
> Re: testing: actually, it looks like YM metrics does allow you to
> drop in your own clock:
>
> https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Clock.java
>
> https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Meter.java#L36
>
> Not sure if it was mentioned in this (or some recent) thread but a
> major motivation in the kafka-common metrics (KM) was absorbing API
> changes and even mbean naming conventions. For e.g., in the early
> stages of 0.8 we picked up YM metrics 3.x but collided with client
> apps at LinkedIn which were still on 2.x. We ended up changing our
> code to use 2.x in the end. Having our own metrics package makes us
> less vulnerable to these kinds of changes. The multiple version
> collision problem is obviously less of an issue with the broker but we
> are still exposed to possible metric changes in YM metrics.
>
> I'm wondering if we need to weigh too much toward the memory overheads
> of histograms in making a decision here simply because I don't think
> we have found them to be an extreme necessity for
> per-clientid/per-partition metrics and they are more critical for
> aggregate (global) metrics.
>
> So it seems the main benefits of switching to KM metrics are:
> - Less exposure to YM metrics changes
> - More control over the actual implementation. E.g., there is
>   considerable research on implementing approximate-but-good-enough
>   histograms/percentiles that we can try out
> - Differences (improvements) from YM metrics such as:
>   - hierarchical sensors
>   - integrated with quota enforcement
>   - mbeans can logically group attributes computed from different
> sensors. So there is logical grouping (as opposed to a separate
> mbean per sensor as is the case in YM metrics).
>
> The main disadvantages:
> - Everyone's graphs and alerts will break and need to be updated
> - Histogram support needs to be tested more/improved
>
> The first disadvantage is a big one but we aren't exactly immune to
> that if we stick with YM.
>
> BTW with KM metrics we should also provide reporters (graphite,
> ganglia) but we probably need to do this anyway since the new clients
> are on KM metrics.
>
> Thanks,
>
> Joel
>
> On Fri, Mar 27, 2015 at 06:48:48PM +, Aditya Auradkar wrote:
> > Adding to what Jay said.
> >
> > The library maintains 1k samples by default. The UniformSample has a
> long array so about 8k overhead per histogram. The
> ExponentiallyDecayingSample (which is what we use) has a 16 byte overhead
> per stored sample, so about 16k per histogram. So 10k histograms (worst
> case? metrics per partition and client) is about 160MB of memory in the
> broker.
> >
> > Copying is also a problem. For  percentiles on HistogramMBean, the
> implementation does a copy of the entire array. For e.g., if we called
> get50Percentile() and get75Percentile(), the entire array would get copied
> twice which is pretty bad if we called each metric on every MBean.
> >
> > Another point Joel mentioned is that codahale metrics are harder to
> write tests against because we cannot pass in a Clock.
> >
> > IMO, if a library is preventing us from adding all the metrics that we
> want to add and we have a viable alternative, we should replace it. It
> might be short term pain but in the long run we will have more useful
> graphs.
> > What do people think? I can start a vote thread on this once we have a
> couple more opinions.
> >
> > Thanks,
> > Aditya
> > 
> > From: Jay Kreps [jay.kr...@gmail.com]
> > Sent: Thursday, March 26, 2015 2:29 PM
> > To: dev@kafka.apache.org
> > Subject: Re: Metrics package discussion
> >
> > Yeah that is a good summary.
> >
> > The reason we don't use histograms heavily in the server is because 

Re: Metrics package discussion

2015-03-27 Thread Joel Koshy
For the samples: it will be at least double that estimate I think
since the long array contains (eight byte) references to the actual
longs, each of which also has some object overhead.

Re: testing: actually, it looks like YM metrics does allow you to
drop in your own clock:
https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Clock.java
https://github.com/dropwizard/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Meter.java#L36
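
[A minimal sketch of what that hook allows, written against the codahale 3.x
classes the links point to; the test setup itself is invented for illustration:]

import java.util.concurrent.TimeUnit;
import com.codahale.metrics.Clock;
import com.codahale.metrics.Meter;

public class FakeClockExample {

    // A test clock that is advanced by hand instead of following wall time.
    static class FakeClock extends Clock {
        private long nanos = 0;
        @Override
        public long getTick() { return nanos; }
        void advance(long millis) { nanos += TimeUnit.MILLISECONDS.toNanos(millis); }
    }

    public static void main(String[] args) {
        FakeClock clock = new FakeClock();
        Meter meter = new Meter(clock);   // Meter accepts a Clock in 3.x
        meter.mark(100);
        clock.advance(60000);             // pretend a minute has passed
        System.out.println(meter.getOneMinuteRate());
    }
}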

Not sure if it was mentioned in this (or some recent) thread but a
major motivation in the kafka-common metrics (KM) was absorbing API
changes and even mbean naming conventions. For e.g., in the early
stages of 0.8 we picked up YM metrics 3.x but collided with client
apps at LinkedIn which were still on 2.x. We ended up changing our
code to use 2.x in the end. Having our own metrics package makes us
less vulnerable to these kinds of changes. The multiple version
collision problem is obviously less of an issue with the broker but we
are still exposed to possible metric changes in YM metrics.

I'm wondering if we need to weigh too much toward the memory overheads
of histograms in making a decision here simply because I don't think
we have found them to be an extreme necessity for
per-clientid/per-partition metrics and they are more critical for
aggregate (global) metrics.

So it seems the main benefits of switching to KM metrics are:
- Less exposure to YM metrics changes
- More control over the actual implementation. E.g., there is
  considerable research on implementing approximate-but-good-enough
  histograms/percentiles that we can try out
- Differences (improvements) from YM metrics such as:
  - hierarchical sensors
  - integrated with quota enforcement
  - mbeans can logically group attributes computed from different
sensors. So there is logical grouping (as opposed to a separate
mbean per sensor as is the case in YM metrics).
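
[On the quota-enforcement point in the list above, a minimal sketch of how a
KM sensor ties a quota to a metric (new-client API); the metric names and the
1 MB/s bound are illustrative:]

import java.util.Collections;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.MetricConfig;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Quota;
import org.apache.kafka.common.metrics.QuotaViolationException;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

public class KmQuotaExample {
    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        Sensor bytesIn = metrics.sensor("client-1.bytes-in");
        // The quota is part of the metric config; the sensor checks it on record().
        bytesIn.add(new MetricName("byte-rate", "client-quota-metrics",
                        "Bytes/sec for client-1",
                        Collections.<String, String>emptyMap()),
                new Rate(),
                new MetricConfig().quota(Quota.upperBound(1024 * 1024)));
        try {
            bytesIn.record(1024.0 * 1024 * 1024);   // a burst far over the bound
        } catch (QuotaViolationException e) {
            System.out.println("throttle this client: " + e.getMessage());
        }
    }
}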

The main disadvantages:
- Everyone's graphs and alerts will break and need to be updated
- Histogram support needs to be tested more/improved

The first disadvantage is a big one but we aren't exactly immune to
that if we stick with YM.

BTW with KM metrics we should also provide reporters (graphite,
ganglia) but we probably need to do this anyway since the new clients
are on KM metrics.

Thanks,

Joel

On Fri, Mar 27, 2015 at 06:48:48PM +, Aditya Auradkar wrote:
> Adding to what Jay said. 
> 
> The library maintains 1k samples by default. The UniformSample has a long 
> array so about 8k overhead per histogram. The ExponentiallyDecayingSample 
> (which is what we use) has a 16 byte overhead per stored sample, so about 16k 
> per histogram. So 10k histograms (worst case? metrics per partition and 
> client) is about 160MB of memory in the broker.
> 
> Copying is also a problem. For  percentiles on HistogramMBean, the 
> implementation does a copy of the entire array. For e.g., if we called 
> get50Percentile() and get75Percentile(), the entire array would get copied 
> twice which is pretty bad if we called each metric on every MBean.
> 
> Another point Joel mentioned is that codahale metrics are harder to write 
> tests against because we cannot pass in a Clock. 
> 
> IMO, if a library is preventing us from adding all the metrics that we want 
> to add and we have a viable alternative, we should replace it. It might be 
> short term pain but in the long run we will have more useful graphs. 
> What do people think? I can start a vote thread on this once we have a couple 
> more opinions.
> 
> Thanks,
> Aditya
> 
> From: Jay Kreps [jay.kr...@gmail.com]
> Sent: Thursday, March 26, 2015 2:29 PM
> To: dev@kafka.apache.org
> Subject: Re: Metrics package discussion
> 
> Yeah that is a good summary.
> 
> The reason we don't use histograms heavily in the server is because of the
> memory issues. We originally did use histograms for everything, then we ran
> into all these issues, and ripped them out. Whether they are really useful
> or not, I don't know. Averages can be pretty misleading so it can be nice
> but I don't know that it is critical.
> 
> -Jay
> 
> On Thu, Mar 26, 2015 at 1:58 PM, Aditya Auradkar <
> aaurad...@linkedin.com.invalid> wrote:
> 
> > From what I can tell, Histograms don't seem to be used extensively in the
> > Kafka server (only in RequestChannel.scala) and I'm not sure we need them
> > for per-client metrics. Topic metrics use meters currently.  Migrating
> > graphing, alerting will be quite a significant effort for all users of
> > Kafka. Do the potential benefits of the new metrics package outweigh this
> > one time migration? In the long run it seems nice to have a unified metrics
> > package across

RE: Metrics package discussion

2015-03-27 Thread Aditya Auradkar
Adding to what Jay said. 

The library maintains 1k samples by default. The UniformSample has a long array 
so about 8k overhead per histogram. The ExponentiallyDecayingSample (which is 
what we use) has a 16 byte overhead per stored sample, so about 16k per 
histogram. So 10k histograms (worst case? metrics per partition and client) is 
about 160MB of memory in the broker.
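
[The arithmetic behind that estimate, spelled out; these are the rough
per-sample numbers quoted in this thread, and Joel notes elsewhere that the
real footprint is likely at least double once per-Long object overhead and
references are counted:]

public class HistogramMemoryEstimate {
    public static void main(String[] args) {
        int samplesPerHistogram = 1000;   // ExponentiallyDecayingSample default
        int bytesPerSample = 16;          // rough overhead per stored sample
        int histograms = 10000;           // e.g. per-partition/per-client worst case
        long bytes = (long) samplesPerHistogram * bytesPerSample * histograms;
        System.out.printf("~%d MB%n", bytes / (1024 * 1024));  // ~152 MB, i.e. "about 160MB"
    }
}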

Copying is also a problem. For  percentiles on HistogramMBean, the 
implementation does a copy of the entire array. For e.g., if we called 
get50Percentile() and get75Percentile(), the entire array would get copied 
twice which is pretty bad if we called each metric on every MBean.

Another point Joel mentioned is that codahale metrics are harder to write tests 
against because we cannot pass in a Clock. 

IMO, if a library is preventing us from adding all the metrics that we want to 
add and we have a viable alternative, we should replace it. It might be short 
term pain but in the long run we will have more useful graphs. 
What do people think? I can start a vote thread on this once we have a couple 
more opinions.

Thanks,
Aditya

From: Jay Kreps [jay.kr...@gmail.com]
Sent: Thursday, March 26, 2015 2:29 PM
To: dev@kafka.apache.org
Subject: Re: Metrics package discussion

Yeah that is a good summary.

The reason we don't use histograms heavily in the server is because of the
memory issues. We originally did use histograms for everything, then we ran
into all these issues, and ripped them out. Whether they are really useful
or not, I don't know. Averages can be pretty misleading so it can be nice
but I don't know that it is critical.

-Jay

On Thu, Mar 26, 2015 at 1:58 PM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> From what I can tell, Histograms don't seem to be used extensively in the
> Kafka server (only in RequestChannel.scala) and I'm not sure we need them
> for per-client metrics. Topic metrics use meters currently.  Migrating
> graphing, alerting will be quite a significant effort for all users of
> Kafka. Do the potential benefits of the new metrics package outweigh this
> one time migration? In the long run it seems nice to have a unified metrics
> package across clients and server. If we were starting out from scratch
> without any existing deployments, what decision would we take?
>
> I suppose the relative effort in supporting quotas is a useful data point in this
> discussion. We need to throttle based on the current byte rate which should
> be a "Meter" in codahale terms. The Meter implementation uses a 1, 5 and 15
> minute exponential window moving average. The library also does not use the
> most recent samples of data for Metered metrics. For calculating rates, the
> EWMA class has a scheduled task that runs every 5 seconds and adjusts the
> rate using the new data accordingly. In that particular case, I think the
> new library is superior since it is more responsive.  If we do choose to
> remain with Yammer on the server, here are a few ideas on how to support
> quotas with relatively less effort.
>
> - We could have a new type of Meter called "QuotaMeter" that can wrap the
> existing meter code that follows the same pattern that the Sensor does in
> the new metrics library. This QuotaMeter needs to be configured with a
> Quota and it can have a finer grained rate than 1 minute (10 seconds?
> configurable?). Anytime we call "mark()", it updates the underlying rates
> and throws a QuotaViolationException if required. This class can either
> extend Meter or be a separate implementation of the Metric superclass that
> every metric implements.
>
> - We can also consider implementing these quotas with the new metrics
> package and have these co-exist with the existing metrics. This leads to 2
> metric packages being used on the server, but they are both pulled in as
> dependencies anyway. Using this for metrics we can quota on may not be a
> bad place to start.
>
> Thanks,
> Aditya
> ________
> From: Jay Kreps [jay.kr...@gmail.com]
> Sent: Wednesday, March 25, 2015 11:08 PM
> To: dev@kafka.apache.org
> Subject: Re: Metrics package discussion
>
> Here was my understanding of the issue last time.
>
> The yammer metrics use a random sample of requests to estimate the
> histogram. This allocates a fairly large array of longs (their values are
> longs rather than floats). A reasonable sample might be 8k entries which
> would give about 64KB per histogram. There are bounds on accuracy, but they
> are only probabilistic. I.e. if you try to get 99% < 5 ms of inaccuracy,
> you will 1% of the time get more than this. This is okay, but if you try to
> alert on it you realize that being wrong 1% of the time is a lot when you
> are co

Re: Metrics package discussion

2015-03-26 Thread Jay Kreps
Yeah that is a good summary.

The reason we don't use histograms heavily in the server is because of the
memory issues. We originally did use histograms for everything, then we ran
into all these issues, and ripped them out. Whether they are really useful
or not, I don't know. Averages can be pretty misleading, so histograms can be
nice to have, but I don't know that it is critical.

-Jay

On Thu, Mar 26, 2015 at 1:58 PM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> From what I can tell, Histograms don't seem to be used extensively in the
> Kafka server (only in RequestChannel.scala) and I'm not sure we need them
> for per-client metrics. Topic metrics use meters currently.  Migrating
> graphing, alerting will be quite a significant effort for all users of
> Kafka. Do the potential benefits of the new metrics package outweigh this
> one time migration? In the long run it seems nice to have a unified metrics
> package across clients and server. If we were starting out from scratch
> without any existing deployments, what decision would we take?
>
> I suppose the relative effort in supporting quotas is a useful data point in this
> discussion. We need to throttle based on the current byte rate which should
> be a "Meter" in codahale terms. The Meter implementation uses a 1, 5 and 15
> minute exponential window moving average. The library also does not use the
> most recent samples of data for Metered metrics. For calculating rates, the
> EWMA class has a scheduled task that runs every 5 seconds and adjusts the
> rate using the new data accordingly. In that particular case, I think the
> new library is superior since it is more responsive.  If we do choose to
> remain with Yammer on the server, here are a few ideas on how to support
> quotas with relatively less effort.
>
> - We could have a new type of Meter called "QuotaMeter" that can wrap the
> existing meter code that follows the same pattern that the Sensor does in
> the new metrics library. This QuotaMeter needs to be configured with a
> Quota and it can have a finer grained rate than 1 minute (10 seconds?
> configurable?). Anytime we call "mark()", it updates the underlying rates
> and throws a QuotaViolationException if required. This class can either
> extend Meter or be a separate implementation of the Metric superclass that
> every metric implements.
>
> - We can also consider implementing these quotas with the new metrics
> package and have these co-exist with the existing metrics. This leads to 2
> metric packages being used on the server, but they are both pulled in as
> dependencies anyway. Using this for metrics we can quota on may not be a
> bad place to start.
>
> Thanks,
> Aditya
> ________
> From: Jay Kreps [jay.kr...@gmail.com]
> Sent: Wednesday, March 25, 2015 11:08 PM
> To: dev@kafka.apache.org
> Subject: Re: Metrics package discussion
>
> Here was my understanding of the issue last time.
>
> The yammer metrics use a random sample of requests to estimate the
> histogram. This allocates a fairly large array of longs (their values are
> longs rather than floats). A reasonable sample might be 8k entries which
> would give about 64KB per histogram. There are bounds on accuracy, but they
> are only probabilistic. I.e. if you try to get 99% < 5 ms of inaccuracy,
> you will 1% of the time get more than this. This is okay, but if you try to
> alert on it you realize that being wrong 1% of the time is a lot when you
> are computing stats every second continuously on many metrics (i.e. 1 in
> 100 estimates will be outside your bound). This array is copied in full
> every time you check the metric which is the other cause of the memory
> pressure.
>
> The better approach to histograms is to calculate bucket boundaries and
> record arbitrarily many values in those buckets. A simple bucketing
> approach for latency would be 0, 5ms, 10ms, 15ms, etc, and you just count
> how many fall in each bucket. Your precision is deterministically bounded
> by the bucket boundaries, so if you had 5ms buckets you would never have
> more than 5ms loss of precision. By using non-uniform bucket sizes you can
> make this work even better (e.g. give ~1ms precision for latencies in the
> 1ms range, but give only 1 second precision for latencies in the 30 second
> range). That is what is implemented in that metrics package.
>
> I think this bucketing approach is popular now. There is a whole "HDR
> histogram" library that gives lots of different bucketing methods and
> implements dynamic resizing so you don't have to specify an upper bound.
>  https://github.com/HdrHistogram/HdrHistogram
>
> Whether this matters depends entirely if you want histograms b

RE: Metrics package discussion

2015-03-26 Thread Aditya Auradkar
From what I can tell, Histograms don't seem to be used extensively in the 
Kafka server (only in RequestChannel.scala) and I'm not sure we need them for 
per-client metrics. Topic metrics use meters currently. Migrating graphing and 
alerting will be quite a significant effort for all users of Kafka. Do the 
potential benefits of the new metrics package outweigh this one-time 
migration? In the long run it seems nice to have a unified metrics package 
across clients and server. If we were starting out from scratch without any 
existing deployments, what decision would we take?

I suppose the relative effort in supporting quotas is a useful data point in this 
discussion. We need to throttle based on the current byte rate which should be 
a "Meter" in codahale terms. The Meter implementation uses a 1, 5 and 15 minute 
exponential window moving average. The library also does not use the most 
recent samples of data for Metered metrics. For calculating rates, the EWMA 
class has a scheduled task that runs every 5 seconds and adjusts the rate using 
the new data accordingly. In that particular case, I think the new library is 
superior since it is more responsive.  If we do choose to remain with Yammer on 
the server, here are a few ideas on how to support quotas with relatively less 
effort.  
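
(For reference, before the ideas below: reading those windowed rates from a 
codahale Meter looks roughly like this, using 3.x method names from memory and a 
made-up metric name.)

    import com.codahale.metrics.Meter;
    import com.codahale.metrics.MetricRegistry;

    public class MeterRates {
        public static void main(String[] args) throws InterruptedException {
            MetricRegistry registry = new MetricRegistry();
            Meter bytesIn = registry.meter("broker.bytes-in");

            bytesIn.mark(1024);      // record 1 KB of traffic
            Thread.sleep(6_000);     // the EWMA rates only move on the ~5 second tick

            System.out.println("1-min rate:  " + bytesIn.getOneMinuteRate());
            System.out.println("5-min rate:  " + bytesIn.getFiveMinuteRate());
            System.out.println("15-min rate: " + bytesIn.getFifteenMinuteRate());
        }
    }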

- We could have a new type of Meter called "QuotaMeter" that can wrap the 
existing meter code that follows the same pattern that the Sensor does in the 
new metrics library. This QuotaMeter needs to be configured with a Quota and it 
can have a finer grained rate than 1 minute (10 seconds? configurable?). 
Anytime we call "mark()", it updates the underlying rates and throws a 
QuotaViolationException if required. This class can either extend Meter or be a 
separate implementation of the Metric superclass that every metric implements 
(see the rough sketch after this list).

- We can also consider implementing these quotas with the new metrics package 
and have these co-exist with the existing metrics. This leads to 2 metric 
packages being used on the server, but they are both pulled in as dependencies 
anyway. Using this for the metrics we quota on may not be a bad place to start.
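
A very rough sketch of the QuotaMeter idea from the first bullet. Nothing here is 
existing code: QuotaViolationException, the window size and the wall-clock 
windowing are all assumptions for illustration, and a real version would wrap or 
extend the existing Meter so the normal rates keep getting reported.

    /** Hypothetical quota-enforcing meter, sketching the first bullet above. */
    class QuotaViolationException extends RuntimeException {
        QuotaViolationException(String message) { super(message); }
    }

    class QuotaMeter {
        private final double quotaPerSecond;   // the configured Quota
        private final long windowMs;            // finer grained than 1 minute, e.g. 10 seconds
        private long windowStartMs;
        private long windowCount;

        QuotaMeter(double quotaPerSecond, long windowMs) {
            this.quotaPerSecond = quotaPerSecond;
            this.windowMs = windowMs;
            this.windowStartMs = System.currentTimeMillis();
        }

        synchronized void mark(long n) {
            long now = System.currentTimeMillis();
            if (now - windowStartMs >= windowMs) {   // roll over to a fresh window
                windowStartMs = now;
                windowCount = 0;
            }
            windowCount += n;
            long allowedPerWindow = (long) (quotaPerSecond * windowMs / 1000.0);
            if (windowCount > allowedPerWindow) {
                throw new QuotaViolationException(
                        windowCount + " in the current window exceeds the allowed " + allowedPerWindow);
            }
        }
    }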

Thanks,
Aditya

From: Jay Kreps [jay.kr...@gmail.com]
Sent: Wednesday, March 25, 2015 11:08 PM
To: dev@kafka.apache.org
Subject: Re: Metrics package discussion

Here was my understanding of the issue last time.

The yammer metrics use a random sample of requests to estimate the
histogram. This allocates a fairly large array of longs (their values are
longs rather than floats). A reasonable sample might be 8k entries which
would give about 64KB per histogram. There are bounds on accuracy, but they
are only probabilistic. I.e. if you try to get 99% < 5 ms of inaccuracy,
> you will 1% of the time get more than this. This is okay, but if you try to
> alert on it you realize that being wrong 1% of the time is a lot when you
> are computing stats every second continuously on many metrics (i.e. 1 in
> 100 estimates will be outside your bound). This array is copied in full
every time you check the metric which is the other cause of the memory
pressure.

> The better approach to histograms is to calculate bucket boundaries and
record arbitrarily many values in those buckets. A simple bucketing
approach for latency would be 0, 5ms, 10ms, 15ms, etc, and you just count
how many fall in each bucket. Your precision is deterministically bounded
by the bucket boundaries, so if you had 5ms buckets you would never have
more than 5ms loss of precision. By using non-uniform bucket sizes you can
make this work even better (e.g. give ~1ms precision for latencies in the
1ms range, but give only 1 second precision for latencies in the 30 second
range). That is what is implemented in that metrics package.

I think this bucketing approach is popular now. There is a whole "HDR
histogram" library that gives lots of different bucketing methods and
implements dynamic resizing so you don't have to specify an upper bound.
 https://github.com/HdrHistogram/HdrHistogram

> Whether this matters depends entirely on whether you want histograms broken down
> at the client, topic, partition, or broker level or just want overall metrics.
> If we just want per-server aggregates for histograms then I think the memory
> usage is not a huge issue. If you want a histogram per topic or client or
> partition and have 10k of these then you start talking about something like
> 1GB of memory with the yammer package, which is what we hit last time.
Getting percentiles on the client level is nice, percentiles are definitely
better than averages, but I'm not sure it is required.

-Jay

On Wed, Mar 25, 2015 at 9:43 PM, Neha Narkhede  wrote:

> Aditya,
>
> If we are doing a deep dive, one of the things to investigate would be
> memory/GC performance. IIRC, when I was looking into codahale at LinkedIn,
> I remember it having 

Re: Metrics package discussion

2015-03-25 Thread Jay Kreps
Here was my understanding of the issue last time.

The yammer metrics use a random sample of requests to estimate the
histogram. This allocates a fairly large array of longs (their values are
longs rather than floats). A reasonable sample might be 8k entries which
would give about 64KB per histogram. There are bounds on accuracy, but they
are only probabilistic. I.e. if you try to get 99% < 5 ms of inaccuracy,
you will 1% of the time get more than this. This is okay, but if you try to
alert on it you realize that being wrong 1% of the time is a lot when you
are computing stats every second continuously on many metrics (i.e. 1 in
100 estimates will be outside your bound). This array is copied in full
every time you check the metric which is the other cause of the memory
pressure.

The better approach to histograms is to calculate bucket boundaries and
record arbitrarily many values in those buckets. A simple bucketing
approach for latency would be 0, 5ms, 10ms, 15ms, etc, and you just count
how many fall in each bucket. Your precision is deterministically bounded
by the bucket boundaries, so if you had 5ms buckets you would never have
more than 5ms loss of precision. By using non-uniform bucket sizes you can
make this work even better (e.g. give ~1ms precision for latencies in the
1ms range, but give only 1 second precision for latencies in the 30 second
range). That is what is implemented in that metrics package.
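
A minimal sketch of that bucketing idea (fixed, possibly non-uniform upper bounds 
with one counter per bucket; this is purely illustrative, not the actual 
implementation in either metrics package):

    /** Illustrative fixed-bucket histogram: precision is bounded by bucket width. */
    class BucketedHistogram {
        private final long[] upperBounds;   // ascending bucket upper bounds, e.g. in ms
        private final long[] counts;        // one extra slot for values above the last bound

        BucketedHistogram(long[] upperBounds) {
            this.upperBounds = upperBounds;
            this.counts = new long[upperBounds.length + 1];
        }

        void record(long value) {
            int i = 0;
            while (i < upperBounds.length && value > upperBounds[i]) {
                i++;
            }
            counts[i]++;
        }

        /** Upper bound of the bucket containing the q-th quantile (q in [0, 1]). */
        long quantileUpperBound(double q) {
            long total = 0;
            for (long c : counts) {
                total += c;
            }
            long target = (long) Math.ceil(q * total);
            long seen = 0;
            for (int i = 0; i < counts.length; i++) {
                seen += counts[i];
                if (seen >= target) {
                    return i < upperBounds.length ? upperBounds[i] : Long.MAX_VALUE;
                }
            }
            return Long.MAX_VALUE;
        }
    }

Non-uniform bounds such as {1, 2, 5, 10, 20, 50, 100, 500, 1000, 5000, 30000} give 
~1 ms precision for 1 ms latencies but only coarse precision near 30 seconds, which 
is the trade-off described above.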

I think this bucketing approach is popular now. There is a whole "HDR
histogram" library that gives lots of different bucketing methods and
implements dynamic resizing so you don't have to specify an upper bound.
 https://github.com/HdrHistogram/HdrHistogram
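
Basic usage of that library is small, something along these lines (API names from 
memory, so treat the details as approximate):

    import org.HdrHistogram.Histogram;

    public class HdrExample {
        public static void main(String[] args) {
            // Track values up to one hour (in microseconds) with 3 significant digits.
            Histogram latencies = new Histogram(3_600_000_000L, 3);
            for (long i = 1; i <= 10_000; i++) {
                latencies.recordValue(i);   // record a latency sample in micros
            }
            System.out.println("p99 = " + latencies.getValueAtPercentile(99.0) + " us");
        }
    }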

Whether this matters depends entirely on whether you want histograms broken down
at the client, topic, partition, or broker level or just want overall metrics.
If we just want per-server aggregates for histograms then I think the memory
usage is not a huge issue. If you want a histogram per topic or client or
partition and have 10k of these then you start talking about something like
1GB of memory with the yammer package, which is what we hit last time.
Getting percentiles on the client level is nice, percentiles are definitely
better than averages, but I'm not sure it is required.

-Jay

On Wed, Mar 25, 2015 at 9:43 PM, Neha Narkhede  wrote:

> Aditya,
>
> If we are doing a deep dive, one of the things to investigate would be
> memory/GC performance. IIRC, when I was looking into codahale at LinkedIn,
> I remember it having quite a few memory management and GC issues while
> using histograms. In comparison, histograms in the new metrics package
> aren't very well tested.
>
> Thanks,
> Neha
>
> On Wed, Mar 25, 2015 at 8:25 AM, Aditya Auradkar <
> aaurad...@linkedin.com.invalid> wrote:
>
> > Hey everyone,
> >
> > Picking up this discussion after yesterday's KIP hangout. For anyone who
> > did not join the meeting, we have 2 different metrics packages being used
> > by the clients (custom package) and the server (codahale). We are
> > discussing whether to migrate the server to the new package.
> >
> > What information do we need in order to make a decision?
> >
> > Some pros of the new package:
> > - Using the most recent information by combining data from previous and
> > current samples. I'm not sure how codahale does this so I'll investigate.
> > - We can quota on anything we measure. This is pretty cool IMO. I'll
> > investigate the feasibility of adding this feature to codahale.
> > - Hierarchical metrics. For example: we can define a sensor for overall
> > bytes-in/bytes-out and also per-client. Updating the client sensor will
> > cause the global byte rate sensor to get modified too.
> >
> > What are some of the issues with codahale? One previous discussion
> > mentions high memory usage but I don't have any experience with it
> > myself.
> >
> > Thanks,
> > Aditya
> >
> >
> >
> >
> >
>
>
> --
> Thanks,
> Neha
>


Re: Metrics package discussion

2015-03-25 Thread Neha Narkhede
Aditya,

If we are doing a deep dive, one of the things to investigate would be
memory/GC performance. IIRC, when I was looking into codahale at LinkedIn,
I remember it having quite a few memory management and GC issues while
using histograms. In comparison, histograms in the new metrics package
aren't very well tested.

Thanks,
Neha

On Wed, Mar 25, 2015 at 8:25 AM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> Hey everyone,
>
> Picking up this discussion after yesterday's KIP hangout. For anyone who
> did not join the meeting, we have 2 different metrics packages being used
> by the clients (custom package) and the server (codahale). We are
> discussing whether to migrate the server to the new package.
>
> What information do we need in order to make a decision?
>
> Some pros of the new package:
> - Using the most recent information by combining data from previous and
> current samples. I'm not sure how codahale does this so I'll investigate.
> - We can quota on anything we measure. This is pretty cool IMO. I'll
> investigate the feasibility of adding this feature to codahale.
> - Hierarchical metrics. For example: we can define a sensor for overall
> bytes-in/bytes-out and also per-client. Updating the client sensor will
> cause the global byte rate sensor to get modified too.
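
To make the hierarchy point in the last bullet concrete, parent/child sensors in 
the new client metrics package look roughly like this (written from memory of the 
org.apache.kafka.common.metrics API, so constructors and signatures may differ 
slightly from the exact version in tree):

    import java.util.Collections;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Rate;

    public class HierarchicalSensors {
        public static void main(String[] args) {
            Metrics metrics = new Metrics();

            // Global bytes-in sensor.
            Sensor allBytesIn = metrics.sensor("bytes-in-total");
            allBytesIn.add(new MetricName("byte-rate", "demo", "overall byte rate",
                    Collections.<String, String>emptyMap()), new Rate());

            // Per-client sensor with the global sensor as its parent: recording on
            // the child also updates the parent, which is the hierarchy described above.
            Sensor clientBytesIn = metrics.sensor("client-A.bytes-in", allBytesIn);
            clientBytesIn.add(new MetricName("byte-rate", "demo", "client A byte rate",
                    Collections.singletonMap("client-id", "A")), new Rate());

            clientBytesIn.record(1024.0);   // counts toward both the client and global rate
            metrics.close();
        }
    }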
>
> What are some of the issues with codahale? One previous discussion
> mentions high memory usage but I don't have any experience with it myself.
>
> Thanks,
> Aditya
>
>
>
>
>


-- 
Thanks,
Neha