[DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-03-29 Thread Guozhang Wang
Hello folks,

I'd like to propose the following KIP to improve the Kafka Streams metrics
mechanism for users. This includes 1) a minor change in the public
StreamsMetrics API, and 2) a major cleanup of the Streams' own built-in
metrics hierarchy.

Details can be found here:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams

I'd love to hear your thoughts and feedback. Thanks!

-- 
-- Guozhang


Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-04-26 Thread Matthias J. Sax
There was a discussion about metrics on KIP-439. It would be good if we
could incorporate those thoughts into this discussion.

-Matthias






Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-05-23 Thread Patrik Kleindl
Hi Guozhang
Thanks for the KIP, this looks very helpful.
Could you please provide more detail on the metrics planned for the state?
We were just considering how to implement this ourselves because we need to
track the history of state changes.
The idea was to have an accumulated "seconds in state x" metric for every
state.
The new rebalance metric might solve part of our use case, but we are
curious what you have planned for the state metric.
best regards
Patrik
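The accumulated "seconds in state x" idea can be prototyped on the application side today with a plain listener object. Below is a minimal sketch; in a real application the `onChange` callback would be wired up via `KafkaStreams.setStateListener`, and the nested `State` enum here is a simplified stand-in for `KafkaStreams.State`:

```java
import java.util.EnumMap;
import java.util.Map;

public class StateTimeTracker {
    // Simplified stand-in for KafkaStreams.State
    enum State { CREATED, REBALANCING, RUNNING, PENDING_SHUTDOWN, NOT_RUNNING }

    private final Map<State, Long> millisInState = new EnumMap<>(State.class);
    private State currentState = State.CREATED;
    private long lastTransitionMillis;

    public StateTimeTracker(long nowMillis) {
        this.lastTransitionMillis = nowMillis;
    }

    // In real use, called from KafkaStreams.StateListener#onChange
    public synchronized void onChange(State newState, long nowMillis) {
        millisInState.merge(currentState, nowMillis - lastTransitionMillis, Long::sum);
        currentState = newState;
        lastTransitionMillis = nowMillis;
    }

    // Accumulated "seconds in state x", including time spent in the current state
    public synchronized long secondsInState(State state, long nowMillis) {
        long accumulated = millisInState.getOrDefault(state, 0L);
        if (state == currentState) {
            accumulated += nowMillis - lastTransitionMillis;
        }
        return accumulated / 1000;
    }
}
```

The tracker only sees transition events, so it credits the current state with the open interval up to "now" when queried.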



Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-05-23 Thread Guozhang Wang
Hello Patrik,

Since we are rolling out 2.3 and everyone is busy with the release right
now, this KIP has not had much discussion yet and will slip into the next
release cadence.

This KIP contains several parts: 1. refactoring the existing metrics
hierarchy to clean up some redundancy and gain more clarity; 2. adding
instance-level metrics like rebalance and state metrics, as well as other
static metrics.


Guozhang





-- 
-- Guozhang


Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-06-27 Thread Guozhang Wang
Hello folks,

As 2.3 is released now, I'd like to bump up this KIP discussion again for
your reviews.


Guozhang




-- 
-- Guozhang


Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-01 Thread Bruno Cadonna
Hi Guozhang,

Thank you for the KIP.

1) As far as I understand, the StreamsMetrics interface is there for
user-defined processors. Would it make sense to also add a method to
the interface to specify a sensor that records skipped records?

2) What are the semantics of active-task-process and standby-task-process?

3) How do dropped-late-records and expired-window-record-drop relate
to each other? I guess the former is for records that fall outside the
grace period and the latter is for records that are processed after
the retention period of the window. Is this correct?

4) Is there an actual difference between skipped and dropped records?
If not, shall we unify the terminology?

5) What happens with removed metrics when the user sets the version of
"built.in.metrics.version" to 2.2-

Best,
Bruno



Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-01 Thread Boyang Chen
Hey Guozhang,

do we plan to add per partition latency in this KIP?



Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-03 Thread Bruno Cadonna
Hi Guozhang,

I just noticed that the Per-State-Store tags are somehow mixed up in
the KIP (e.g. "client-id=[threadId]").

Best,
Bruno



Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-18 Thread Guozhang Wang
Hello Bruno,

Thanks for the feedbacks, replied inline.

On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna  wrote:

> Hi Guozhang,
>
> Thank you for the KIP.
>
> 1) As far as I understand, the StreamsMetrics interface is there for
> user-defined processors. Would it make sense to also add a method to
> the interface to specify a sensor that records skipped records?
>
Not sure I follow: if users want to add a specific skipped-records
sensor, they can still do that as a "throughput" sensor via
"addThroughputSensor" and then call "record", right?

As an afterthought, maybe it's better to rename `throughput` to `rate` in
the public APIs, since it is really meant for the latter semantics. I did
not change it just to make fewer API changes / deprecate fewer functions.
But if we feel it is important we can change it as well.
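To see why `rate` describes the semantics better than `throughput`, here is a minimal, hand-rolled sketch of what such a sensor conceptually measures. This is not the actual `Sensor` from `org.apache.kafka.common.metrics`; the class and method names are illustrative only:

```java
public class RateTotalSketch {
    private long total = 0;
    private final long windowStartMillis;

    public RateTotalSketch(long nowMillis) {
        this.windowStartMillis = nowMillis;
    }

    // Each processed record (or batch) bumps the running total
    public void record(long count) {
        total += count;
    }

    public long total() {
        return total;
    }

    // The "throughput" value is really a rate: events per second over the elapsed window
    public double rate(long nowMillis) {
        long elapsedMillis = nowMillis - windowStartMillis;
        return elapsedMillis == 0 ? 0.0 : total * 1000.0 / elapsedMillis;
    }
}
```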


> 2) What are the semantics of active-task-process and standby-task-process
>
Ah, good catch; I think I put it in the wrong column. Just some
explanations here: within a thread's looped iterations, it will first try
to process some records from the active tasks, and then see if there are
any standby tasks that can be processed as well (i.e. just reading from the
restore consumer and applying the records to the local stores). The ratio
metrics indicate 1) which tasks (active or standby) this thread owns so
far, and 2) how much time, in percentage, it spends on each of them.

But this metric should really be a task-level one that includes both the
thread-id and the task-id, and upon task migrations the metrics will be
dynamically deleted / (re)created. Each task-id may be owned by multiple
threads, one as active and the others as standby, and hence the separation
of active / standby still seems necessary.



> 3) How do dropped-late-records and expired-window-record-drop relate
> to each other? I guess the former is for records that fall outside the
> grace period and the latter is for records that are processed after
> the retention period of the window. Is this correct?
>
Yes, that's correct. The names are indeed a bit confusing since they were
added in different releases historically.

More precisely, the `grace period` is a notion of the operator (hence the
metric is node-level, though it would only be used for DSL operators) while
the `retention` is a notion of the store (hence the metric is store-level).
Usually grace period will be smaller than store retention though.

The processor node is aware of the `grace period`: when it receives a
record that is older than the grace deadline, the record is dropped
immediately; otherwise it is still processed and maybe a new update is
"put" into the store. The store is aware of its `retention period`: upon a
"put" call, if it realizes the record is older than the retention deadline,
that put call is ignored and the metric is recorded.

We have to separate them here since the window store can be used in both
the DSL and the PAPI. In the former case a late record would likely already
be dropped at the processor node level due to the grace period, which is
usually smaller than the retention; but in the PAPI there's no grace
period, and hence the processor would likely still process the record and
call "put" on the store.
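The two drop conditions above can be sketched as simple predicates. The class and method names below are made up for illustration; only the two metric names in the comments come from this discussion:

```java
public final class LatenessChecks {
    private LatenessChecks() {}

    // Node-level check: the record is older than the operator's grace deadline
    // -> counted in dropped-late-records
    public static boolean droppedAtNode(long recordTs, long streamTime, long graceMs) {
        return recordTs < streamTime - graceMs;
    }

    // Store-level check: a put() targets a window already past the store's retention
    // -> counted in expired-window-record-drop
    public static boolean droppedAtStore(long recordTs, long observedStreamTime, long retentionMs) {
        return recordTs < observedStreamTime - retentionMs;
    }
}
```

With grace smaller than retention (the usual DSL setup), a too-late record trips the node-level check first and never reaches the store; in the PAPI, where no grace period exists, only the store-level check applies.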


> 4) Is there an actual difference between skipped and dropped records?
> If not, shall we unify the terminology?
>
>
There is. Dropped records are only due to lateness, whereas skipped
records can be due to serde errors (where the user's error handler
indicates "skip and continue"), timestamp errors, etc.

I've considered whether a better (more extensible) way would be defining a
single metric name, say skipped-records, but using different tags to
indicate the skipping reason (errors, windowing semantics, etc.). But
there's still a tricky difference: serde-caused skips, for example, happen
at the very beginning, before any effects have been taken at all. Others,
e.g. a null key / value at the reduce operator, are only skipped in the
middle of processing, i.e. some effects may already have been taken in
upstream sub-topologies. That's why I've defined skipped-records at both
the task level and the node level, and the aggregate of the latter may
still be smaller than the former, whereas dropped-records is node-level
only.

So how about an even more significant change then: we broaden
`dropped-late-records` into `dropped-records`, which is node-level only but
covers reasons from lateness to semantics (like a null key) as well; and
then we have a task-level-only `skipped-records` which records only those
dropped at the very beginning, which never made it into the processing
topology at all. I feel this is a clearer distinction but also a bigger
change for users.


> 5) What happens with removed metrics when the user sets the version of
> "built.in.metrics.version" to 2.2-
>
I think for those redundant ones like "forward-rate" and "destroy-rate"
we can still remove them with 2.2- as well; for other ones that are removed
/ replaced like thread-level skipped
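For reference, such a compatibility knob would be set like any other Streams config; the exact accepted values (e.g. whether a single version like "2.2" or a range is used) were still under discussion at this point, so this is only a sketch:

```properties
# Sketch: opt back into the pre-KIP-444 built-in metrics hierarchy
built.in.metrics.version=2.2
```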

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-18 Thread Guozhang Wang
Good catch! Will update the wiki.

On Wed, Jul 3, 2019 at 6:19 AM Bruno Cadonna  wrote:

> Hi Guozhang,
>
> I just noticed that the Per-State-Store tags are somehow mixed up in
> the KIP (e.g "client-id=[threadId]").
>
> Best,
> Bruno
>


-- 
-- Guozhang


Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-18 Thread Guozhang Wang
Hi Boyang,

What do you mean by `per partition latency`?

Guozhang

On Mon, Jul 1, 2019 at 9:28 AM Boyang Chen 
wrote:

> Hey Guozhang,
>
> do we plan to add per partition latency in this KIP?
>
> On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna  wrote:
>
> > Hi Guozhang,
> >
> > Thank you for the KIP.
> >
> > 1) As far as I understand, the StreamsMetrics interface is there for
> > user-defined processors. Would it make sense to also add a method to
> > the interface to specify a sensor that records skipped records?
> >
> > 2) What are the semantics of active-task-process and standby-task-process
> >
> > 3) How do dropped-late-records and expired-window-record-drop relate
> > to each other? I guess the former is for records that fall outside the
> > grace period and the latter is for records that are processed after
> > the retention period of the window. Is this correct?
> >
> > 4) Is there an actual difference between skipped and dropped records?
> > If not, shall we unify the terminology?
> >
> > 5) What happens with removed metrics when the user sets the version of
> > "built.in.metrics.version" to 2.2-
> >
> > Best,
> > Bruno
> >
> > On Thu, Jun 27, 2019 at 6:11 PM Guozhang Wang 
> wrote:
> > >
> > > Hello folks,
> > >
> > > As 2.3 is released now, I'd like to bump up this KIP discussion again
> for
> > > your reviews.
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Thu, May 23, 2019 at 4:44 PM Guozhang Wang 
> > wrote:
> > >
> > > > Hello Patrik,
> > > >
> > > > Since we are rolling out 2.3 and everyone is busy with the release
> now
> > > > this KIP does not have much discussion involved yet and will slip
> into
> > the
> > > > next release cadence.
> > > >
> > > > This KIP itself contains several parts itself: 1. refactoring the
> > existing
> > > > metrics hierarchy to cleanup some redundancy and also get more
> > clarity; 2.
> > > > add instance-level metrics like rebalance and state metrics, as well
> as
> > > > other static metrics.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > >
> > > > On Thu, May 23, 2019 at 5:34 AM Patrik Kleindl 
> > wrote:
> > > >
> > > >> Hi Guozhang
> > > >> Thanks for the KIP, this looks very helpful.
> > > >> Could you please provide more detail on the metrics planned for the
> > state?
> > > >> We were just considering how to implement this ourselves because we
> > need
> > > >> to
> > > >> track the history of stage changes.
> > > >> The idea was to have an accumulated "seconds in state x" metric for
> > every
> > > >> state.
> > > >> The new rebalance metric might solve part of our use case, but it is
> > > >> interesting what you have planned for the state metric.
> > > >> best regards
> > > >> Patrik
> > > >>
> > > >> On Fri, 29 Mar 2019 at 18:56, Guozhang Wang 
> > wrote:
> > > >>
> > > >> > Hello folks,
> > > >> >
> > > >> > I'd like to propose the following KIP to improve the Kafka Streams
> > > >> metrics
> > > >> > mechanism to users. This includes 1) a minor change in the public
> > > >> > StreamsMetrics API, and 2) a major cleanup on the Streams' own
> > built-in
> > > >> > metrics hierarchy.
> > > >> >
> > > >> > Details can be found here:
> > > >> >
> > > >> >
> > > >> >
> > > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams
> > > >> >
> > > >> > I'd love to hear your thoughts and feedbacks. Thanks!
> > > >> >
> > > >> > --
> > > >> > -- Guozhang
> > > >> >
> > > >>
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> >
>


-- 
-- Guozhang


Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-18 Thread Boyang Chen
I mean the partition time.

On Thu, Jul 18, 2019 at 11:29 AM Guozhang Wang  wrote:

> Hi Boyang,
>
> What do you mean by `per partition latency`?
>
> Guozhang
>


Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-22 Thread Guozhang Wang
Thanks Boyang,

I've thought about exposing time via metrics in Streams. The tricky part
though is which layer of time we should expose: right now we have
task-level and partition-level stream time (what you suggested), and some
processors also internally maintain their own observed time. We are still
trying to find a clear and simple way to expose a single time concept that
users can reason about their application's progress with. So until we come
up with a good solution, I'd postpone adding it to a future KIP.


Guozhang


On Thu, Jul 18, 2019 at 1:21 PM Boyang Chen 
wrote:

> I mean the partition time.
>
> On Thu, Jul 18, 2019 at 11:29 AM Guozhang Wang  wrote:
>
> > Hi Boyang,
> >
> > What do you mean by `per partition latency`?
> >
> > Guozhang
> >
> > On Mon, Jul 1, 2019 at 9:28 AM Boyang Chen 
> > wrote:
> >
> > > Hey Guozhang,
> > >
> > > do we plan to add per partition latency in this KIP?
> > >
> > > On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
> wrote:
> > >
> > > > Hi Guozhang,
> > > >
> > > > Thank you for the KIP.
> > > >
> > > > 1) As far as I understand, the StreamsMetrics interface is there for
> > > > user-defined processors. Would it make sense to also add a method to
> > > > the interface to specify a sensor that records skipped records?
> > > >
> > > > 2) What are the semantics of active-task-process and
> > standby-task-process
> > > >
> > > > 3) How do dropped-late-records and expired-window-record-drop relate
> > > > to each other? I guess the former is for records that fall outside
> the
> > > > grace period and the latter is for records that are processed after
> > > > the retention period of the window. Is this correct?
> > > >
> > > > 4) Is there an actual difference between skipped and dropped records?
> > > > If not, shall we unify the terminology?
> > > >
> > > > 5) What happens with removed metrics when the user sets the version
> of
> > > > "built.in.metrics.version" to 2.2-
> > > >
> > > > Best,
> > > > Bruno
> > > >
> > > > On Thu, Jun 27, 2019 at 6:11 PM Guozhang Wang 
> > > wrote:
> > > > >
> > > > > Hello folks,
> > > > >
> > > > > As 2.3 is released now, I'd like to bump up this KIP discussion
> again
> > > for
> > > > > your reviews.
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > > On Thu, May 23, 2019 at 4:44 PM Guozhang Wang 
> > > > wrote:
> > > > >
> > > > > > Hello Patrik,
> > > > > >
> > > > > > Since we are rolling out 2.3 and everyone is busy with the
> release
> > > now
> > > > > > this KIP does not have much discussion involved yet and will slip
> > > into
> > > > the
> > > > > > next release cadence.
> > > > > >
> > > > > > This KIP itself contains several parts itself: 1. refactoring the
> > > > existing
> > > > > > metrics hierarchy to cleanup some redundancy and also get more
> > > > clarity; 2.
> > > > > > add instance-level metrics like rebalance and state metrics, as
> > well
> > > as
> > > > > > other static metrics.
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, May 23, 2019 at 5:34 AM Patrik Kleindl <
> pklei...@gmail.com
> > >
> > > > wrote:
> > > > > >
> > > > > >> Hi Guozhang
> > > > > >> Thanks for the KIP, this looks very helpful.
> > > > > >> Could you please provide more detail on the metrics planned for
> > the
> > > > state?
> > > > > >> We were just considering how to implement this ourselves because
> > we
> > > > need
> > > > > >> to
> > > > > >> track the history of stage changes.
> > > > > >> The idea was to have an accumulated "seconds in state x" metric
> > for
> > > > every
> > > > > >> state.
> > > > > >> The new rebalance metric might solve part of our use case, but
> it
> > is
> > > > > >> interesting what you have planned for the state metric.
> > > > > >> best regards
> > > > > >> Patrik
> > > > > >>
> > > > > >> On Fri, 29 Mar 2019 at 18:56, Guozhang Wang  >
> > > > wrote:
> > > > > >>
> > > > > >> > Hello folks,
> > > > > >> >
> > > > > >> > I'd like to propose the following KIP to improve the Kafka
> > Streams
> > > > > >> metrics
> > > > > >> > mechanism to users. This includes 1) a minor change in the
> > public
> > > > > >> > StreamsMetrics API, and 2) a major cleanup on the Streams' own
> > > > built-in
> > > > > >> > metrics hierarchy.
> > > > > >> >
> > > > > >> > Details can be found here:
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams
> > > > > >> >
> > > > > >> > I'd love to hear your thoughts and feedbacks. Thanks!
> > > > > >> >
> > > > > >> > --
> > > > > >> > -- Guozhang
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang


Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-22 Thread Guozhang Wang
Thanks everyone for your inputs, I've updated the wiki page accordingly.

@Bruno: please let me know if you have any further thoughts per my replies
above.


Guozhang


On Mon, Jul 22, 2019 at 6:30 PM Guozhang Wang  wrote:

> Thanks Boyang,
>
> I've thought about exposing time via metrics in Streams. The tricky part
> though is which layer of time we should expose: right now we have
> task-level and partition-level stream time (what you suggested), and also
> some processor internally maintain their own observed time. Today we are
> still trying to get a clear and simple way of exposing a single time
> concept for users to reason about their application's progress. So before
> we come up with a good solution I'd postpone adding it in a future KIP.
>
>
> Guozhang
>
>
> On Thu, Jul 18, 2019 at 1:21 PM Boyang Chen 
> wrote:
>
>> I mean the partition time.
>>
>> On Thu, Jul 18, 2019 at 11:29 AM Guozhang Wang 
>> wrote:
>>
>> > Hi Boyang,
>> >
>> > What do you mean by `per partition latency`?
>> >
>> > Guozhang
>> >
>> > On Mon, Jul 1, 2019 at 9:28 AM Boyang Chen 
>> > wrote:
>> >
>> > > Hey Guozhang,
>> > >
>> > > do we plan to add per partition latency in this KIP?
>> > >
>> > > On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
>> wrote:
>> > >
>> > > > Hi Guozhang,
>> > > >
>> > > > Thank you for the KIP.
>> > > >
>> > > > 1) As far as I understand, the StreamsMetrics interface is there for
>> > > > user-defined processors. Would it make sense to also add a method to
>> > > > the interface to specify a sensor that records skipped records?
>> > > >
>> > > > 2) What are the semantics of active-task-process and
>> > standby-task-process
>> > > >
>> > > > 3) How do dropped-late-records and expired-window-record-drop relate
>> > > > to each other? I guess the former is for records that fall outside
>> the
>> > > > grace period and the latter is for records that are processed after
>> > > > the retention period of the window. Is this correct?
>> > > >
>> > > > 4) Is there an actual difference between skipped and dropped
>> records?
>> > > > If not, shall we unify the terminology?
>> > > >
>> > > > 5) What happens with removed metrics when the user sets the version
>> of
>> > > > "built.in.metrics.version" to 2.2-
>> > > >
>> > > > Best,
>> > > > Bruno
>> > > >
>> > > > On Thu, Jun 27, 2019 at 6:11 PM Guozhang Wang 
>> > > wrote:
>> > > > >
>> > > > > Hello folks,
>> > > > >
>> > > > > As 2.3 is released now, I'd like to bump up this KIP discussion
>> again
>> > > for
>> > > > > your reviews.
>> > > > >
>> > > > >
>> > > > > Guozhang
>> > > > >
>> > > > >
>> > > > > On Thu, May 23, 2019 at 4:44 PM Guozhang Wang > >
>> > > > wrote:
>> > > > >
>> > > > > > Hello Patrik,
>> > > > > >
>> > > > > > Since we are rolling out 2.3 and everyone is busy with the
>> release
>> > > now
>> > > > > > this KIP does not have much discussion involved yet and will
>> slip
>> > > into
>> > > > the
>> > > > > > next release cadence.
>> > > > > >
>> > > > > > This KIP itself contains several parts itself: 1. refactoring
>> the
>> > > > existing
>> > > > > > metrics hierarchy to cleanup some redundancy and also get more
>> > > > clarity; 2.
>> > > > > > add instance-level metrics like rebalance and state metrics, as
>> > well
>> > > as
>> > > > > > other static metrics.
>> > > > > >
>> > > > > >
>> > > > > > Guozhang
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Thu, May 23, 2019 at 5:34 AM Patrik Kleindl <
>> pklei...@gmail.com
>> > >
>> > > > wrote:
>> > > > > >
>> > > > > >> Hi Guozhang
>> > > > > >> Thanks for the KIP, this looks very helpful.
>> > > > > >> Could you please provide more detail on the metrics planned for
>> > the
>> > > > state?
>> > > > > >> We were just considering how to implement this ourselves
>> because
>> > we
>> > > > need
>> > > > > >> to
>> > > > > >> track the history of stage changes.
>> > > > > >> The idea was to have an accumulated "seconds in state x" metric
>> > for
>> > > > every
>> > > > > >> state.
>> > > > > >> The new rebalance metric might solve part of our use case, but
>> it
>> > is
>> > > > > >> interesting what you have planned for the state metric.
>> > > > > >> best regards
>> > > > > >> Patrik
>> > > > > >>
>> > > > > >> On Fri, 29 Mar 2019 at 18:56, Guozhang Wang <
>> wangg...@gmail.com>
>> > > > wrote:
>> > > > > >>
>> > > > > >> > Hello folks,
>> > > > > >> >
>> > > > > >> > I'd like to propose the following KIP to improve the Kafka
>> > Streams
>> > > > > >> metrics
>> > > > > >> > mechanism to users. This includes 1) a minor change in the
>> > public
>> > > > > >> > StreamsMetrics API, and 2) a major cleanup on the Streams'
>> own
>> > > > built-in
>> > > > > >> > metrics hierarchy.
>> > > > > >> >
>> > > > > >> > Details can be found here:
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >
>> > > > > >>
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams
>> > > > > >> >
>> > > > > >> > I'd love to hear your thoughts

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-07-23 Thread Matthias J. Sax
Thanks for the KIP Guozhang.

I just re-read the wiki page and the DISCUSS thread. Overall LGTM.

The only nit is the naming of the new config values. With AK 2.3 now
released, the version numbers need to be updated.

Additionally, I don't think "2.2-" and "2.3" are the best names: the `-`
suffix is very subtle IMHO and actually looks more like a typo; it might
be better to be more explicit. Maybe something like "up-to-2.2"?

For "2.3", this config value would be weird for future releases (i.e.,
2.4, 2.5, 2.6). Hence, we might want to rename it to "newest" /
"current" or something like this.

Another alternative would be to rename it to "since-2.3" (or similar) --
however, this may require renaming the config if we change metrics in a
future release (hence, it's not my preferred option).

Thoughts?
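To make the options concrete, the config under discussion would be set
like this (the value names are placeholders until we settle the naming):

```properties
# current KIP wording; the "-" suffix is easy to misread
built.in.metrics.version=2.2-

# more explicit alternatives
built.in.metrics.version=up-to-2.2
built.in.metrics.version=current
```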


-Matthias

On 7/22/19 6:33 PM, Guozhang Wang wrote:
> Thanks everyone for your inputs, I've updated the wiki page accordingly.
> 
> @Bruno: please let me know if you have any further thoughts per my replies
> above.
> 
> 
> Guozhang
> 
> 
> On Mon, Jul 22, 2019 at 6:30 PM Guozhang Wang  wrote:
> 
>> Thanks Boyang,
>>
>> I've thought about exposing time via metrics in Streams. The tricky part
>> though is which layer of time we should expose: right now we have
>> task-level and partition-level stream time (what you suggested), and also
>> some processor internally maintain their own observed time. Today we are
>> still trying to get a clear and simple way of exposing a single time
>> concept for users to reason about their application's progress. So before
>> we come up with a good solution I'd postpone adding it in a future KIP.
>>
>>
>> Guozhang
>>
>>
>> On Thu, Jul 18, 2019 at 1:21 PM Boyang Chen 
>> wrote:
>>
>>> I mean the partition time.
>>>
>>> On Thu, Jul 18, 2019 at 11:29 AM Guozhang Wang 
>>> wrote:
>>>
 Hi Boyang,

 What do you mean by `per partition latency`?

 Guozhang

 On Mon, Jul 1, 2019 at 9:28 AM Boyang Chen 
 wrote:

> Hey Guozhang,
>
> do we plan to add per partition latency in this KIP?
>
> On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
>>> wrote:
>
>> Hi Guozhang,
>>
>> Thank you for the KIP.
>>
>> 1) As far as I understand, the StreamsMetrics interface is there for
>> user-defined processors. Would it make sense to also add a method to
>> the interface to specify a sensor that records skipped records?
>>
>> 2) What are the semantics of active-task-process and
 standby-task-process
>>
>> 3) How do dropped-late-records and expired-window-record-drop relate
>> to each other? I guess the former is for records that fall outside
>>> the
>> grace period and the latter is for records that are processed after
>> the retention period of the window. Is this correct?
>>
>> 4) Is there an actual difference between skipped and dropped
>>> records?
>> If not, shall we unify the terminology?
>>
>> 5) What happens with removed metrics when the user sets the version
>>> of
>> "built.in.metrics.version" to 2.2-
>>
>> Best,
>> Bruno
>>
>> On Thu, Jun 27, 2019 at 6:11 PM Guozhang Wang 
> wrote:
>>>
>>> Hello folks,
>>>
>>> As 2.3 is released now, I'd like to bump up this KIP discussion
>>> again
> for
>>> your reviews.
>>>
>>>
>>> Guozhang
>>>
>>>
>>> On Thu, May 23, 2019 at 4:44 PM Guozhang Wang >>>
>> wrote:
>>>
 Hello Patrik,

 Since we are rolling out 2.3 and everyone is busy with the
>>> release
> now
 this KIP does not have much discussion involved yet and will
>>> slip
> into
>> the
 next release cadence.

 This KIP itself contains several parts itself: 1. refactoring
>>> the
>> existing
 metrics hierarchy to cleanup some redundancy and also get more
>> clarity; 2.
 add instance-level metrics like rebalance and state metrics, as
 well
> as
 other static metrics.


 Guozhang



 On Thu, May 23, 2019 at 5:34 AM Patrik Kleindl <
>>> pklei...@gmail.com
>
>> wrote:

> Hi Guozhang
> Thanks for the KIP, this looks very helpful.
> Could you please provide more detail on the metrics planned for
 the
>> state?
> We were just considering how to implement this ourselves
>>> because
 we
>> need
> to
> track the history of stage changes.
> The idea was to have an accumulated "seconds in state x" metric
 for
>> every
> state.
> The new rebalance metric might solve part of our use case, but
>>> it
 is
> interesting what you have planned for the state metric.
> best regards
> Patrik
>
> On Fri, 29 Mar 2019 at 18:56, Guozhang Wang <
>>> wangg...@gmail.com>
>> wrote:

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-08-02 Thread Guozhang Wang
Hello Matthias,

That's a good question. Thinking about it a bit more, I'd like to propose
that we just list all the version numbers from 0.10 to 2.4 as accepted
values, and let Streams decide internally whether the old or the new set
of metrics should be used (implementation-wise we can reuse the const
values for `upgrade.from` as well).

Then, in the future when we remove the old metrics, we can simply remove
the corresponding version values from this config's accepted list.
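A self-contained sketch of that gating (not the actual Streams code; the
version list and the 2.4 cut-over point are assumptions for illustration):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: the accepted values of "built.in.metrics.version" mirror the
// version constants used for "upgrade.from", and Streams maps a
// configured value to either the old or the new built-in metrics set.
public class MetricsVersionConfig {
    // Hypothetical list mirroring the `upgrade.from` constants.
    static final List<String> ACCEPTED = Arrays.asList(
            "0.10.0", "0.10.1", "0.10.2", "0.11.0",
            "1.0", "1.1", "2.0", "2.1", "2.2", "2.3", "2.4");

    // Assumed cut-over: the refactored metrics ship in 2.4.
    static boolean useLegacyMetrics(String configured) {
        int idx = ACCEPTED.indexOf(configured);
        if (idx < 0)
            throw new IllegalArgumentException("unknown version: " + configured);
        return idx < ACCEPTED.indexOf("2.4");
    }

    public static void main(String[] args) {
        if (!useLegacyMetrics("2.2")) throw new AssertionError();
        if (useLegacyMetrics("2.4")) throw new AssertionError();
    }
}
```

Removing the old metrics in a future release would then just mean dropping
the corresponding entries from the accepted list.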


Guozhang

On Tue, Jul 23, 2019 at 11:55 AM Matthias J. Sax 
wrote:

> Thanks for the KIP Guozhang.
>
> I just re-read the wiki page and the DISCUSS thread. Overall LGTM.
>
> The only nit is the naming of the new config values. With AK 2.3 being
> released the versions numbers needs to be updated.
>
> Additionally, I actually think that "2.2-" and "2.3" are not the best
> names: the `-` suffix is very subtle IMHO and actually looks more like a
> typo, and it might be better to be more elaborate. Maybe something like
> "up-to-2.2" ?
>
> For "2.3", this config value would be weird for future releases (ie,
> 2.4, 2.5, 2.6). Hence, we might want to rename it to "newest" /
> "current" or something like this?
>
> Another alternative may be to rename it to "since-2.3" (or similar) --
> however, this may require to rename the config if we change metrics in a
> future release (hence, it's not my preferred option).
>
> Thoughts?
>
>
> -Matthias
>
> On 7/22/19 6:33 PM, Guozhang Wang wrote:
> > Thanks everyone for your inputs, I've updated the wiki page accordingly.
> >
> > @Bruno: please let me know if you have any further thoughts per my
> replies
> > above.
> >
> >
> > Guozhang
> >
> >
> > On Mon, Jul 22, 2019 at 6:30 PM Guozhang Wang 
> wrote:
> >
> >> Thanks Boyang,
> >>
> >> I've thought about exposing time via metrics in Streams. The tricky part
> >> though is which layer of time we should expose: right now we have
> >> task-level and partition-level stream time (what you suggested), and
> also
> >> some processor internally maintain their own observed time. Today we are
> >> still trying to get a clear and simple way of exposing a single time
> >> concept for users to reason about their application's progress. So
> before
> >> we come up with a good solution I'd postpone adding it in a future KIP.
> >>
> >>
> >> Guozhang
> >>
> >>
> >> On Thu, Jul 18, 2019 at 1:21 PM Boyang Chen  >
> >> wrote:
> >>
> >>> I mean the partition time.
> >>>
> >>> On Thu, Jul 18, 2019 at 11:29 AM Guozhang Wang 
> >>> wrote:
> >>>
>  Hi Boyang,
> 
>  What do you mean by `per partition latency`?
> 
>  Guozhang
> 
>  On Mon, Jul 1, 2019 at 9:28 AM Boyang Chen <
> reluctanthero...@gmail.com>
>  wrote:
> 
> > Hey Guozhang,
> >
> > do we plan to add per partition latency in this KIP?
> >
> > On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
> >>> wrote:
> >
> >> Hi Guozhang,
> >>
> >> Thank you for the KIP.
> >>
> >> 1) As far as I understand, the StreamsMetrics interface is there for
> >> user-defined processors. Would it make sense to also add a method to
> >> the interface to specify a sensor that records skipped records?
> >>
> >> 2) What are the semantics of active-task-process and
>  standby-task-process
> >>
> >> 3) How do dropped-late-records and expired-window-record-drop relate
> >> to each other? I guess the former is for records that fall outside
> >>> the
> >> grace period and the latter is for records that are processed after
> >> the retention period of the window. Is this correct?
> >>
> >> 4) Is there an actual difference between skipped and dropped
> >>> records?
> >> If not, shall we unify the terminology?
> >>
> >> 5) What happens with removed metrics when the user sets the version
> >>> of
> >> "built.in.metrics.version" to 2.2-
> >>
> >> Best,
> >> Bruno
> >>
> >> On Thu, Jun 27, 2019 at 6:11 PM Guozhang Wang 
> > wrote:
> >>>
> >>> Hello folks,
> >>>
> >>> As 2.3 is released now, I'd like to bump up this KIP discussion
> >>> again
> > for
> >>> your reviews.
> >>>
> >>>
> >>> Guozhang
> >>>
> >>>
> >>> On Thu, May 23, 2019 at 4:44 PM Guozhang Wang  
> >> wrote:
> >>>
>  Hello Patrik,
> 
>  Since we are rolling out 2.3 and everyone is busy with the
> >>> release
> > now
>  this KIP does not have much discussion involved yet and will
> >>> slip
> > into
> >> the
>  next release cadence.
> 
>  This KIP itself contains several parts itself: 1. refactoring
> >>> the
> >> existing
>  metrics hierarchy to cleanup some redundancy and also get more
> >> clarity; 2.
>  add instance-level metrics like rebalance and state metrics, as
>  well
> > as
>  other static metrics.
> 
> 
>  Guozhang
> 
> 
> >>>

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-08-02 Thread Matthias J. Sax
Would this imply that we need to update the config in each release to
add a new accepted value?


-Matthias

On 8/2/19 1:07 PM, Guozhang Wang wrote:
> Hello Matthias,
> 
> That's a good question. Thinking about a bit more, I'd like to propose that
> we just list all the version numbers since 0.10 to 2.4 as accepted values,
> and let Streams to decide if old / new set of metrics can be used
> internally (implementation wise we can reuse the const values for
> `upgrade.from` as well).
> 
> And then in the future when we remove the metrics, we can just remove the
> corresponding version values from the accepted list of this config.
> 
> 
> Guozhang
> 
> On Tue, Jul 23, 2019 at 11:55 AM Matthias J. Sax 
> wrote:
> 
>> Thanks for the KIP Guozhang.
>>
>> I just re-read the wiki page and the DISCUSS thread. Overall LGTM.
>>
>> The only nit is the naming of the new config values. With AK 2.3 being
>> released the versions numbers needs to be updated.
>>
>> Additionally, I actually think that "2.2-" and "2.3" are not the best
>> names: the `-` suffix is very subtle IMHO and actually looks more like a
>> typo, and it might be better to be more elaborate. Maybe something like
>> "up-to-2.2" ?
>>
>> For "2.3", this config value would be weird for future releases (ie,
>> 2.4, 2.5, 2.6). Hence, we might want to rename it to "newest" /
>> "current" or something like this?
>>
>> Another alternative may be to rename it to "since-2.3" (or similar) --
>> however, this may require to rename the config if we change metrics in a
>> future release (hence, it's not my preferred option).
>>
>> Thoughts?
>>
>>
>> -Matthias
>>
>> On 7/22/19 6:33 PM, Guozhang Wang wrote:
>>> Thanks everyone for your inputs, I've updated the wiki page accordingly.
>>>
>>> @Bruno: please let me know if you have any further thoughts per my
>> replies
>>> above.
>>>
>>>
>>> Guozhang
>>>
>>>
>>> On Mon, Jul 22, 2019 at 6:30 PM Guozhang Wang 
>> wrote:
>>>
 Thanks Boyang,

 I've thought about exposing time via metrics in Streams. The tricky part
 though is which layer of time we should expose: right now we have
 task-level and partition-level stream time (what you suggested), and
>> also
 some processor internally maintain their own observed time. Today we are
 still trying to get a clear and simple way of exposing a single time
 concept for users to reason about their application's progress. So
>> before
 we come up with a good solution I'd postpone adding it in a future KIP.


 Guozhang


 On Thu, Jul 18, 2019 at 1:21 PM Boyang Chen >>
 wrote:

> I mean the partition time.
>
> On Thu, Jul 18, 2019 at 11:29 AM Guozhang Wang 
> wrote:
>
>> Hi Boyang,
>>
>> What do you mean by `per partition latency`?
>>
>> Guozhang
>>
>> On Mon, Jul 1, 2019 at 9:28 AM Boyang Chen <
>> reluctanthero...@gmail.com>
>> wrote:
>>
>>> Hey Guozhang,
>>>
>>> do we plan to add per partition latency in this KIP?
>>>
>>> On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
> wrote:
>>>
 Hi Guozhang,

 Thank you for the KIP.

 1) As far as I understand, the StreamsMetrics interface is there for
 user-defined processors. Would it make sense to also add a method to
 the interface to specify a sensor that records skipped records?

 2) What are the semantics of active-task-process and
>> standby-task-process

 3) How do dropped-late-records and expired-window-record-drop relate
 to each other? I guess the former is for records that fall outside
> the
 grace period and the latter is for records that are processed after
 the retention period of the window. Is this correct?

 4) Is there an actual difference between skipped and dropped
> records?
 If not, shall we unify the terminology?

 5) What happens with removed metrics when the user sets the version
> of
 "built.in.metrics.version" to 2.2-

 Best,
 Bruno

 On Thu, Jun 27, 2019 at 6:11 PM Guozhang Wang 
>>> wrote:
>
> Hello folks,
>
> As 2.3 is released now, I'd like to bump up this KIP discussion
> again
>>> for
> your reviews.
>
>
> Guozhang
>
>
> On Thu, May 23, 2019 at 4:44 PM Guozhang Wang >
 wrote:
>
>> Hello Patrik,
>>
>> Since we are rolling out 2.3 and everyone is busy with the
> release
>>> now
>> this KIP does not have much discussion involved yet and will
> slip
>>> into
 the
>> next release cadence.
>>
>> This KIP itself contains several parts itself: 1. refactoring
> the
 existing
>> metrics hierarchy to cleanup some redundancy and also get more
>>>

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-08-02 Thread Guozhang Wang
For the existing releases, yes (with KIP-447 we are already going to do
that anyway); for future releases, maybe not --- hopefully we only do
such a metrics refactoring once.


Guozhang

On Fri, Aug 2, 2019 at 3:23 PM Matthias J. Sax 
wrote:

> Would this imply that we need to update the config in each release to
> add a new accepted value?
>
>
> -Matthias
>
> On 8/2/19 1:07 PM, Guozhang Wang wrote:
> > Hello Matthias,
> >
> > That's a good question. Thinking about a bit more, I'd like to propose
> that
> > we just list all the version numbers since 0.10 to 2.4 as accepted
> values,
> > and let Streams to decide if old / new set of metrics can be used
> > internally (implementation wise we can reuse the const values for
> > `upgrade.from` as well).
> >
> > And then in the future when we remove the metrics, we can just remove the
> > corresponding version values from the accepted list of this config.
> >
> >
> > Guozhang
> >
> > On Tue, Jul 23, 2019 at 11:55 AM Matthias J. Sax 
> > wrote:
> >
> >> Thanks for the KIP Guozhang.
> >>
> >> I just re-read the wiki page and the DISCUSS thread. Overall LGTM.
> >>
> >> The only nit is the naming of the new config values. With AK 2.3 being
> >> released the versions numbers needs to be updated.
> >>
> >> Additionally, I actually think that "2.2-" and "2.3" are not the best
> >> names: the `-` suffix is very subtle IMHO and actually looks more like a
> >> typo, and it might be better to be more elaborate. Maybe something like
> >> "up-to-2.2" ?
> >>
> >> For "2.3", this config value would be weird for future releases (ie,
> >> 2.4, 2.5, 2.6). Hence, we might want to rename it to "newest" /
> >> "current" or something like this?
> >>
> >> Another alternative may be to rename it to "since-2.3" (or similar) --
> >> however, this may require to rename the config if we change metrics in a
> >> future release (hence, it's not my preferred option).
> >>
> >> Thoughts?
> >>
> >>
> >> -Matthias
> >>
> >> On 7/22/19 6:33 PM, Guozhang Wang wrote:
> >>> Thanks everyone for your inputs, I've updated the wiki page
> accordingly.
> >>>
> >>> @Bruno: please let me know if you have any further thoughts per my
> >> replies
> >>> above.
> >>>
> >>>
> >>> Guozhang
> >>>
> >>>
> >>> On Mon, Jul 22, 2019 at 6:30 PM Guozhang Wang 
> >> wrote:
> >>>
>  Thanks Boyang,
> 
>  I've thought about exposing time via metrics in Streams. The tricky
> part
>  though is which layer of time we should expose: right now we have
>  task-level and partition-level stream time (what you suggested), and
> >> also
>  some processor internally maintain their own observed time. Today we
> are
>  still trying to get a clear and simple way of exposing a single time
>  concept for users to reason about their application's progress. So
> >> before
>  we come up with a good solution I'd postpone adding it in a future
> KIP.
> 
> 
>  Guozhang
> 
> 
>  On Thu, Jul 18, 2019 at 1:21 PM Boyang Chen <
> reluctanthero...@gmail.com
> >>>
>  wrote:
> 
> > I mean the partition time.
> >
> > On Thu, Jul 18, 2019 at 11:29 AM Guozhang Wang 
> > wrote:
> >
> >> Hi Boyang,
> >>
> >> What do you mean by `per partition latency`?
> >>
> >> Guozhang
> >>
> >> On Mon, Jul 1, 2019 at 9:28 AM Boyang Chen <
> >> reluctanthero...@gmail.com>
> >> wrote:
> >>
> >>> Hey Guozhang,
> >>>
> >>> do we plan to add per partition latency in this KIP?
> >>>
> >>> On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
> > wrote:
> >>>
>  Hi Guozhang,
> 
>  Thank you for the KIP.
> 
>  1) As far as I understand, the StreamsMetrics interface is there
> for
>  user-defined processors. Would it make sense to also add a method
> to
>  the interface to specify a sensor that records skipped records?
> 
>  2) What are the semantics of active-task-process and
> >> standby-task-process
> 
>  3) How do dropped-late-records and expired-window-record-drop
> relate
>  to each other? I guess the former is for records that fall outside
> > the
>  grace period and the latter is for records that are processed
> after
>  the retention period of the window. Is this correct?
> 
>  4) Is there an actual difference between skipped and dropped
> > records?
>  If not, shall we unify the terminology?
> 
>  5) What happens with removed metrics when the user sets the
> version
> > of
>  "built.in.metrics.version" to 2.2-
> 
>  Best,
>  Bruno
> 
>  On Thu, Jun 27, 2019 at 6:11 PM Guozhang Wang  >
> >>> wrote:
> >
> > Hello folks,
> >
> > As 2.3 is released now, I'd like to bump up this KIP discussion
> > again
> >>> for
> > your reviews.
> >
> >
> > Guozhan

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-08-06 Thread Bruno Cadonna
Hi Guozhang,

I left my comments inline.

On Thu, Jul 18, 2019 at 8:28 PM Guozhang Wang  wrote:
>
> Hello Bruno,
>
> Thanks for the feedbacks, replied inline.
>
> On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna  wrote:
>
> > Hi Guozhang,
> >
> > Thank you for the KIP.
> >
> > 1) As far as I understand, the StreamsMetrics interface is there for
> > user-defined processors. Would it make sense to also add a method to
> > the interface to specify a sensor that records skipped records?
> >
> > Not sure I follow.. if users want to add a specific skipped records
> sensor, she can still do that as a "throughput" sensor via "
> addThroughputSensor" and then "record" right?
>
> As an after-thought, maybe it's better to rename `throughput` to `rate` in
> the public APIs since it is really meant for the latter semantics. I did
> not change it just to make less API changes / deprecate fewer functions.
> But if we feel it is important we can change it as well.
>

I see now that a user can record the rate of skipped records. However,
I was referring to the total number of skipped records. Maybe my
question should be more general: should we allow the user to also
specify sensors for totals, or for combinations of rates and totals?

Regarding the naming, I like `rate` more than `throughput`, but I
would not fight for it.
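To make sure we mean the same thing, here is a toy sketch of the
rate-vs-total distinction (not the Streams metrics implementation; all
names are illustrative):

```java
// A combined sensor could answer both "how many in total" and "how many
// per second" from the same record() calls: the total is a monotonically
// increasing count, the rate divides it by the observed interval.
public class RateTotalSensor {
    long total = 0;
    long firstMs = -1;
    long lastMs = -1;

    void record(long nowMs) {
        if (firstMs < 0) firstMs = nowMs;
        lastMs = nowMs;
        total++;  // the "total" view
    }

    // the "rate" view: events per second over the observed interval
    double rate() {
        long elapsedMs = Math.max(1, lastMs - firstMs);
        return total * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        RateTotalSensor s = new RateTotalSensor();
        s.record(0); s.record(500); s.record(1000);
        if (s.total != 3) throw new AssertionError();
        if (s.rate() != 3.0) throw new AssertionError();
    }
}
```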

>
> > 2) What are the semantics of active-task-process and standby-task-process
> >
> > Ah good catch, I think I made it in the wrong column. Just some
> explanations here: Within a thread's looped iterations, it will first try
> to process some records from the active tasks, and then see if there are
> any standby-tasks that can be processed as well (i.e. just reading from the
> restore consumer and apply to the local stores). The ratio metrics are for
> indicating 1) what tasks (active or standby) does this thread own so far,
> and 2) how much time in percentage does it spend on each of them.
>
> But this metric should really be a task-level one that includes both the
> thread-id and task-id, and upon task migrations they will be dynamically
> deleted / (re)-created. For each task-id it may be owned by multiple
> threads as one active and others standby, and hence the separation of
> active / standby seems still necessary.
>

Makes sense.
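For the record, the per-task ratio idea could be modeled like this (a toy
sketch, not Streams code; metric and task names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// A thread tracks how much of its loop time goes to each owned task,
// split into active and standby processing; the ratios then show both
// which tasks the thread owns and where its time is spent.
public class TaskRatioTracker {
    final Map<String, Long> activeNs = new HashMap<>();
    final Map<String, Long> standbyNs = new HashMap<>();
    long totalNs = 0;

    void record(String taskId, boolean active, long ns) {
        (active ? activeNs : standbyNs).merge(taskId, ns, Long::sum);
        totalNs += ns;
    }

    // e.g. the active-task-process ratio for one task on this thread
    double activeTaskProcessRatio(String taskId) {
        return totalNs == 0 ? 0.0
                : activeNs.getOrDefault(taskId, 0L) / (double) totalNs;
    }

    public static void main(String[] args) {
        TaskRatioTracker t = new TaskRatioTracker();
        t.record("0_0", true, 60);   // active task 0_0
        t.record("1_0", false, 40);  // standby task 1_0
        if (t.activeTaskProcessRatio("0_0") != 0.6) throw new AssertionError();
    }
}
```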


>
>
> > 3) How do dropped-late-records and expired-window-record-drop relate
> > to each other? I guess the former is for records that fall outside the
> > grace period and the latter is for records that are processed after
> > the retention period of the window. Is this correct?
> >
> > Yes, that's correct. The names are indeed a bit confusing since they are
> added at different releases historically..
>
> More precisely, the `grace period` is a notion of the operator (hence the
> metric is node-level, though it would only be used for DSL operators) while
> the `retention` is a notion of the store (hence the metric is store-level).
> Usually grace period will be smaller than store retention though.
>
> The processor node is aware of the `grace period`, and when it receives a
> record that is older than the grace deadline, the record is dropped
> immediately; otherwise it is still processed and maybe a new update is
> "put" into the store. The store is aware of its `retention period`, and
> upon a "put" call, if it realizes the record is older than the retention
> deadline, that put call is ignored and the metric is recorded.
>
> We have to separate them here since the window store can be used in both
> DSL and PAPI. In the former case a late record would likely already be
> ignored at the processor node level due to the grace period, which is
> usually smaller than the retention; but in PAPI there's no grace period and
> hence the processor would likely still process the record and call "put" on
> the store.
>

Alright! Got it!
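The two independent lateness checks can be sketched as follows (a toy model of the explanation above, not the real processor/store code paths): in the DSL, the node-level grace check fires first (dropped-late-records), while the store-level retention check guards put() (expired-window-record-drop); in the PAPI only the store-level check applies.

```java
// Toy model (not Kafka internals) of the two lateness checks discussed.
public class LatenessChecks {
    static boolean pastGrace(long recordTs, long streamTime, long graceMs) {
        return recordTs < streamTime - graceMs;       // dropped at the node
    }
    static boolean pastRetention(long recordTs, long streamTime, long retentionMs) {
        return recordTs < streamTime - retentionMs;   // ignored by the store
    }

    public static void main(String[] args) {
        long streamTime = 1_000_000;
        long record = streamTime - 120_000; // record is 2 minutes late
        // DSL path: grace of 60s -> dropped at the node, never reaches the store.
        System.out.println(pastGrace(record, streamTime, 60_000));        // true
        // PAPI path: no grace check; retention of 300s -> the put() still succeeds.
        System.out.println(pastRetention(record, streamTime, 300_000));   // false
    }
}
```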

>
> > 4) Is there an actual difference between skipped and dropped records?
> > If not, shall we unify the terminology?
> >
> >
> There is. Dropped records are only due to lateness; whereas skipped
> records can be due to serde errors (when the user's error handling
> indicates "skip and continue"), timestamp errors, etc.
>
> I've considered that maybe a better (more extensible) way would be defining
> a single metric name, say skipped-records, but using different tags to
> indicate the skipping reason (errors, windowing semantics, etc). But
> there's still a tricky difference: for serde-caused skipping, for example,
> records are skipped at the very beginning and no effects are taken at all.
> For some others, e.g. a null key/value at the reduce operator, a record is
> only skipped in the middle of processing, i.e. some effects may have
> already been taken in up-stream sub-topologies. And that's why for
> skipped-records I've defined it on both task-level and node-level, and the
> aggregate of the latter may still be smaller than the former, whereas
> dropped-records is only defined on node-level.
>
> So how about an even more significant change then: we enlarge the
> `dropped-late-records` to 

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-08-19 Thread Guozhang Wang
Hi Bruno,


On Tue, Aug 6, 2019 at 1:51 AM Bruno Cadonna  wrote:

> Hi Guozhang,
>
> I left my comments inline.
>
> On Thu, Jul 18, 2019 at 8:28 PM Guozhang Wang  wrote:
> >
> > Hello Bruno,
> >
> > Thanks for the feedbacks, replied inline.
> >
> > On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna  wrote:
> >
> > > Hi Guozhang,
> > >
> > > Thank you for the KIP.
> > >
> > > 1) As far as I understand, the StreamsMetrics interface is there for
> > > user-defined processors. Would it make sense to also add a method to
> > > the interface to specify a sensor that records skipped records?
> > >
> > > Not sure I follow.. if users want to add a specific skipped records
> > sensor, she can still do that as a "throughput" sensor via "
> > addThroughputSensor" and then "record" right?
> >
> > As an after-thought, maybe it's better to rename `throughput` to `rate`
> in
> > the public APIs since it is really meant for the latter semantics. I did
> > not change it just to make less API changes / deprecate fewer functions.
> > But if we feel it is important we can change it as well.
> >
>
> I see now that a user can record the rate of skipped records. However,
> I was referring to the total number of skipped records. Maybe my
> question should be more general: should we allow the user to also
> specify sensors for totals or combinations of rate and totals?
>
> Sounds good to me, I will add it to the wiki page as well for
StreamsMetrics.



> Regarding the naming, I like `rate` more than `throughput`, but I
> would not fight for it.
>
> >
> > > 2) What are the semantics of active-task-process and
> standby-task-process
> > >
> > > Ah good catch, I think I made it in the wrong column. Just some
> > explanations here: Within a thread's looped iterations, it will first try
> > to process some records from the active tasks, and then see if there are
> > any standby-tasks that can be processed as well (i.e. just reading from
> the
> > restore consumer and apply to the local stores). The ratio metrics are
> for
> > indicating 1) what tasks (active or standby) does this thread own so far,
> > and 2) how much time in percentage does it spend on each of them.
> >
> > But this metric should really be a task-level one that includes both the
> > thread-id and task-id, and upon task migrations they will be dynamically
> > deleted / (re)-created. For each task-id it may be owned by multiple
> > threads as one active and others standby, and hence the separation of
> > active / standby seems still necessary.
> >
>
> Makes sense.
>
>
> >
> >
> > > 3) How do dropped-late-records and expired-window-record-drop relate
> > > to each other? I guess the former is for records that fall outside the
> > > grace period and the latter is for records that are processed after
> > > the retention period of the window. Is this correct?
> > >
> > > Yes, that's correct. The names are indeed a bit confusing since they
> are
> > added at different releases historically..
> >
> > More precisely, the `grace period` is a notion of the operator (hence the
> > metric is node-level, though it would only be used for DSL operators)
> while
> > the `retention` is a notion of the store (hence the metric is
> store-level).
> > Usually grace period will be smaller than store retention though.
> >
> > Processor node is aware of `grace period` and when received a record that
> > is older than grace deadline, it will be dropped immediately; otherwise
> it
> > will still be processed a maybe a new update is "put" into the store. The
> > store is aware of its `retention period` and then upon a "put" call if it
> > realized it is older than the retention deadline, that put call would be
> > ignored and metric is recorded.
> >
> > We have to separate them here since the window store can be used in both
> > DSL and PAPI, and for the former case it would likely to be already
> ignored
> > at the processor node level due to the grace period which is usually
> > smaller than retention; but for PAPI there's no grace period and hence
> the
> > processor would likely still process and call "put" on the store.
> >
>
> Alright! Got it!
>
> >
> > > 4) Is there an actual difference between skipped and dropped records?
> > > If not, shall we unify the terminology?
> > >
> > >
> > There is. Dropped records are only due to lateness; where as skipped
> > records can be due to serde errors (and user's error handling indicate
> > "skip and continue"), timestamp errors, etc.
> >
> > I've considered maybe a better (more extensible) way would be defining a
> > single metric name, say skipped-records, but use different tags to
> indicate
> > if its skipping reason (errors, windowing semantics, etc). But there's
> > still a tricky difference: for serde caused skipping for example, they
> will
> > be skipped at the very beginning and there's no effects taken at all. For
> > some others e.g. null-key / value at the reduce operator, it is only
> > skipped at the middle of the processing, i.e. some effect

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-08-19 Thread Guozhang Wang
Hi Bruno,

Just realized that for `addRateSensor` and `addLatencyAndRateSensor` we've
actually added the total invocation metric already.


Guozhang

On Mon, Aug 19, 2019 at 4:11 PM Guozhang Wang  wrote:

> Hi Bruno,
>
>
> On Tue, Aug 6, 2019 at 1:51 AM Bruno Cadonna  wrote:
>
>> Hi Guozhang,
>>
>> I left my comments inline.
>>
>> On Thu, Jul 18, 2019 at 8:28 PM Guozhang Wang  wrote:
>> >
>> > Hello Bruno,
>> >
>> > Thanks for the feedbacks, replied inline.
>> >
>> > On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
>> wrote:
>> >
>> > > Hi Guozhang,
>> > >
>> > > Thank you for the KIP.
>> > >
>> > > 1) As far as I understand, the StreamsMetrics interface is there for
>> > > user-defined processors. Would it make sense to also add a method to
>> > > the interface to specify a sensor that records skipped records?
>> > >
>> > > Not sure I follow.. if users want to add a specific skipped records
>> > sensor, she can still do that as a "throughput" sensor via "
>> > addThroughputSensor" and then "record" right?
>> >
>> > As an after-thought, maybe it's better to rename `throughput` to `rate`
>> in
>> > the public APIs since it is really meant for the latter semantics. I did
>> > not change it just to make less API changes / deprecate fewer functions.
>> > But if we feel it is important we can change it as well.
>> >
>>
>> I see now that a user can record the rate of skipped records. However,
>> I was referring to the total number of skipped records. Maybe my
>> question should be more general: should we allow the user to also
>> specify sensors for totals or combinations of rate and totals?
>>
>> Sounds good to me, I will add it to the wiki page as well for
> StreamsMetrics.
>
>
>
>> Regarding the naming, I like `rate` more than `throughput`, but I
>> would not fight for it.
>>
>> >
>> > > 2) What are the semantics of active-task-process and
>> standby-task-process
>> > >
>> > > Ah good catch, I think I made it in the wrong column. Just some
>> > explanations here: Within a thread's looped iterations, it will first
>> try
>> > to process some records from the active tasks, and then see if there are
>> > any standby-tasks that can be processed as well (i.e. just reading from
>> the
>> > restore consumer and apply to the local stores). The ratio metrics are
>> for
>> > indicating 1) what tasks (active or standby) does this thread own so
>> far,
>> > and 2) how much time in percentage does it spend on each of them.
>> >
>> > But this metric should really be a task-level one that includes both the
>> > thread-id and task-id, and upon task migrations they will be dynamically
>> > deleted / (re)-created. For each task-id it may be owned by multiple
>> > threads as one active and others standby, and hence the separation of
>> > active / standby seems still necessary.
>> >
>>
>> Makes sense.
>>
>>
>> >
>> >
>> > > 3) How do dropped-late-records and expired-window-record-drop relate
>> > > to each other? I guess the former is for records that fall outside the
>> > > grace period and the latter is for records that are processed after
>> > > the retention period of the window. Is this correct?
>> > >
>> > > Yes, that's correct. The names are indeed a bit confusing since they
>> are
>> > added at different releases historically..
>> >
>> > More precisely, the `grace period` is a notion of the operator (hence
>> the
>> > metric is node-level, though it would only be used for DSL operators)
>> while
>> > the `retention` is a notion of the store (hence the metric is
>> store-level).
>> > Usually grace period will be smaller than store retention though.
>> >
>> > Processor node is aware of `grace period` and when received a record
>> that
>> > is older than grace deadline, it will be dropped immediately; otherwise
>> it
>> > will still be processed a maybe a new update is "put" into the store.
>> The
>> > store is aware of its `retention period` and then upon a "put" call if
>> it
>> > realized it is older than the retention deadline, that put call would be
>> > ignored and metric is recorded.
>> >
>> > We have to separate them here since the window store can be used in both
>> > DSL and PAPI, and for the former case it would likely to be already
>> ignored
>> > at the processor node level due to the grace period which is usually
>> > smaller than retention; but for PAPI there's no grace period and hence
>> the
>> > processor would likely still process and call "put" on the store.
>> >
>>
>> Alright! Got it!
>>
>> >
>> > > 4) Is there an actual difference between skipped and dropped records?
>> > > If not, shall we unify the terminology?
>> > >
>> > >
>> > There is. Dropped records are only due to lateness; where as skipped
>> > records can be due to serde errors (and user's error handling indicate
>> > "skip and continue"), timestamp errors, etc.
>> >
>> > I've considered maybe a better (more extensible) way would be defining a
>> > single metric name, say skipped-records, but use different tags to
>> indicate
>> > if its skippi

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-08-20 Thread Bruno Cadonna
Hi Guozhang,

I totally missed the total invocation count metric in the javadoc.
Which brings me to a follow-up question. Should the names of the
methods reflect the included total invocation count? We have to rename
them anyways. One option would be to simply add `Total` to the method
names, i.e., `addLatencyAndRateAndTotalSensor` and
`addRateAndTotalSensor` (alternatively without the `And`s). Since
those sensors record exclusively invocations, another option would be
`addInvocationSensor` and `addInvocationSensorWithoutLatency`.

As far as I can see, we have sensors to record invocations but none to
record amounts. Is that intentional? No need to add it to this KIP, I
am just curious.

Best,
Bruno

On Tue, Aug 20, 2019 at 1:13 AM Guozhang Wang  wrote:
>
> Hi Bruno,
>
> Just realized that for `addRateSensor` and `addLatencyAndRateSensor` we've
> actually added the total invocation metric already.
>
>
> Guozhang
>
> On Mon, Aug 19, 2019 at 4:11 PM Guozhang Wang  wrote:
>
> > Hi Bruno,
> >
> >
> > On Tue, Aug 6, 2019 at 1:51 AM Bruno Cadonna  wrote:
> >
> >> Hi Guozhang,
> >>
> >> I left my comments inline.
> >>
> >> On Thu, Jul 18, 2019 at 8:28 PM Guozhang Wang  wrote:
> >> >
> >> > Hello Bruno,
> >> >
> >> > Thanks for the feedbacks, replied inline.
> >> >
> >> > On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
> >> wrote:
> >> >
> >> > > Hi Guozhang,
> >> > >
> >> > > Thank you for the KIP.
> >> > >
> >> > > 1) As far as I understand, the StreamsMetrics interface is there for
> >> > > user-defined processors. Would it make sense to also add a method to
> >> > > the interface to specify a sensor that records skipped records?
> >> > >
> >> > > Not sure I follow.. if users want to add a specific skipped records
> >> > sensor, she can still do that as a "throughput" sensor via "
> >> > addThroughputSensor" and then "record" right?
> >> >
> >> > As an after-thought, maybe it's better to rename `throughput` to `rate`
> >> in
> >> > the public APIs since it is really meant for the latter semantics. I did
> >> > not change it just to make less API changes / deprecate fewer functions.
> >> > But if we feel it is important we can change it as well.
> >> >
> >>
> >> I see now that a user can record the rate of skipped records. However,
> >> I was referring to the total number of skipped records. Maybe my
> >> question should be more general: should we allow the user to also
> >> specify sensors for totals or combinations of rate and totals?
> >>
> >> Sounds good to me, I will add it to the wiki page as well for
> > StreamsMetrics.
> >
> >
> >
> >> Regarding the naming, I like `rate` more than `throughput`, but I
> >> would not fight for it.
> >>
> >> >
> >> > > 2) What are the semantics of active-task-process and
> >> standby-task-process
> >> > >
> >> > > Ah good catch, I think I made it in the wrong column. Just some
> >> > explanations here: Within a thread's looped iterations, it will first
> >> try
> >> > to process some records from the active tasks, and then see if there are
> >> > any standby-tasks that can be processed as well (i.e. just reading from
> >> the
> >> > restore consumer and apply to the local stores). The ratio metrics are
> >> for
> >> > indicating 1) what tasks (active or standby) does this thread own so
> >> far,
> >> > and 2) how much time in percentage does it spend on each of them.
> >> >
> >> > But this metric should really be a task-level one that includes both the
> >> > thread-id and task-id, and upon task migrations they will be dynamically
> >> > deleted / (re)-created. For each task-id it may be owned by multiple
> >> > threads as one active and others standby, and hence the separation of
> >> > active / standby seems still necessary.
> >> >
> >>
> >> Makes sense.
> >>
> >>
> >> >
> >> >
> >> > > 3) How do dropped-late-records and expired-window-record-drop relate
> >> > > to each other? I guess the former is for records that fall outside the
> >> > > grace period and the latter is for records that are processed after
> >> > > the retention period of the window. Is this correct?
> >> > >
> >> > > Yes, that's correct. The names are indeed a bit confusing since they
> >> are
> >> > added at different releases historically..
> >> >
> >> > More precisely, the `grace period` is a notion of the operator (hence
> >> the
> >> > metric is node-level, though it would only be used for DSL operators)
> >> while
> >> > the `retention` is a notion of the store (hence the metric is
> >> store-level).
> >> > Usually grace period will be smaller than store retention though.
> >> >
> >> > Processor node is aware of `grace period` and when received a record
> >> that
> >> > is older than grace deadline, it will be dropped immediately; otherwise
> >> it
> >> > will still be processed a maybe a new update is "put" into the store.
> >> The
> >> > store is aware of its `retention period` and then upon a "put" call if
> >> it
> >> > realized it is older than the retention deadline, that put call would be
> 

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-08-20 Thread Guozhang Wang
Hi Bruno,

No, it was not intentional, and we can definitely add the total amount
sensor as well -- they are just util functions to save users some lines of
code anyways, and should be straightforward.

Guozhang


On Tue, Aug 20, 2019 at 1:05 AM Bruno Cadonna  wrote:

> Hi Guozhang,
>
> I totally missed the total invocation count metric in the javadoc.
> Which brings me to a follow-up question. Should the names of the
> methods reflect the included total invocation count? We have to rename
> them anyways. One option would be to simply add `Total` to the method
> names, i.e., `addLatencyAndRateAndTotalSensor` and
> `addRateAndTotalSensor` (alternatively without the `And`s). Since
> those sensors record exclusively invocations, another option would be
> `addInvocationSensor` and `addInvocationSensorWithoutLatency`.
>
> As far as I can see, we have sensors to record invocations but none to
> record amounts. Is that intentional? No need to add it to this KIP, I
> am just curious.
>
> Best,
> Bruno
>
> On Tue, Aug 20, 2019 at 1:13 AM Guozhang Wang  wrote:
> >
> > Hi Bruno,
> >
> > Just realized that for `addRateSensor` and `addLatencyAndRateSensor`
> we've
> > actually added the total invocation metric already.
> >
> >
> > Guozhang
> >
> > On Mon, Aug 19, 2019 at 4:11 PM Guozhang Wang 
> wrote:
> >
> > > Hi Bruno,
> > >
> > >
> > > On Tue, Aug 6, 2019 at 1:51 AM Bruno Cadonna 
> wrote:
> > >
> > >> Hi Guozhang,
> > >>
> > >> I left my comments inline.
> > >>
> > >> On Thu, Jul 18, 2019 at 8:28 PM Guozhang Wang 
> wrote:
> > >> >
> > >> > Hello Bruno,
> > >> >
> > >> > Thanks for the feedbacks, replied inline.
> > >> >
> > >> > On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
> > >> wrote:
> > >> >
> > >> > > Hi Guozhang,
> > >> > >
> > >> > > Thank you for the KIP.
> > >> > >
> > >> > > 1) As far as I understand, the StreamsMetrics interface is there
> for
> > >> > > user-defined processors. Would it make sense to also add a method
> to
> > >> > > the interface to specify a sensor that records skipped records?
> > >> > >
> > >> > > Not sure I follow.. if users want to add a specific skipped
> records
> > >> > sensor, she can still do that as a "throughput" sensor via "
> > >> > addThroughputSensor" and then "record" right?
> > >> >
> > >> > As an after-thought, maybe it's better to rename `throughput` to
> `rate`
> > >> in
> > >> > the public APIs since it is really meant for the latter semantics.
> I did
> > >> > not change it just to make less API changes / deprecate fewer
> functions.
> > >> > But if we feel it is important we can change it as well.
> > >> >
> > >>
> > >> I see now that a user can record the rate of skipped records. However,
> > >> I was referring to the total number of skipped records. Maybe my
> > >> question should be more general: should we allow the user to also
> > >> specify sensors for totals or combinations of rate and totals?
> > >>
> > >> Sounds good to me, I will add it to the wiki page as well for
> > > StreamsMetrics.
> > >
> > >
> > >
> > >> Regarding the naming, I like `rate` more than `throughput`, but I
> > >> would not fight for it.
> > >>
> > >> >
> > >> > > 2) What are the semantics of active-task-process and
> > >> standby-task-process
> > >> > >
> > >> > > Ah good catch, I think I made it in the wrong column. Just some
> > >> > explanations here: Within a thread's looped iterations, it will
> first
> > >> try
> > >> > to process some records from the active tasks, and then see if
> there are
> > >> > any standby-tasks that can be processed as well (i.e. just reading
> from
> > >> the
> > >> > restore consumer and apply to the local stores). The ratio metrics
> are
> > >> for
> > >> > indicating 1) what tasks (active or standby) does this thread own so
> > >> far,
> > >> > and 2) how much time in percentage does it spend on each of them.
> > >> >
> > >> > But this metric should really be a task-level one that includes
> both the
> > >> > thread-id and task-id, and upon task migrations they will be
> dynamically
> > >> > deleted / (re)-created. For each task-id it may be owned by multiple
> > >> > threads as one active and others standby, and hence the separation
> of
> > >> > active / standby seems still necessary.
> > >> >
> > >>
> > >> Makes sense.
> > >>
> > >>
> > >> >
> > >> >
> > >> > > 3) How do dropped-late-records and expired-window-record-drop
> relate
> > >> > > to each other? I guess the former is for records that fall
> outside the
> > >> > > grace period and the latter is for records that are processed
> after
> > >> > > the retention period of the window. Is this correct?
> > >> > >
> > >> > > Yes, that's correct. The names are indeed a bit confusing since
> they
> > >> are
> > >> > added at different releases historically..
> > >> >
> > >> > More precisely, the `grace period` is a notion of the operator
> (hence
> > >> the
> > >> > metric is node-level, though it would only be used for DSL
> operators)
> > >> while
> > >> > the `retention` is a notion of the

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-08-20 Thread Guozhang Wang
Hello Bruno,

I've updated the wiki page again per your comments; here's a brief summary:

1. added the list of removed metrics.
2. added a task-level INFO metric "dropped-records" that covers all
scenarios and merges in the existing "late-records-drop",
"skipped-records", and "expired-window-records-drop".
3. renamed the util functions of StreamsMetrics as `addLatencyRateTotal`
and `addRateTotal` sensors.
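For illustration, the merged task-level metric would surface in JMX roughly like this (a sketch only; the exact group and tag names are defined on the wiki page, not in this thread):

```
kafka.streams:type=stream-task-metrics,thread-id=<threadId>,task-id=<taskId>
    dropped-records-rate    (INFO)
    dropped-records-total   (INFO)
```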


Since I feel it has incorporated all of your comments I'm going to start
the vote thread for this KIP now.


Guozhang


On Tue, Aug 20, 2019 at 9:59 AM Guozhang Wang  wrote:

> Hi Bruno,
>
> No it was not intentional, and we can definitely add the total amount
> sensor as well -- they are just util functions to save users some lines of
> code anyways, and should be straightforward.
>
> Guozhang
>
>
> On Tue, Aug 20, 2019 at 1:05 AM Bruno Cadonna  wrote:
>
>> Hi Guozhang,
>>
>> I totally missed the total invocation count metric in the javadoc.
>> Which brings me to a follow-up question. Should the names of the
>> methods reflect the included total invocation count? We have to rename
>> them anyways. One option would be to simply add `Total` to the method
>> names, i.e., `addLatencyAndRateAndTotalSensor` and
>> `addRateAndTotalSensor` (alternatively without the `And`s). Since
>> those sensors record exclusively invocations, another option would be
>> `addInvocationSensor` and `addInvocationSensorWithoutLatency`.
>>
>> As far as I can see, we have sensors to record invocations but none to
>> record amounts. Is that intentional? No need to add it to this KIP, I
>> am just curious.
>>
>> Best,
>> Bruno
>>
>> On Tue, Aug 20, 2019 at 1:13 AM Guozhang Wang  wrote:
>> >
>> > Hi Bruno,
>> >
>> > Just realized that for `addRateSensor` and `addLatencyAndRateSensor`
>> we've
>> > actually added the total invocation metric already.
>> >
>> >
>> > Guozhang
>> >
>> > On Mon, Aug 19, 2019 at 4:11 PM Guozhang Wang 
>> wrote:
>> >
>> > > Hi Bruno,
>> > >
>> > >
>> > > On Tue, Aug 6, 2019 at 1:51 AM Bruno Cadonna 
>> wrote:
>> > >
>> > >> Hi Guozhang,
>> > >>
>> > >> I left my comments inline.
>> > >>
>> > >> On Thu, Jul 18, 2019 at 8:28 PM Guozhang Wang 
>> wrote:
>> > >> >
>> > >> > Hello Bruno,
>> > >> >
>> > >> > Thanks for the feedbacks, replied inline.
>> > >> >
>> > >> > On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
>> > >> wrote:
>> > >> >
>> > >> > > Hi Guozhang,
>> > >> > >
>> > >> > > Thank you for the KIP.
>> > >> > >
>> > >> > > 1) As far as I understand, the StreamsMetrics interface is there
>> for
>> > >> > > user-defined processors. Would it make sense to also add a
>> method to
>> > >> > > the interface to specify a sensor that records skipped records?
>> > >> > >
>> > >> > > Not sure I follow.. if users want to add a specific skipped
>> records
>> > >> > sensor, she can still do that as a "throughput" sensor via "
>> > >> > addThroughputSensor" and then "record" right?
>> > >> >
>> > >> > As an after-thought, maybe it's better to rename `throughput` to
>> `rate`
>> > >> in
>> > >> > the public APIs since it is really meant for the latter semantics.
>> I did
>> > >> > not change it just to make less API changes / deprecate fewer
>> functions.
>> > >> > But if we feel it is important we can change it as well.
>> > >> >
>> > >>
>> > >> I see now that a user can record the rate of skipped records.
>> However,
>> > >> I was referring to the total number of skipped records. Maybe my
>> > >> question should be more general: should we allow the user to also
>> > >> specify sensors for totals or combinations of rate and totals?
>> > >>
>> > >> Sounds good to me, I will add it to the wiki page as well for
>> > > StreamsMetrics.
>> > >
>> > >
>> > >
>> > >> Regarding the naming, I like `rate` more than `throughput`, but I
>> > >> would not fight for it.
>> > >>
>> > >> >
>> > >> > > 2) What are the semantics of active-task-process and
>> > >> standby-task-process
>> > >> > >
>> > >> > > Ah good catch, I think I made it in the wrong column. Just some
>> > >> > explanations here: Within a thread's looped iterations, it will
>> first
>> > >> try
>> > >> > to process some records from the active tasks, and then see if
>> there are
>> > >> > any standby-tasks that can be processed as well (i.e. just reading
>> from
>> > >> the
>> > >> > restore consumer and apply to the local stores). The ratio metrics
>> are
>> > >> for
>> > >> > indicating 1) what tasks (active or standby) does this thread own
>> so
>> > >> far,
>> > >> > and 2) how much time in percentage does it spend on each of them.
>> > >> >
>> > >> > But this metric should really be a task-level one that includes
>> both the
>> > >> > thread-id and task-id, and upon task migrations they will be
>> dynamically
>> > >> > deleted / (re)-created. For each task-id it may be owned by
>> multiple
>> > >> > threads as one active and others standby, and hence the separation
>> of
>> > >> > active / standby seems still necessary.
>> > >> >
>> > >>
>> > >> Makes sense.
>> 

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-09-04 Thread Bruno Cadonna
Hi,

I am sorry to restart the discussion here, but I came across a small
issue in the KIP.

I started to implement KIP-444 and I am a bit concerned about the values
for the config `built.in.metrics.version`. In the KIP the possible
values are specified as all Kafka Streams versions. I think that this
set of values is really hard to maintain in the code and it also blows
up the testing burden unnecessarily because all versions need to be
tested. My proposal (backed by John) is to use the following values:
- `latest` for the latest version of the metrics
- `0.10.0-2.3` for the version before `latest`
If in future, let's say in version 4.1, we need again to change the
metrics, we would add `2.4-4.0` to the values of the config. With
major versions, we could also get rid of some values.

WDYT?

You can also have a look at the PR
https://github.com/apache/kafka/pull/7279 to see this in code.
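For reference, the proposal would be configured like this (a sketch based on the values named in this thread; the final value set may differ):

```
# values proposed in this thread (subject to change):
built.in.metrics.version=latest        # use the latest (KIP-444) metrics
# built.in.metrics.version=0.10.0-2.3  # keep the pre-KIP-444 metrics structure
```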

Best,
Bruno

On Tue, Aug 20, 2019 at 8:29 PM Guozhang Wang  wrote:
>
> Hello Bruno,
>
> I've updated the wiki page again per your comments, here's a brief summary:
>
> 1. added the list of removed metrics.
> 2. added a task-level INFO metric "dropped-records" that covers all
> scenarios and merges in the existing "late-records-drop",
> "skipped-records", and "expired-window-records-drop".
> 3. renamed the util functions of StreamsMetrics as `addLatencyRateTotal`
> and `addRateTotal` sensors.
>
>
> Since I feel it has incorporated all of your comments I'm going to start
> the vote thread for this KIP now.
>
>
> Guozhang
>
>
> On Tue, Aug 20, 2019 at 9:59 AM Guozhang Wang  wrote:
>
> > Hi Bruno,
> >
> > No it was not intentional, and we can definitely add the total amount
> > sensor as well -- they are just util functions to save users some lines of
> > code anyways, and should be straightforward.
> >
> > Guozhang
> >
> >
> > On Tue, Aug 20, 2019 at 1:05 AM Bruno Cadonna  wrote:
> >
> >> Hi Guozhang,
> >>
> >> I totally missed the total invocation count metric in the javadoc.
> >> Which brings me to a follow-up question. Should the names of the
> >> methods reflect the included total invocation count? We have to rename
> >> them anyways. One option would be to simply add `Total` to the method
> >> names, i.e., `addLatencyAndRateAndTotalSensor` and
> >> `addRateAndTotalSensor` (alternatively without the `And`s). Since
> >> those sensors record exclusively invocations, another option would be
> >> `addInvocationSensor` and `addInvocationSensorWithoutLatency`.
> >>
> >> As far as I can see, we have sensors to record invocations but none to
> >> record amounts. Is that intentional? No need to add it to this KIP, I
> >> am just curious.
> >>
> >> Best,
> >> Bruno
> >>
> >> On Tue, Aug 20, 2019 at 1:13 AM Guozhang Wang  wrote:
> >> >
> >> > Hi Bruno,
> >> >
> >> > Just realized that for `addRateSensor` and `addLatencyAndRateSensor`
> >> we've
> >> > actually added the total invocation metric already.
> >> >
> >> >
> >> > Guozhang
> >> >
> >> > On Mon, Aug 19, 2019 at 4:11 PM Guozhang Wang 
> >> wrote:
> >> >
> >> > > Hi Bruno,
> >> > >
> >> > >
> >> > > On Tue, Aug 6, 2019 at 1:51 AM Bruno Cadonna 
> >> wrote:
> >> > >
> >> > >> Hi Guozhang,
> >> > >>
> >> > >> I left my comments inline.
> >> > >>
> >> > >> On Thu, Jul 18, 2019 at 8:28 PM Guozhang Wang 
> >> wrote:
> >> > >> >
> >> > >> > Hello Bruno,
> >> > >> >
> >> > >> > Thanks for the feedbacks, replied inline.
> >> > >> >
> >> > >> > On Mon, Jul 1, 2019 at 7:08 AM Bruno Cadonna 
> >> > >> wrote:
> >> > >> >
> >> > >> > > Hi Guozhang,
> >> > >> > >
> >> > >> > > Thank you for the KIP.
> >> > >> > >
> >> > >> > > 1) As far as I understand, the StreamsMetrics interface is there
> >> for
> >> > >> > > user-defined processors. Would it make sense to also add a
> >> method to
> >> > >> > > the interface to specify a sensor that records skipped records?
> >> > >> > >
> >> > >> > > Not sure I follow.. if users want to add a specific skipped
> >> records
> >> > >> > sensor, she can still do that as a "throughput" sensor via "
> >> > >> > addThroughputSensor" and then "record" right?
> >> > >> >
> >> > >> > As an after-thought, maybe it's better to rename `throughput` to
> >> `rate`
> >> > >> in
> >> > >> > the public APIs since it is really meant for the latter semantics.
> >> I did
> >> > >> > not change it just to make less API changes / deprecate fewer
> >> functions.
> >> > >> > But if we feel it is important we can change it as well.
> >> > >> >
> >> > >>
> >> > >> I see now that a user can record the rate of skipped records.
> >> However,
> >> > >> I was referring to the total number of skipped records. Maybe my
> >> > >> question should be more general: should we allow the user to also
> >> > >> specify sensors for totals or combinations of rate and totals?
> >> > >>
> >> > >> Sounds good to me, I will add it to the wiki page as well for
> >> > > StreamsMetrics.
> >> > >
> >> > >
> >> > >
> >> > >> Regarding the naming, I like `rate` more than `throughput`, but I
> >> > >> 

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-09-05 Thread Matthias J. Sax
I share Bruno's concern about future releases; however, I would make a
slightly different proposal.

Instead of using "latest", we can just make the config optional; if it is
not set, we use the new metrics code. This way we don't need to add a
new version number each time we do a new release (note that it would be
weird to keep the default value "2.4" in future releases).

For enabling backward compatibility: I don't have a strong opinion if we
should have a single value "0.10.0-2.3" or list each version individually.

In KIP-268 (fixing metadata upgrade) we decided to list each version
individually as it seems simpler for users. Also, we wanted to hide
which release uses which metadata version (v0 in 0.10.0, and v1 in
0.10.1 to 1.1). We could have collapsed `0.10.1` to `1.1` into a single
value, but it seemed not to give the best user experience.

I think this KIP is a little different though, and both options seem
valid. However, I would like to emphasize that we should optimize for
user experience (and not for whether it's harder/easier to test, etc.;
when in doubt, we should always take on the burden ourselves if it helps
to lift the burden from users).

Overall, I am +1

Some nits:

(1) I think the motivation section for updating the `StreamsMetrics`
interface does not make it clear why we need the change. What is the
issue with the current interface, and how do the new methods address it?

(2) The metric name `put | put-if-absent .. | get-latency (avg | max)`
is hard to read because it indicates that there is a `get-latency` method
call on stores -- can we update it to

`(put | put-if-absent .. | get)-latency (avg | max)`

(3) typo: `When users override it to "2.2" or below,` this should be
"2.3" -- or maybe even different if Bruno's concern gets addressed.




-Matthias





On 9/4/19 12:26 PM, Bruno Cadonna wrote:
> Hi,
> 
> I am sorry to restart the discussion here, but I came across a small
> issue in the KIP.
> 
> > I started to implement KIP-444 and I am a bit concerned about the values
> > for the config `built.in.metrics.version`. In the KIP the possible
> values are specified as all Kafka Streams versions. I think that this
> set of values is really hard to maintain in the code and it also blows
> up the testing burden unnecessarily because all versions need to be
> tested. My proposal (backed by John) is to use the following values:
> - `latest` for the latest version of the metrics
> - `0.10.0-2.3` for the version before `latest`
> If in future, let's say in version 4.1, we need again to change the
> metrics, we would add `2.4-4.0` to the values of the config. With
> major versions, we could also get rid of some values.
> 
> WDYT?
> 
> You can also have a look at the PR
> https://github.com/apache/kafka/pull/7279 to see this in code.
> 
> Best,
> Bruno
> 
> On Tue, Aug 20, 2019 at 8:29 PM Guozhang Wang  wrote:
>>
>> Hello Bruno,
>>
>> I've updated the wiki page again per your comments, here's a brief summary:
>>
>> 1. added the list of removed metrics.
>> 2. added a task-level INFO metric "dropped-records" that covers all
>> scenarios and merges in the existing "late-records-drop",
>> "skipped-records", and "expired-window-records-drop".
>> 3. renamed the util functions of StreamsMetrics as `addLatencyRateTotal`
>> and `addRateTotal` sensors.
>>
>>
>> Since I feel it has incorporated all of your comments I'm going to start
>> the vote thread for this KIP now.
>>
>>
>> Guozhang
>>
>>
>> On Tue, Aug 20, 2019 at 9:59 AM Guozhang Wang  wrote:
>>
>>> Hi Bruno,
>>>
>>> No it was not intentional, and we can definitely add the total amount
>>> sensor as well -- they are just util functions to save users some lines of
>>> code anyways, and should be straightforward.
>>>
>>> Guozhang
>>>
>>>
>>> On Tue, Aug 20, 2019 at 1:05 AM Bruno Cadonna  wrote:
>>>
 Hi Guozhang,

 I totally missed the total invocation count metric in the javadoc.
 Which brings me to a follow-up question. Should the names of the
 methods reflect the included total invocation count? We have to rename
 them anyways. One option would be to simply add `Total` to the method
 names, i.e., `addLatencyAndRateAndTotalSensor` and
 `addRateAndTotalSensor` (alternatively without the `And`s). Since
 those sensors record exclusively invocations, another option would be
 `addInvocationSensor` and `addInvocationSensorWithoutLatency`.

 As far as I can see, we have sensors to record invocations but none to
 record amounts. Is that intentional? No need to add it to this KIP, I
 am just curious.

 Best,
 Bruno

 On Tue, Aug 20, 2019 at 1:13 AM Guozhang Wang  wrote:
>
> Hi Bruno,
>
> Just realized that for `addRateSensor` and `addLatencyAndRateSensor`
 we've
> actually added the total invocation metric already.
>
>
> Guozhang
>
> On Mon, Aug 19, 2019 at 4:11 PM Guozhang Wang 
 wrote:
>
>> Hi Bruno,
>>
>>
>> On Tue, Aug 6, 2019 at 1:51 

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-09-05 Thread Guozhang Wang
Hi Bruno,

Thanks for raising this point. I think the main motivation behind this
proposal is, like Matthias said, to shift the understanding burden from
users onto our own shoulders. Testing-wise, I think we do not necessarily
need to explode the testing matrix, but can just test the last version
before each metrics refactoring (again, hopefully this is the only time),
and hence I think it is worth it for the benefit to users' experience. WDYT?

Hi Matthias,

Thanks for your feedback, I will update the wiki page accordingly.

Will also close the other voting thread with your vote.

Guozhang

On Thu, Sep 5, 2019 at 11:27 AM Matthias J. Sax 
wrote:

> I share Bruno's concern about future releases, however, I would make
> slightly different proposal.
>
> Instead of using "latest" we can just make the config optional and if
> not set, we use the new metrics code? This way we don't need to add a
> new version number each time we do a new release (note, that it would be
> weird to keep default value "2.4" in future releases).
>
> For enabling backward compatibility: I don't have a strong opinion if we
> should have a single value "0.10.0-2.3" or list each version individually.
>
> In KIP-268 (fixing metadata upgrade) we decided to list each version
> individually as it seems simpler for users. Also, we wanted to hide
> which release uses which metadata version (v0 in 0.10.0, and v1 in
> 0.10.1 to 1.1). We could have collapsed `0.10.1` to `1.1` into a single
> value though but it seemed not to give best user experience.
>
> I think this KIP is a little different though and both options seems to
> be valid. However, I would like to emphasize that we should optimize for
> user experience (and not if it's harder/easier to test etc---in doubt,
> we should always take on the burden if is helps to lift the burden from
> users).
>
> Overall, I am +1
>
> Some nits:
>
> (1) I think the motivation section for updating `StreamsMetrics`
> interface does not make it clear why we need the change. What is the
> issue with the current interface and how do the new method address the
> issue
>
> (2) The metric name `put | put-if-absent .. | get-latency (avg | max)`
> is hard to read because is indicate that there is a `get-latency` method
> call on stores -- can we update it to
>
> `(put | put-if-absent .. | get)-latency (avg | max)`
>
> (3) typo: `When users override it to "2.2" or below,` this should be
> "2.3" -- or maybe even different if Bruno's concern gets addressed.
>
>
>
>
> -Matthias
>
>
>
>
>
> On 9/4/19 12:26 PM, Bruno Cadonna wrote:
> > Hi,
> >
> > I am sorry to restart the discussion here, but I came across a small
> > issue in the KIP.
> >
> > I started to implement KIP-444 and I am bit concerned about the values
> > for the the config `built.in.metrics.version`. In the KIP the possible
> > values are specified as all Kafka Streams versions. I think that this
> > set of values is really hard to maintain in the code and it also blows
> > up the testing burden unnecessarily because all versions need to be
> > tested. My proposal (backed by John) is to use the following values:
> > - `latest` for the latest version of the metrics
> > - `0.10.0-2.3` for the version before `latest`
> > If in future, let's say in version 4.1, we need again to change the
> > metrics, we would add `2.4-4.0` to the values of the config. With
> > major versions, we could also get rid of some values.
> >
> > WDYT?
> >
> > You can also have a look at the PR
> > https://github.com/apache/kafka/pull/7279 to see this in code.
> >
> > Best,
> > Bruno
> >
> > On Tue, Aug 20, 2019 at 8:29 PM Guozhang Wang 
> wrote:
> >>
> >> Hello Bruno,
> >>
> >> I've updated the wiki page again per your comments, here's a brief
> summary:
> >>
> >> 1. added the list of removed metrics.
> >> 2. added a task-level INFO metric "dropped-records" that covers all
> >> scenarios and merges in the existing "late-records-drop",
> >> "skipped-records", and "expired-window-records-drop".
> >> 3. renamed the util functions of StreamsMetrics as `addLatencyRateTotal`
> >> and `addRateTotal` sensors.
> >>
> >>
> >> Since I feel it has incorporated all of your comments I'm going to start
> >> the vote thread for this KIP now.
> >>
> >>
> >> Guozhang
> >>
> >>
> >> On Tue, Aug 20, 2019 at 9:59 AM Guozhang Wang 
> wrote:
> >>
> >>> Hi Bruno,
> >>>
> >>> No it was not intentional, and we can definitely add the total amount
> >>> sensor as well -- they are just util functions to save users some
> lines of
> >>> code anyways, and should be straightforward.
> >>>
> >>> Guozhang
> >>>
> >>>
> >>> On Tue, Aug 20, 2019 at 1:05 AM Bruno Cadonna 
> wrote:
> >>>
>  Hi Guozhang,
> 
>  I totally missed the total invocation count metric in the javadoc.
>  Which brings me to a follow-up question. Should the names of the
>  methods reflect the included total invocation count? We have to rename
>  them anyways. One option would be to simply add `Total` to the method
>  names, i.e., `addLatencyA

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-09-05 Thread Bruno Cadonna
Hi Guozhang,

I think user experience and code maintenance are tightly related. The
harder the code is to maintain, the worse the user experience will get.

Making the config optional does not solve the issue. Wouldn't users be
puzzled when we release 2.5 and they cannot set
built.in.metrics.version to 2.4 to be sure to get the same metrics for
that version? It seems with that solution we would just move
maintenance to the next release.

I cannot see how the values `latest` and `0.10.0-2.3` would lead to a
bad user experience.

Regarding testing, at least on the integration-test level, we absolutely
need to test all versions. It is too easy to make a mistake with so
many versions. Remember that on the integration-test level we need to
start an embedded Kafka for every single test.

Best,
Bruno

On Thu, Sep 5, 2019 at 8:46 PM Guozhang Wang  wrote:
>
> Hi Bruno,
>
> Thanks for raising this point. I think the main motivation behind this
> proposal is, like Matthias said, to ease the understanding burden from
> users to our own shoulders. Testing wise, I think we do not necessarily
> need to explode the testing matrix but just test the last version before
> each metrics refactoring (again, hopefully it is the only time) and hence I
> think it worth benefiting user's experience. WDYT?
>
> Hi Matthias,
>
> Thanks for your feedback, I will update the wiki page accordingly.
>
> Will also close the other voting thread with your vote.
>
> Guozhang
>
> On Thu, Sep 5, 2019 at 11:27 AM Matthias J. Sax 
> wrote:
>
> > I share Bruno's concern about future releases, however, I would make
> > slightly different proposal.
> >
> > Instead of using "latest" we can just make the config optional and if
> > not set, we use the new metrics code? This way we don't need to add a
> > new version number each time we do a new release (note, that it would be
> > weird to keep default value "2.4" in future releases).
> >
> > For enabling backward compatibility: I don't have a strong opinion if we
> > should have a single value "0.10.0-2.3" or list each version individually.
> >
> > In KIP-268 (fixing metadata upgrade) we decided to list each version
> > individually as it seems simpler for users. Also, we wanted to hide
> > which release uses which metadata version (v0 in 0.10.0, and v1 in
> > 0.10.1 to 1.1). We could have collapsed `0.10.1` to `1.1` into a single
> > value though but it seemed not to give best user experience.
> >
> > I think this KIP is a little different though and both options seems to
> > be valid. However, I would like to emphasize that we should optimize for
> > user experience (and not if it's harder/easier to test etc---in doubt,
> > we should always take on the burden if is helps to lift the burden from
> > users).
> >
> > Overall, I am +1
> >
> > Some nits:
> >
> > (1) I think the motivation section for updating `StreamsMetrics`
> > interface does not make it clear why we need the change. What is the
> > issue with the current interface and how do the new method address the
> > issue
> >
> > (2) The metric name `put | put-if-absent .. | get-latency (avg | max)`
> > is hard to read because is indicate that there is a `get-latency` method
> > call on stores -- can we update it to
> >
> > `(put | put-if-absent .. | get)-latency (avg | max)`
> >
> > (3) typo: `When users override it to "2.2" or below,` this should be
> > "2.3" -- or maybe even different if Bruno's concern gets addressed.
> >
> >
> >
> >
> > -Matthias
> >
> >
> >
> >
> >
> > On 9/4/19 12:26 PM, Bruno Cadonna wrote:
> > > Hi,
> > >
> > > I am sorry to restart the discussion here, but I came across a small
> > > issue in the KIP.
> > >
> > > I started to implement KIP-444 and I am bit concerned about the values
> > > for the the config `built.in.metrics.version`. In the KIP the possible
> > > values are specified as all Kafka Streams versions. I think that this
> > > set of values is really hard to maintain in the code and it also blows
> > > up the testing burden unnecessarily because all versions need to be
> > > tested. My proposal (backed by John) is to use the following values:
> > > - `latest` for the latest version of the metrics
> > > - `0.10.0-2.3` for the version before `latest`
> > > If in future, let's say in version 4.1, we need again to change the
> > > metrics, we would add `2.4-4.0` to the values of the config. With
> > > major versions, we could also get rid of some values.
> > >
> > > WDYT?
> > >
> > > You can also have a look at the PR
> > > https://github.com/apache/kafka/pull/7279 to see this in code.
> > >
> > > Best,
> > > Bruno
> > >
> > > On Tue, Aug 20, 2019 at 8:29 PM Guozhang Wang 
> > wrote:
> > >>
> > >> Hello Bruno,
> > >>
> > >> I've updated the wiki page again per your comments, here's a brief
> > summary:
> > >>
> > >> 1. added the list of removed metrics.
> > >> 2. added a task-level INFO metric "dropped-records" that covers all
> > >> scenarios and merges in the existing "late-records-drop",
> > >> "skipped-records", and "expi

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-09-05 Thread Guozhang Wang
Hello Bruno,

I think your concern makes sense, let's adopt this suggestion in KIP-444
instead. Just to clarify:

1. The default value would be "latest".
2. The only other valid value is "0.10.0-2.3".

And moving forward, this config may stay as it is without any new values.
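To illustrate, pinning the backward-compatible metrics once the config lands
might look like the following sketch. Only the config key string comes from
the KIP; everything else, including actually passing the properties to the
`KafkaStreams` constructor, is assumed and not shown.

```java
import java.util.Properties;

public class PinOldMetrics {
    // Build Streams properties that keep the pre-KIP-444 metrics hierarchy.
    static Properties streamsConfig() {
        Properties props = new Properties();
        // The default would be "latest" (the new metrics); override it to
        // keep the backward-compatible hierarchy:
        props.setProperty("built.in.metrics.version", "0.10.0-2.3");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsConfig().getProperty("built.in.metrics.version"));
    }
}
```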


Guozhang


On Thu, Sep 5, 2019 at 12:16 PM Bruno Cadonna  wrote:

> Hi Guozhang,
>
> I think user experience and code maintenance are tightly related. The
> harder to maintain the code the worse the user experience will get.
>
> Making the config optional does not solve the issue. Wouldn't users be
> puzzled when we release 2.5 and they cannot set
> built.in.metrics.version to 2.4 to be sure to get the same metrics for
> that version? It seems with that solution we would just move
> maintenance to the next release.
>
> I cannot see how the values `latest` and `0.10.0-2.3` would lead to a
> bad user experience.
>
> Regarding testing, at least on integration test level, we absolutely
> need to test all versions. It is too easy to make a mistake with so
> many versions. Remember that on integration test level we need to
> start an embedded Kafka for each single test.
>
> Best,
> Bruno
>
> On Thu, Sep 5, 2019 at 8:46 PM Guozhang Wang  wrote:
> >
> > Hi Bruno,
> >
> > Thanks for raising this point. I think the main motivation behind this
> > proposal is, like Matthias said, to ease the understanding burden from
> > users to our own shoulders. Testing wise, I think we do not necessarily
> > need to explode the testing matrix but just test the last version before
> > each metrics refactoring (again, hopefully it is the only time) and
> hence I
> > think it worth benefiting user's experience. WDYT?
> >
> > Hi Matthias,
> >
> > Thanks for your feedback, I will update the wiki page accordingly.
> >
> > Will also close the other voting thread with your vote.
> >
> > Guozhang
> >
> > On Thu, Sep 5, 2019 at 11:27 AM Matthias J. Sax 
> > wrote:
> >
> > > I share Bruno's concern about future releases, however, I would make
> > > slightly different proposal.
> > >
> > > Instead of using "latest" we can just make the config optional and if
> > > not set, we use the new metrics code? This way we don't need to add a
> > > new version number each time we do a new release (note, that it would
> be
> > > weird to keep default value "2.4" in future releases).
> > >
> > > For enabling backward compatibility: I don't have a strong opinion if
> we
> > > should have a single value "0.10.0-2.3" or list each version
> individually.
> > >
> > > In KIP-268 (fixing metadata upgrade) we decided to list each version
> > > individually as it seems simpler for users. Also, we wanted to hide
> > > which release uses which metadata version (v0 in 0.10.0, and v1 in
> > > 0.10.1 to 1.1). We could have collapsed `0.10.1` to `1.1` into a single
> > > value though but it seemed not to give best user experience.
> > >
> > > I think this KIP is a little different though and both options seems to
> > > be valid. However, I would like to emphasize that we should optimize
> for
> > > user experience (and not if it's harder/easier to test etc---in doubt,
> > > we should always take on the burden if is helps to lift the burden from
> > > users).
> > >
> > > Overall, I am +1
> > >
> > > Some nits:
> > >
> > > (1) I think the motivation section for updating `StreamsMetrics`
> > > interface does not make it clear why we need the change. What is the
> > > issue with the current interface and how do the new method address the
> > > issue
> > >
> > > (2) The metric name `put | put-if-absent .. | get-latency (avg | max)`
> > > is hard to read because is indicate that there is a `get-latency`
> method
> > > call on stores -- can we update it to
> > >
> > > `(put | put-if-absent .. | get)-latency (avg | max)`
> > >
> > > (3) typo: `When users override it to "2.2" or below,` this should be
> > > "2.3" -- or maybe even different if Bruno's concern gets addressed.
> > >
> > >
> > >
> > >
> > > -Matthias
> > >
> > >
> > >
> > >
> > >
> > > On 9/4/19 12:26 PM, Bruno Cadonna wrote:
> > > > Hi,
> > > >
> > > > I am sorry to restart the discussion here, but I came across a small
> > > > issue in the KIP.
> > > >
> > > > I started to implement KIP-444 and I am bit concerned about the
> values
> > > > for the the config `built.in.metrics.version`. In the KIP the
> possible
> > > > values are specified as all Kafka Streams versions. I think that this
> > > > set of values is really hard to maintain in the code and it also
> blows
> > > > up the testing burden unnecessarily because all versions need to be
> > > > tested. My proposal (backed by John) is to use the following values:
> > > > - `latest` for the latest version of the metrics
> > > > - `0.10.0-2.3` for the version before `latest`
> > > > If in future, let's say in version 4.1, we need again to change the
> > > > metrics, we would add `2.4-4.0` to the values of the config. With
> > > > major versions, we could also get rid of some values.
> > > >
> > > 

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-09-05 Thread Bruno Cadonna
Hi Guozhang,

Your summary corresponds to my proposal.

A new value would only be added if in the future we change the metrics in
a backward-incompatible way, i.e., 2.4-<version before the next breaking
change>. "latest" will always stay the default.
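As a sketch of this value scheme (illustrative only, not the actual Kafka
implementation), validation would accept exactly "latest" plus one range
value per past breaking change, which today amounts to just two values:

```java
import java.util.Arrays;
import java.util.List;

public class MetricsVersionValues {
    // Hypothetical value set: "latest" plus the single pre-breaking-change range.
    static final List<String> VALID = Arrays.asList("latest", "0.10.0-2.3");

    // Reject anything outside the small, fixed value set.
    static String validate(String value) {
        if (!VALID.contains(value)) {
            throw new IllegalArgumentException(
                "Unknown built.in.metrics.version: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(validate("latest"));
        System.out.println(validate("0.10.0-2.3"));
    }
}
```

A new range value (e.g. "2.4-4.0") would only ever be appended to the list if
the metrics break compatibility again.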

Best,
Bruno

On Thu, Sep 5, 2019 at 10:57 PM Guozhang Wang  wrote:
>
> Hello Bruno,
>
> I think your concern makes sense, let's adopt this suggestion in KIP-444
> instead. Just to clarify:
>
> 1. The default value would be "latest".
> 2. The only other valid value is "0.10.0-2.3".
>
> And moving forward this config may stay without any new values.
>
>
> Guozhang
>
>
> On Thu, Sep 5, 2019 at 12:16 PM Bruno Cadonna  wrote:
>
> > Hi Guozhang,
> >
> > I think user experience and code maintenance are tightly related. The
> > harder to maintain the code the worse the user experience will get.
> >
> > Making the config optional does not solve the issue. Wouldn't users be
> > puzzled when we release 2.5 and they cannot set
> > built.in.metrics.version to 2.4 to be sure to get the same metrics for
> > that version? It seems with that solution we would just move
> > maintenance to the next release.
> >
> > I cannot see how the values `latest` and `0.10.0-2.3` would lead to a
> > bad user experience.
> >
> > Regarding testing, at least on integration test level, we absolutely
> > need to test all versions. It is too easy to make a mistake with so
> > many versions. Remember that on integration test level we need to
> > start an embedded Kafka for each single test.
> >
> > Best,
> > Bruno
> >
> > On Thu, Sep 5, 2019 at 8:46 PM Guozhang Wang  wrote:
> > >
> > > Hi Bruno,
> > >
> > > Thanks for raising this point. I think the main motivation behind this
> > > proposal is, like Matthias said, to ease the understanding burden from
> > > users to our own shoulders. Testing wise, I think we do not necessarily
> > > need to explode the testing matrix but just test the last version before
> > > each metrics refactoring (again, hopefully it is the only time) and
> > hence I
> > > think it worth benefiting user's experience. WDYT?
> > >
> > > Hi Matthias,
> > >
> > > Thanks for your feedback, I will update the wiki page accordingly.
> > >
> > > Will also close the other voting thread with your vote.
> > >
> > > Guozhang
> > >
> > > On Thu, Sep 5, 2019 at 11:27 AM Matthias J. Sax 
> > > wrote:
> > >
> > > > I share Bruno's concern about future releases, however, I would make
> > > > slightly different proposal.
> > > >
> > > > Instead of using "latest" we can just make the config optional and if
> > > > not set, we use the new metrics code? This way we don't need to add a
> > > > new version number each time we do a new release (note, that it would
> > be
> > > > weird to keep default value "2.4" in future releases).
> > > >
> > > > For enabling backward compatibility: I don't have a strong opinion if
> > we
> > > > should have a single value "0.10.0-2.3" or list each version
> > individually.
> > > >
> > > > In KIP-268 (fixing metadata upgrade) we decided to list each version
> > > > individually as it seems simpler for users. Also, we wanted to hide
> > > > which release uses which metadata version (v0 in 0.10.0, and v1 in
> > > > 0.10.1 to 1.1). We could have collapsed `0.10.1` to `1.1` into a single
> > > > value though but it seemed not to give best user experience.
> > > >
> > > > I think this KIP is a little different though and both options seems to
> > > > be valid. However, I would like to emphasize that we should optimize
> > for
> > > > user experience (and not if it's harder/easier to test etc---in doubt,
> > > > we should always take on the burden if is helps to lift the burden from
> > > > users).
> > > >
> > > > Overall, I am +1
> > > >
> > > > Some nits:
> > > >
> > > > (1) I think the motivation section for updating `StreamsMetrics`
> > > > interface does not make it clear why we need the change. What is the
> > > > issue with the current interface and how do the new method address the
> > > > issue
> > > >
> > > > (2) The metric name `put | put-if-absent .. | get-latency (avg | max)`
> > > > is hard to read because is indicate that there is a `get-latency`
> > method
> > > > call on stores -- can we update it to
> > > >
> > > > `(put | put-if-absent .. | get)-latency (avg | max)`
> > > >
> > > > (3) typo: `When users override it to "2.2" or below,` this should be
> > > > "2.3" -- or maybe even different if Bruno's concern gets addressed.
> > > >
> > > >
> > > >
> > > >
> > > > -Matthias
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 9/4/19 12:26 PM, Bruno Cadonna wrote:
> > > > > Hi,
> > > > >
> > > > > I am sorry to restart the discussion here, but I came across a small
> > > > > issue in the KIP.
> > > > >
> > > > > I started to implement KIP-444 and I am bit concerned about the
> > values
> > > > > for the the config `built.in.metrics.version`. In the KIP the
> > possible
> > > > > values are specified as all Kafka Streams versions. I think that this
> > > > > set of values is really hard to maintain

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-09-05 Thread John Roesler
Thanks, all.

FWIW, the most recent formulation from Guozhang + Bruno's addendum would
have my support.

Thanks,
-John

On Thu, Sep 5, 2019 at 4:05 PM Bruno Cadonna  wrote:

> Hi Guozhang,
>
> Your summary corresponds to my proposal.
>
> A new value would only be added if in future we change the metrics in
> a backward-incompatible way, i.e., 2.4-<version before the next breaking change>. "latest" will always stay the default.
>
> Best,
> Bruno
>
> On Thu, Sep 5, 2019 at 10:57 PM Guozhang Wang  wrote:
> >
> > Hello Bruno,
> >
> > I think your concern makes sense, let's adopt this suggestion in KIP-444
> > instead. Just to clarify:
> >
> > 1. The default value would be "latest".
> > 2. The only other valid value is "0.10.0-2.3".
> >
> > And moving forward this config may stay without any new values.
> >
> >
> > Guozhang
> >
> >
> > On Thu, Sep 5, 2019 at 12:16 PM Bruno Cadonna 
> wrote:
> >
> > > Hi Guozhang,
> > >
> > > I think user experience and code maintenance are tightly related. The
> > > harder to maintain the code the worse the user experience will get.
> > >
> > > Making the config optional does not solve the issue. Wouldn't users be
> > > puzzled when we release 2.5 and they cannot set
> > > built.in.metrics.version to 2.4 to be sure to get the same metrics for
> > > that version? It seems with that solution we would just move
> > > maintenance to the next release.
> > >
> > > I cannot see how the values `latest` and `0.10.0-2.3` would lead to a
> > > bad user experience.
> > >
> > > Regarding testing, at least on integration test level, we absolutely
> > > need to test all versions. It is too easy to make a mistake with so
> > > many versions. Remember that on integration test level we need to
> > > start an embedded Kafka for each single test.
> > >
> > > Best,
> > > Bruno
> > >
> > > On Thu, Sep 5, 2019 at 8:46 PM Guozhang Wang 
> wrote:
> > > >
> > > > Hi Bruno,
> > > >
> > > > Thanks for raising this point. I think the main motivation behind
> this
> > > > proposal is, like Matthias said, to ease the understanding burden
> from
> > > > users to our own shoulders. Testing wise, I think we do not
> necessarily
> > > > need to explode the testing matrix but just test the last version
> before
> > > > each metrics refactoring (again, hopefully it is the only time) and
> > > hence I
> > > > think it worth benefiting user's experience. WDYT?
> > > >
> > > > Hi Matthias,
> > > >
> > > > Thanks for your feedback, I will update the wiki page accordingly.
> > > >
> > > > Will also close the other voting thread with your vote.
> > > >
> > > > Guozhang
> > > >
> > > > On Thu, Sep 5, 2019 at 11:27 AM Matthias J. Sax <
> matth...@confluent.io>
> > > > wrote:
> > > >
> > > > > I share Bruno's concern about future releases, however, I would
> make
> > > > > slightly different proposal.
> > > > >
> > > > > Instead of using "latest" we can just make the config optional and
> if
> > > > > not set, we use the new metrics code? This way we don't need to
> add a
> > > > > new version number each time we do a new release (note, that it
> would
> > > be
> > > > > weird to keep default value "2.4" in future releases).
> > > > >
> > > > > For enabling backward compatibility: I don't have a strong opinion
> if
> > > we
> > > > > should have a single value "0.10.0-2.3" or list each version
> > > individually.
> > > > >
> > > > > In KIP-268 (fixing metadata upgrade) we decided to list each
> version
> > > > > individually as it seems simpler for users. Also, we wanted to hide
> > > > > which release uses which metadata version (v0 in 0.10.0, and v1 in
> > > > > 0.10.1 to 1.1). We could have collapsed `0.10.1` to `1.1` into a
> single
> > > > > value though but it seemed not to give best user experience.
> > > > >
> > > > > I think this KIP is a little different though and both options
> seems to
> > > > > be valid. However, I would like to emphasize that we should
> optimize
> > > for
> > > > > user experience (and not if it's harder/easier to test etc---in
> doubt,
> > > > > we should always take on the burden if is helps to lift the burden
> from
> > > > > users).
> > > > >
> > > > > Overall, I am +1
> > > > >
> > > > > Some nits:
> > > > >
> > > > > (1) I think the motivation section for updating `StreamsMetrics`
> > > > > interface does not make it clear why we need the change. What is
> the
> > > > > issue with the current interface and how do the new method address
> the
> > > > > issue
> > > > >
> > > > > (2) The metric name `put | put-if-absent .. | get-latency (avg |
> max)`
> > > > > is hard to read because is indicate that there is a `get-latency`
> > > method
> > > > > call on stores -- can we update it to
> > > > >
> > > > > `(put | put-if-absent .. | get)-latency (avg | max)`
> > > > >
> > > > > (3) typo: `When users override it to "2.2" or below,` this should
> be
> > > > > "2.3" -- or maybe even different if Bruno's concern gets addressed.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -Matthias
> > > > >
> > > > >
> > > > >

Re: [DISCUSS] KIP-444: Augment Metrics for Kafka Streams

2019-09-09 Thread Matthias J. Sax
Works for me.

(Btw: I did not really vote, just stated my overall support :P )

On 9/5/19 2:50 PM, John Roesler wrote:
> Thanks, all.
> 
> FWIW, the most recent formulation from Guozhang + Bruno's addendum would
> have my support.
> 
> Thanks,
> -John
> 
> On Thu, Sep 5, 2019 at 4:05 PM Bruno Cadonna  wrote:
> 
>> Hi Guozhang,
>>
>> Your summary corresponds to my proposal.
>>
>> A new value would only be added if in future we change the metrics in
>> a backward-incompatible way, i.e., 2.4-<version before the next breaking change>. "latest" will always stay the default.
>>
>> Best,
>> Bruno
>>
>> On Thu, Sep 5, 2019 at 10:57 PM Guozhang Wang  wrote:
>>>
>>> Hello Bruno,
>>>
>>> I think your concern makes sense, let's adopt this suggestion in KIP-444
>>> instead. Just to clarify:
>>>
>>> 1. The default value would be "latest".
>>> 2. The only other valid value is "0.10.0-2.3".
>>>
>>> And moving forward this config may stay without any new values.
>>>
>>>
>>> Guozhang
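
To make the summarized semantics concrete -- default "latest" when unset,
with "0.10.0-2.3" as the only other accepted value -- here is a minimal
sketch of how such a config could be read and validated. The constant and
the `resolve` helper are illustrative assumptions, not the actual Kafka
Streams implementation.

```java
import java.util.Properties;
import java.util.Set;

// Illustrative sketch only: the config name matches the KIP discussion,
// but this class and its validation logic are assumptions, not the
// actual Kafka Streams code.
public class BuiltInMetricsVersionDemo {
    static final String BUILT_IN_METRICS_VERSION = "built.in.metrics.version";
    static final Set<String> VALID_VALUES = Set.of("latest", "0.10.0-2.3");

    // Resolve the effective metrics version: default to "latest" when
    // unset, reject anything outside the two accepted values.
    static String resolve(Properties props) {
        String value = props.getProperty(BUILT_IN_METRICS_VERSION, "latest");
        if (!VALID_VALUES.contains(value)) {
            throw new IllegalArgumentException(
                "Invalid " + BUILT_IN_METRICS_VERSION + ": " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolve(props));   // prints "latest" (default)
        props.setProperty(BUILT_IN_METRICS_VERSION, "0.10.0-2.3");
        System.out.println(resolve(props));   // prints "0.10.0-2.3"
    }
}
```

Under this scheme, a future breaking change would only require adding one
more range value to the accepted set, while "latest" stays the default.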
>>>
>>>
>>> On Thu, Sep 5, 2019 at 12:16 PM Bruno Cadonna wrote:
>>>
 Hi Guozhang,

 I think user experience and code maintenance are tightly related. The
 harder the code is to maintain, the worse the user experience will get.

 Making the config optional does not solve the issue. Wouldn't users be
 puzzled when we release 2.5 and they cannot set
 built.in.metrics.version to 2.4 to be sure to get the same metrics for
 that version? It seems with that solution we would just move
 maintenance to the next release.

 I cannot see how the values `latest` and `0.10.0-2.3` would lead to a
 bad user experience.

 Regarding testing, at least at the integration-test level, we
 absolutely need to test all versions. It is too easy to make a mistake
 with so many versions. Remember that at the integration-test level we
 need to start an embedded Kafka for each single test.

 Best,
 Bruno

 On Thu, Sep 5, 2019 at 8:46 PM Guozhang Wang wrote:
>
> Hi Bruno,
>
> Thanks for raising this point. I think the main motivation behind this
> proposal is, like Matthias said, to ease the understanding burden from
> users to our own shoulders. Testing-wise, I think we do not necessarily
> need to explode the testing matrix but can just test the last version
> before each metrics refactoring (again, hopefully this is the only
> time), and hence I think it is worth it for the user experience. WDYT?
>
> Hi Matthias,
>
> Thanks for your feedback, I will update the wiki page accordingly.
>
> Will also close the other voting thread with your vote.
>
> Guozhang
>
> On Thu, Sep 5, 2019 at 11:27 AM Matthias J. Sax <matth...@confluent.io> wrote:
>
>> I share Bruno's concern about future releases; however, I would make a
>> slightly different proposal.
>>
>> Instead of using "latest", we can just make the config optional and if
>> not set, we use the new metrics code. This way we don't need to add a
>> new version number each time we do a new release (note that it would
>> be weird to keep default value "2.4" in future releases).
>>
>> For enabling backward compatibility: I don't have a strong opinion on
>> whether we should have a single value "0.10.0-2.3" or list each
>> version individually.
>>
>> In KIP-268 (fixing the metadata upgrade) we decided to list each
>> version individually, as it seemed simpler for users. Also, we wanted
>> to hide which release uses which metadata version (v0 in 0.10.0, and
>> v1 in 0.10.1 to 1.1). We could have collapsed `0.10.1` through `1.1`
>> into a single value, but it seemed not to give the best user
>> experience.
>>
>> I think this KIP is a little different, though, and both options seem
>> to be valid. However, I would like to emphasize that we should
>> optimize for user experience (and not for whether it's harder or
>> easier to test, etc. -- in doubt, we should always take on the burden
>> if it helps to lift the burden from users).
>>
>> Overall, I am +1
>>
>> Some nits:
>>
>> (1) I think the motivation section for updating `StreamsMetrics`
>> interface does not make it clear why we need the change. What is the
>> issue with the current interface, and how does the new method address
>> the issue?
>>
>> (2) The metric name `put | put-if-absent .. | get-latency (avg | max)`
>> is hard to read because it indicates that there is a `get-latency`
>> method
>> call on stores -- can we update it to
>>
>> `(put | put-if-absent .. | get)-latency (avg | max)`
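
A tiny sketch of how the suggested `(op)-latency-(stat)` scheme reads when
enumerated per store operation. The operation and statistic names come
from the discussion above; the helper itself is hypothetical, not Streams
code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: builds names of the form "<op>-latency-<stat>",
// which avoids the ambiguous "put | ... | get-latency" rendering.
public class MetricNameSketch {
    static List<String> latencyMetricNames(List<String> ops, List<String> stats) {
        List<String> names = new ArrayList<>();
        for (String op : ops) {
            for (String stat : stats) {
                // e.g. "put-if-absent-latency-avg"
                names.add(op + "-latency-" + stat);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(latencyMetricNames(
            List.of("put", "put-if-absent", "get"),
            List.of("avg", "max")));
    }
}
```

With this naming, every per-operation latency metric is unambiguous, e.g.
`get-latency-avg` and `get-latency-max` for the `get` operation.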
>>
>> (3) typo: `When users override it to "2.2" or below,` this should be
>> "2.3" -- or maybe even different if Bruno's concern gets addressed.
>>
>>
>>
>>
>> -Matthias
>>
>>
>>
>>
>>
>> On 9/4/19 12:26 PM,