Yeah, the specific reason was just to align with the current metrics.

Is it better to conform than to be right? History has a lot to say on that
matter, but I'm not sure how much of it applies to the fine details of
metrics naming :P

More seriously, I figured if people are looking at this metric, they're
likely to be looking at all the others. In that case, naming this one
"-mean" would probably lead some to conclude that the "-avg" suffix in the
other metrics has a different meaning.

As for the percentiles, I actually like p99 (and p75) better. I'll swap
that out.
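
For context, a rough sketch of how the renamed metrics could be read off
KafkaStreams#metrics(). This is purely illustrative: it assumes the
record-staleness names from the updated proposal below with the p99/p75
suffixes, and nothing here (class name, filtering, output format) is
specified by the KIP.

    // Hedged sketch: dump whatever record-staleness metrics are registered.
    // Assumed names: record-staleness-min/-avg/-p75/-p99/-max; the group and
    // tags printed are whatever Streams registers, nothing is defined here.
    import org.apache.kafka.streams.KafkaStreams;

    public final class StalenessMetricsSketch {
        public static void printStaleness(final KafkaStreams streams) {
            streams.metrics().forEach((name, metric) -> {
                if (name.name().startsWith("record-staleness")) {
                    System.out.println(name.group() + " " + name.tags() + " "
                            + name.name() + " = " + metric.metricValue());
                }
            });
        }
    }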

On Wed, May 13, 2020 at 7:07 PM John Roesler <vvcep...@apache.org> wrote:

> Thanks Sophie,
>
> I hope this isn't too nit-picky, but is there a reason to choose "avg"
> instead of "mean"? Maybe this is too paranoid, and I might be
> oversensitive because of the mistake I just made earlier, but it strikes
> me that "avg" is actually ambiguous, as it refers to a family of
> statistics, whereas "mean" is specific. I see other Kafka metrics with
> "avg", but none with "mean"; was that the reason? If so, I'm +1.
>
> Regarding the names of the percentiles, I actually couldn't find _any_
> other metrics that use percentiles. Was there a reason to choose "99th"
> as opposed to "p99" or any other scheme? This is not a criticism, I'm
> just primarily asking for consistency's sake.
>
> Thanks again,
> -John
>
> On Wed, May 13, 2020, at 19:19, Sophie Blee-Goldman wrote:
> > Alright, I can get behind adding the min metric for the sake of pretty
> > graphs (and trivial computation).
> >
> > I'm still on the fence regarding the mean (or 50th percentile) but I can
> > see how users might expect it and find it a bit disorienting not to
> > have. So the updated proposed metrics are
> >
> >
> >    - record-staleness-max [ms]
> >    - record-staleness-99th [ms] *(99th percentile)*
> >    - record-staleness-75th [ms] *(75th percentile)*
> >    - record-staleness-avg [ms] *(mean)*
> >    - record-staleness-min [ms]
> >
> >
> > On Wed, May 13, 2020 at 4:42 PM John Roesler <vvcep...@apache.org>
> wrote:
> >
> > > Oh boy, I never miss an opportunity to embarrass myself. I guess the
> > > mean seems more interesting to me than the median, but neither is as
> > > interesting as the higher percentiles (99th and max).
> > >
> > > Min isn’t really important for any SLAs, but it does round out the
> > > mental picture of the distribution. I’ve always graphed min along with
> > > the other metrics to help me understand how fast the system can be,
> > > which helps in optimization decisions. It’s also a relatively
> > > inexpensive metric to compute, so it might be nice to just throw it in.
> > >
> > > On Wed, May 13, 2020, at 18:18, Sophie Blee-Goldman wrote:
> > > > G1:
> > > > I was considering it as the "end-to-end latency *up* to the specific
> > > > task" but I'm happy with "record-staleness" if that drives the point
> > > > home better. So it's the "staleness of the record when it is received
> > > > by that task" -- will update the KIP
> > > >
> > > > B1/J:
> > > > I'm struggling to imagine a case where the min would actually be
> > > > useful, rather than just intellectually interesting. I don't feel
> > > > strongly that we shouldn't add it, but that's why I didn't include it
> > > > from the start. Can you enlighten me with an example?
> > > >
> > > > I was also vaguely concerned about the overhead of adding multiple
> > > > percentile metrics. Do we have any data to indicate what kind of
> > > > performance hit we take on metrics computation?
> > > >
> > > > Also, not to be too pedantic, but the 50th percentile would be the
> > > > median, not the mean. Would you propose to add the mean *and* the
> > > > 50th percentile, or just one of the two?
> > > >
> > > > Thanks all!
> > > > Sophie
> > > >
> > > > On Wed, May 13, 2020 at 3:34 PM John Roesler <vvcep...@apache.org>
> > > wrote:
> > > >
> > > > > Hello all, and thanks for the KIP, Sophie,
> > > > >
> > > > > Just some comments on the discussion so far:
> > > > >
> > > > > B2/G1:
> > > > > In principle, it shouldn't matter whether we report "spans" or
> > > > > "end-to-end" latency. But in practice, some of the spans are pretty
> > > > > difficult to really measure (like time spent waiting in the topics,
> > > > > or time from the event happening to the ETL producer choosing to
> > > > > send it, or time spent in send/receive buffers, etc., etc.).
> > > > >
> > > > > In other words, it's practically easier to compute spans by
> > > > > subtracting e2e latencies than it is to compute e2e latencies by
> > > > > adding spans. You can even just consider that the span computation
> > > > > from e2e always involves subtracting two numbers, whereas computing
> > > > > e2e latency from spans involves adding _all_ the spans leading up to
> > > > > the end you care about.
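
(To make the subtraction point above concrete: a quick illustrative sketch
with made-up numbers. The variable names and values are hypothetical, not
anything from the KIP or the codebase.)

    // Hedged sketch: e2e (staleness) readings observed at two consecutive
    // tasks; the span between them is one subtraction of two e2e values.
    public final class SpanFromE2eSketch {
        public static void main(final String[] args) {
            final double stalenessAtTaskB = 40.0;  // ms when the record reached task B
            final double stalenessAtTaskC = 65.0;  // ms when the record reached task C
            final double spanBToC = stalenessAtTaskC - stalenessAtTaskB;  // 25.0 ms
            // Going the other way, e2e(C) would require summing every span
            // upstream of C: span(source->A) + span(A->B) + span(B->C) + ...
            System.out.println("span(B -> C) = " + spanBToC + " ms");
        }
    }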
> > > > >
> > > > > It seems like people really prefer to have spans when they are
> > > > > debugging latency problems, whereas e2e latency is a more general
> > > > > measurement that basically every person/application cares about and
> > > > > should be monitoring.
> > > > >
> > > > > Altogether, it really seems to provide more value to more people
> > > > > if we report e2e latencies. Regarding "record-staleness" as a name,
> > > > > I think I have no preference; I'd defer to other people's intuition.
> > > > >
> > > > > G2:
> > > > > I think the processor-node metric is nice, since the inside of a
> > > > > task can introduce a significant amount of latency in some cases.
> > > > > Plus, it's a more direct measurement, if you really wanted to know
> > > > > (for the purposes of IQ or something) how long it takes source
> > > > > events to "show up" at the store.
> > > > >
> > > > > I think actually recording it at every processor could be
> > > > > expensive, but we already record a bunch of metrics at the node
> > > > > level.
> > > > >
> > > > > B1:
> > > > > I think 50% could be reasonable to record also. Even if it's a poor
> > > > > metric for operational purposes, a lot of people might expect to
> > > > > see "mean". Actually, I was surprised not to see "min". Is there a
> > > > > reason to leave it off?
> > > > >
> > > > > I might suggest:
> > > > > min, mean (50th), 75th, 99th, max
> > > > >
> > > > > B3:
> > > > > I agree we should include late records (though not the ones we
> > > > > drop). It may be spiky, but only when there are legitimately some
> > > > > records with a high end-to-end latency, which is the whole point of
> > > > > these metrics.
> > > > >
> > > > > That's it! I don't think I have any other feedback, other than a
> > > > > request to also report "min".
> > > > >
> > > > > Thanks,
> > > > > -John
> > > > >
> > > > > On Wed, May 13, 2020, at 16:58, Guozhang Wang wrote:
> > > > > > Thanks Sophie for the KIP, a few quick thoughts:
> > > > > >
> > > > > > 1) The end-to-end latency includes both the processing latency of
> > > > > > the task and the latency spent sitting in intermediate topics. I
> > > > > > have a similar feeling as Boyang mentioned above that the latency
> > > > > > metric of a task A actually measures the latency of the
> > > > > > sub-topology up to, but not including, the processing of A, which
> > > > > > is a bit weird.
> > > > > >
> > > > > > Maybe my feeling comes from the name "latency" itself, since
> > > > > > today we already have several "latency" metrics which are
> > > > > > measuring elapsed system-time for processing a record / etc.,
> > > > > > while here we are comparing the system wallclock time with the
> > > > > > record timestamp.
> > > > > >
> > > > > > Maybe we can consider renaming it as "record-staleness" (note we
> > > > > > already have a "record-lateness" metric), in which case recording
> > > > > > at the system-time before we start processing the record sounds
> > > > > > more natural.
> > > > > >
> > > > > > 2) With that in mind, I'm wondering if the processor-node-level
> > > > > > DEBUG metric is worth adding, given that we already have a
> > > > > > task-level processing latency metric. Basically, a specific
> > > > > > node's e2e latency is similar to the task-level e2e latency +
> > > > > > task-level processing latency. Personally I think having a
> > > > > > task-level record-staleness metric is sufficient.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, May 13, 2020 at 11:46 AM Sophie Blee-Goldman <
> > > > > sop...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > 1. I felt that 50% was not a particularly useful gauge for this
> > > > > > > specific metric, as it's presumably most useful at putting an
> > > > > > > *upper* bound on the latency you can reasonably expect to see. I
> > > > > > > chose percentiles that would hopefully give a good sense of what
> > > > > > > *most* records will experience, and what *close to all* records
> > > > > > > will.
> > > > > > >
> > > > > > > However, I'm not married to these specific numbers and could be
> > > > > > > convinced. Would be especially interested in hearing from users
> > > > > > > on this.
> > > > > > >
> > > > > > > 2. I'm inclined to not include the "hop-to-hop latency" in this
> > > > > > > KIP since users can always compute it themselves by subtracting
> > > > > > > the previous node's end-to-end latency. I guess we could do it
> > > > > > > either way since you can always compute one from the other, but
> > > > > > > I think the end-to-end latency feels more valuable as its main
> > > > > > > motivation is not to debug bottlenecks in the topology but to
> > > > > > > give users a sense of how long it takes a record to be reflected
> > > > > > > in certain parts of the topology. For example, this might be
> > > > > > > useful for users who are wondering roughly when a record that
> > > > > > > was just produced will be included in their IQ results.
> > > > > > > Debugging is just a nice side effect -- but maybe I didn't make
> > > > > > > that clear enough in the KIP's motivation.
> > > > > > >
> > > > > > > 3. Good question, I should address this in the KIP. The short
> > > > > > > answer is "yes", we will include late records. I added a
> > > > > > > paragraph to the end of the Proposed Changes section explaining
> > > > > > > the reasoning here; please let me know if you have any concerns.
> > > > > > >
> > > > > > > 4. Assuming you're referring to the existing metric
> > > > > > > "process-latency", that metric reflects the time for the literal
> > > > > > > Node#process method to run, whereas this metric would always be
> > > > > > > measured relative to the event timestamp.
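
(For concreteness, a rough sketch of the distinction being drawn here; the
variable and class names are purely illustrative and this is not how either
metric is actually implemented.)

    // Hedged sketch: the two measurements answer different questions.
    public final class LatencyVsStalenessSketch {
        public static void main(final String[] args) throws InterruptedException {
            // "process-latency"-style: wall-clock time spent inside one process() call.
            final long processStartNs = System.nanoTime();
            Thread.sleep(5);  // stand-in for Node#process doing work
            final long processLatencyNs = System.nanoTime() - processStartNs;

            // End-to-end-style: wall-clock "now" minus the record's event timestamp,
            // i.e. how old the record is when this task finally sees it.
            final long recordTimestampMs = System.currentTimeMillis() - 250;  // example event time
            final long endToEndLatencyMs = System.currentTimeMillis() - recordTimestampMs;

            System.out.println("process latency (ns): " + processLatencyNs);
            System.out.println("end-to-end latency (ms): " + endToEndLatencyMs);
        }
    }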
> > > > > > >
> > > > > > > That said, the naming collision there is pretty confusing, so
> > > > > > > I've renamed the metrics in this KIP to "end-to-end-latency",
> > > > > > > which I feel better reflects the nature of the metric anyway.
> > > > > > >
> > > > > > > Thanks for the feedback!
> > > > > > >
> > > > > > > On Wed, May 13, 2020 at 10:21 AM Boyang Chen <
> > > > > reluctanthero...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the KIP Sophie. Getting the E2E latency is
> > > > > > > > important for understanding the bottleneck of the application.
> > > > > > > >
> > > > > > > > A couple of questions and ideas:
> > > > > > > >
> > > > > > > > 1. Could you clarify the rationale of picking the 75, 99 and
> > > > > > > > max percentiles? Normally I see cases where we use the 50th
> > > > > > > > and 90th percentiles as well in production systems.
> > > > > > > >
> > > > > > > > 2. The current latency being computed is cumulative, i.e. if
> > > > > > > > a record goes through A -> B -> C, then P(C) = T(B->C) + P(B)
> > > > > > > > = T(B->C) + T(A->B) + T(A) and so on, where P() represents the
> > > > > > > > captured latency, and T() represents the time for transiting
> > > > > > > > the records between two nodes, including processing time. For
> > > > > > > > monitoring purposes, maybe having T(B->C) and T(A->B) is more
> > > > > > > > natural to view as "hop-to-hop latency"; otherwise, if there
> > > > > > > > is a spike in T(A->B), both P(B) and P(C) are affected at the
> > > > > > > > same time. In the same spirit, the E2E latency is meaningful
> > > > > > > > only when the record exits from the sink, as this marks the
> > > > > > > > whole time this record spent inside the funnel. Do you think
> > > > > > > > we could have separate treatment for sink nodes and other
> > > > > > > > nodes, so that other nodes only count the time receiving the
> > > > > > > > record from the last hop? I'm not proposing a solution here,
> > > > > > > > just want to discuss this alternative to see if it is
> > > > > > > > reasonable.
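
(A quick numeric sketch of the P()/T() relationship above, with made-up
values, just to make the cumulative nature concrete; none of the numbers or
names come from the KIP.)

    // Hedged sketch: P(X) is the cumulative (e2e) latency observed at X and
    // T(X->Y) is the hop time from X to Y, using hypothetical numbers.
    public final class CumulativeLatencySketch {
        public static void main(final String[] args) {
            final double tA = 10.0;   // ms, T(A): time before the record reaches A
            final double tAB = 15.0;  // ms, T(A->B)
            final double tBC = 25.0;  // ms, T(B->C)
            final double pA = tA;            // P(A)
            final double pB = tAB + pA;      // P(B) = T(A->B) + P(A) = 25.0 ms
            final double pC = tBC + pB;      // P(C) = T(B->C) + P(B) = 50.0 ms
            // A spike in T(A->B) shows up in both P(B) and P(C), which is the point above.
            System.out.println("P(A)=" + pA + " P(B)=" + pB + " P(C)=" + pC);
        }
    }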
> > > > > > > >
> > > > > > > > 3. As we are going to monitor late-arrival records as well,
> > > > > > > > they would create some really spiky graphs when the
> > > > > > > > out-of-order records are interleaving with on-time records.
> > > > > > > > Should we also supply a smoothed version of the latency
> > > > > > > > metrics, or should users just take care of it themselves?
> > > > > > > >
> > > > > > > > 4. Regarding this new metric, we haven't discussed its
> > > > > > > > relation with our existing processing latency metrics. Could
> > > > > > > > you add some context on the comparison, and a simple `when to
> > > > > > > > use which` tutorial?
> > > > > > > >
> > > > > > > > Boyang
> > > > > > > >
> > > > > > > > On Tue, May 12, 2020 at 7:28 PM Sophie Blee-Goldman <
> > > > > sop...@confluent.io
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey all,
> > > > > > > > >
> > > > > > > > > I'd like to kick off discussion on KIP-613, which aims to
> > > > > > > > > add end-to-end latency metrics to Streams. Please take a
> > > > > > > > > look:
> > > > > > > > >
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Sophie
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > >
> >
>
