G1:
I was considering it as the "end-to-end latency *up* to the specific task",
but I'm happy with "record-staleness" if that drives the point home better.
So it's the "staleness of the record when it is received by that task" --
will update the KIP.
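
To make sure we're on the same page about the measurement itself, here's a
minimal sketch of what I have in mind (class and method names are made up
for illustration, not the actual implementation):

    // Hypothetical sketch -- illustrative names only, not the real code.
    // The point is just *where* we measure: when the task receives the
    // record, we compare the current wall-clock time to the record's
    // event timestamp.
    public class RecordStalenessExample {
        static long recordStalenessMs(long recordTimestampMs, long wallClockNowMs) {
            return wallClockNowMs - recordTimestampMs;
        }

        public static void main(String[] args) {
            // e.g. a record stamped at t=0 that a task first sees 250ms later
            System.out.println(recordStalenessMs(0L, 250L) + "ms of staleness");
        }
    }

So a record stamped 12:00:00.000 that a downstream task first picks up at
12:00:00.250 would show up as 250ms of staleness at that task.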

B1/J:
I'm struggling to imagine a case where the min would actually be useful,
rather than just intellectually interesting. I don't feel strongly that we
shouldn't add it, but that's why I didn't include it from the start. Can you
enlighten me with an example?

I was also vaguely concerned about the overhead of adding multiple
percentile metrics. Do we have any data to indicate what kind of performance
hit we take on metrics computation?

Also, not to be too pedantic, but the 50th percentile would be the median,
not the mean. Would you propose to add the mean *and* the 50th percentile,
or just one of the two?

Thanks all!
Sophie

On Wed, May 13, 2020 at 3:34 PM John Roesler <vvcep...@apache.org> wrote:

> Hello all, and thanks for the KIP, Sophie,
>
> Just some comments on the discussion so far:
>
> B2/G1:
> In principle, it shouldn't matter whether we report "spans" or
> "end-to-end" latency. But in practice, some of the spans are pretty
> difficult to really measure (like time spent waiting in the topics, or
> time from the event happening to the ETL producer choosing to send it,
> or time spent in send/receive buffers, etc., etc.).
>
> In other words, it's practically easier to compute spans by subtracting
> e2e latencies than it is to compute e2e latencies by adding spans. You
> can even just consider that the span computation from e2e always just
> involves subtracting two numbers, whereas computing e2e latency from
> spans involves adding _all_ the spans leading up to the end you care about.
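>
> Just to make that arithmetic concrete with made-up numbers: if a record
> shows 40ms of e2e latency at the source task and 250ms at some downstream
> task, the span between those two points is simply 250 - 40 = 210ms. Going
> the other way, you'd have to sum every span along the path just to recover
> the 250ms.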
>
> It seems like people really prefer to have spans when they are debugging
> latency problems, whereas e2e latency is a more general measurement
> that basically every person/application cares about and should be
> monitoring.
>
> Altogether, it really seems to provide more value to more people if we
> report e2e latencies. Regarding "record-staleness" as a name, I think I
> have no preference; I'd defer to other peoples' intuition.
>
> G2:
> I think the processor-node metric is nice, since the inside of a task can
> introduce a significant amount of latency in some cases. Plus, it's a more
> direct measurement, if you really wanted to know (for the purposes of IQ
> or something) how long it takes source events to "show up" at the store.
>
> I think actually recording it at every processor could be expensive, but we
> already record a bunch of metrics at the node level.
>
> B1:
> I think 50% could be reasonable to record also. Even if it's a poor metric
> for operational purposes, a lot of people might expect to see "mean".
> Actually,
> I was surprised not to see "min". Is there a reason to leave it off?
>
> I might suggest:
> min, mean (50th), 75th, 99th, max
>
> B3:
> I agree we should include late records (though not the ones we drop).
> It may be spiky, but only when there are legitimately some records with a
> high end-to-end latency, which is the whole point of these metrics.
>
> That's it! I don't think I have any other feedback, other than a request to
> also report "min".
>
> Thanks,
> -John
>
> On Wed, May 13, 2020, at 16:58, Guozhang Wang wrote:
> > Thanks Sophie for the KIP, a few quick thoughts:
> >
> > 1) The end-to-end latency includes both the processing latency of the
> > task and the latency spent sitting in intermediate topics. I have a
> > similar feeling as Boyang mentioned above that the latency metric of a
> > task A actually measures the latency of the sub-topology up to, but not
> > including, the processing of A, which is a bit weird.
> >
> > Maybe my feeling comes from the name "latency" itself, since today we
> > already have several "latency" metrics which measure the elapsed
> > system-time for processing a record / etc., while here we are comparing
> > the system wallclock time with the record timestamp.
> >
> > Maybe we can consider renaming it as "record-staleness" (note we already
> > have a "record-lateness" metric), in which case recording at the
> > system-time before we start processing the record sounds more natural.
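> >
> > (In other words, roughly: staleness = system wallclock time when the task
> > receives the record - record timestamp, as opposed to the elapsed time of
> > any particular piece of processing.)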
> >
> > 2) With that in mind, I'm wondering if the processor-node-level DEBUG
> > metric is worth adding, given that we already have a task-level
> > processing latency metric. Basically, a specific node's e2e latency is
> > similar to the task-level e2e latency + task-level processing latency.
> > Personally I think having a task-level record-staleness metric is
> > sufficient.
> >
> >
> >
> > Guozhang
> >
> >
> >
> > On Wed, May 13, 2020 at 11:46 AM Sophie Blee-Goldman <sop...@confluent.io>
> > wrote:
> >
> > > 1. I felt that 50% was not a particularly useful gauge for this
> > > specific metric, as it's presumably most useful at putting an *upper*
> > > bound on the latency you can reasonably expect to see. I chose
> > > percentiles that would hopefully give a good sense of what *most*
> > > records will experience, and what *close to all* records will.
> > >
> > > However, I'm not married to these specific numbers and could be
> > > convinced. Would be especially interested in hearing from users on
> > > this.
> > >
> > > 2. I'm inclined to not include the "hop-to-hop latency" in this KIP
> > > since users can always compute it themselves by subtracting the
> > > previous node's end-to-end latency. I guess we could do it either way
> > > since you can always compute one from the other, but I think the
> > > end-to-end latency feels more valuable as its main motivation is not to
> > > debug bottlenecks in the topology but to give users a sense of how long
> > > it takes a record to be reflected in certain parts of the topology. For
> > > example, this might be useful for users who are wondering roughly when
> > > a record that was just produced will be included in their IQ results.
> > > Debugging is just a nice side effect -- but maybe I didn't make that
> > > clear enough in the KIP's motivation.
> > >
> > > 3. Good question, I should address this in the KIP. The short answer
> > > is "yes", we will include late records. I added a paragraph to the end
> > > of the Proposed Changes section explaining the reasoning here; please
> > > let me know if you have any concerns.
> > >
> > > 4. Assuming you're referring to the existing metric "process-latency",
> > > that metric reflects the time for the literal Node#process method to
> > > run, whereas this metric would always be measured relative to the
> > > event timestamp.
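> > >
> > > Spelled out (roughly, and just for illustration): process-latency ~=
> > > (time Node#process returns) - (time Node#process was invoked), whereas
> > > the metric in this KIP ~= (wallclock time at the measurement point) -
> > > (the record's event timestamp).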
> > >
> > > That said, the naming collision there is pretty confusing so I've
> > > renamed the metrics in this KIP to "end-to-end-latency", which I feel
> > > better reflects the nature of the metric anyway.
> > >
> > > Thanks for the feedback!
> > >
> > > On Wed, May 13, 2020 at 10:21 AM Boyang Chen <reluctanthero...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for the KIP, Sophie. Getting the E2E latency is important for
> > > > understanding the bottlenecks of the application.
> > > >
> > > > A couple of questions and ideas:
> > > >
> > > > 1. Could you clarify the rationale for picking the 75th, 99th, and
> > > > max percentiles? Normally I see cases where we use the 50th and 90th
> > > > percentiles as well in production systems.
> > > >
> > > > 2. The current latency being computed is cumulative, i.e. if a record
> > > > goes through A -> B -> C, then P(C) = T(B->C) + P(B) = T(B->C) +
> > > > T(A->B) + T(A) and so on, where P() represents the captured latency
> > > > and T() represents the time for transiting the record between two
> > > > nodes, including processing time. For monitoring purposes, maybe
> > > > T(B->C) and T(A->B) are more natural to view as "hop-to-hop latency";
> > > > otherwise, if there is a spike in T(A->B), both P(B) and P(C) are
> > > > affected at the same time. In the same spirit, the E2E latency is
> > > > meaningful only when the record exits from the sink, as this marks
> > > > the whole time the record spent inside the funnel. Do you think we
> > > > could have separate treatment for sink nodes and other nodes, so that
> > > > other nodes only count the time since receiving the record from the
> > > > last hop? I'm not proposing a solution here, just want to discuss
> > > > this alternative to see if it is reasonable.
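> > > >
> > > > (To make that concrete with made-up numbers: if T(A) = 10ms, T(A->B) =
> > > > 20ms and T(B->C) = 30ms, then P(B) = 30ms and P(C) = 60ms, so a spike
> > > > in T(A->B) shows up in both P(B) and P(C) even though the B->C hop
> > > > itself is fine.)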
> > > >
> > > > 3. As we are going to monitor late-arriving records as well, they
> > > > would create some really spiky graphs when out-of-order records are
> > > > interleaved with on-time records. Should we also supply a smoothed
> > > > version of the latency metrics, or should users just take care of it
> > > > themselves?
> > > >
> > > > 4. Regarding this new metric, we haven't discussed its relation to
> > > > our existing processing latency metrics. Could you add some context
> > > > comparing them, and a simple `when to use which` guideline?
> > > >
> > > > Boyang
> > > >
> > > > On Tue, May 12, 2020 at 7:28 PM Sophie Blee-Goldman <sop...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > I'd like to kick off discussion on KIP-613 which aims to add
> > > > > end-to-end latency metrics to Streams. Please take a look:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
> > > > >
> > > > > Cheers,
> > > > > Sophie
> > > > >
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>
