Thanks for the KIP, Sophie.

I don't think it's useful to record the avg/mean; it's sensitive to
outliers. We should rather track the median (50th percentile).

I'm not sure tracking the min is useful, but I'm also ok with tracking it.

However, I find it odd to track the 75th percentile. Standard measures
would be the 90th or 95th -- I guess we don't need both, so maybe
picking the 90th might be more useful?
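
Just to make this concrete: below is a rough sketch of how a sensor
with min/max plus a couple of percentiles could be registered via the
client metrics library (org.apache.kafka.common.metrics). The sensor
and metric names as well as the histogram sizing are made up by me for
illustration, not taken from the KIP:

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Max;
import org.apache.kafka.common.metrics.stats.Min;
import org.apache.kafka.common.metrics.stats.Percentile;
import org.apache.kafka.common.metrics.stats.Percentiles;
import org.apache.kafka.common.metrics.stats.Percentiles.BucketSizing;

public class E2eLatencySensorSketch {

    static Sensor registerE2eLatencySensor(final Metrics metrics,
                                           final String group) {
        final Sensor sensor = metrics.sensor("record-e2e-latency");
        sensor.add(metrics.metricName("record-e2e-latency-min", group),
                   new Min());
        sensor.add(metrics.metricName("record-e2e-latency-max", group),
                   new Max());
        // one Percentiles stat can expose several percentiles at once;
        // here: median and 90th, linear buckets up to 10 minutes
        sensor.add(new Percentiles(
            4000,                // bytes of histogram memory
            600_000,             // max expected value, in ms
            BucketSizing.LINEAR,
            new Percentile(
                metrics.metricName("record-e2e-latency-p50", group), 50),
            new Percentile(
                metrics.metricName("record-e2e-latency-p90", group), 90)));
        return sensor;
    }
}

Recording would then just be `sensor.record(wallClockMs -
recordTimestampMs)` wherever we decide to take the measurement.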

About the name: "staleness" sounds really odd, and in fact the metric
does capture "latency", so we should call it "latency". I understand the
issue that we already have a latency metric, so maybe we could call it
`record-e2e-latency-*` ?

While I agree that we should include out-of-order data (the KIP should
talk about `out-of-order` data, not `late` data; data is only `late` if
it's out-of-order and if it's dropped), I don't really understand why
the new metric would help with configuring grace period or retention
time. As you mention in the KIP, both are defined as the max difference
of `event-time - stream-time`, and thus the new metric, which takes
system-/wall-clock time into account, does not seem to help at all.
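
To spell out why I don't see the connection, this is roughly how I
read the two quantities (a minimal sketch; the names are mine, not from
the KIP or the code base):

// Illustrative only -- names are made up.
class TimeSemantics {

    // what the proposed metric measures:
    // wall-clock time vs. the record's (event) timestamp
    static long e2eStalenessMs(final long wallClockMs,
                               final long recordTimestampMs) {
        return wallClockMs - recordTimestampMs;
    }

    // what grace period and retention time bound:
    // observed stream-time vs. the record's (event) timestamp
    static long latenessMs(final long streamTimeMs,
                           final long recordTimestampMs) {
        return Math.max(0, streamTimeMs - recordTimestampMs);
    }
}

The first depends on wall-clock time, the second only on observed
record timestamps, so a bound on one does not translate into a bound
on the other.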


Btw: there is a great talk about "How NOT to Measure Latency" by Gil
Tene: https://www.youtube.com/watch?v=lJ8ydIuPFeU


-Matthias


On 5/14/20 7:17 PM, John Roesler wrote:
> Hi Sophie,
> 
> It seems like there would still be plenty of use cases for recording
> this metric at all processors and not just stateful ones, but I'm happy
> to suspend my arguments for now. Since you're proposing to keep
> them at the processor-node level, it will be seamless later to add
> in the stateless processors if we want. As a wise man once said,
> "Adding is always easier than removing."
> 
> Regarding the time measurement, it's an implementation detail
> we don't need to consider in the KIP. Nevertheless, I'd greatly
> prefer to measure the system time again when recording the
> metric. I don't think we've seen any evidence that proves this
> would harm performance, and the amount of inaccuracy using
> the cached system time could incur is actually substantial. But,
> if you want to just "not mention this" in the KIP, we can defer to
> the actual PR discussion, at which time we're in a better position
> to use benchmarks, etc., to make the call.
> 
> Along the lines of the measurement accuracy discussion, one
> minor thought I had is that maybe we should consider measuring
> the task staleness metric at the sink, rather than the source, so that
> it includes the processing latency of the task itself, not just the latency
> of everything up to, but not including, the task (which seems confusing
> for users). I guess this could also be an implementation detail, though.
> 
> Thanks for the update,
> -John
> 
> On Thu, May 14, 2020, at 13:31, Sophie Blee-Goldman wrote:
>> Hey all,
>>
>> After discussing with Bruno I'd like to propose a small amendment,
>> which is to record the processor-node-level metrics only for *stateful*
>> *operators*. They would still be considered a "processor-node-level"
>> metric and not a "state-store-level" metric as the staleness is still
>> a property of the node rather than of the state itself. However, it seems
>> that this information is primarily useful for stateful operators that might
>> be exposing state via IQ or otherwise dependent on the record time
>> unlike a stateless operator.
>>
>> It's worth calling out that recent performance improvements to the metrics
>> framework mean that we no longer fetch the system time at the operator
>> level, but only once per task. In other words the system time is not updated
>> between each process as a record flows through the subtopology, so
>> debugging the processor-level latency via the staleness will not be
>> possible. Note that this doesn't mean the operator-level metrics are not
>> *useful* relative to the task-level metric. Upstream caching and/or
>> suppression
>> can still cause a record's staleness at some downstream stateful operator
>> to deviate from the task-level staleness (recorded at the source node).
>>
>> Please let me know if you have any concerns about this change. The
>> KIP has been updated with the new proposal.
>>
>> On Thu, May 14, 2020 at 3:04 AM Bruno Cadonna <br...@confluent.io> wrote:
>>
>>> Hi Sophie,
>>>
>>> Thank you for the KIP.
>>>
>>> The KIP looks good to me.
>>>
>>> 50th percentile:
>>> I think we do not need it now. If we need it, we can add it. Here the
>>> old truism applies: Adding is always easier than removing.
>>>
>>> processor-node-level metrics:
>>> I think it is good to have the staleness metrics also on
>>> processor-node-level. If we do not want to record them on all
>>> processor nodes, you could restrict the recording to stateful
>>> processor-nodes, since those are the ones that would benefit most from
>>> the staleness metrics.
>>>
>>> Best,
>>> Bruno
>>>
>>> On Thu, May 14, 2020 at 4:15 AM Sophie Blee-Goldman <sop...@confluent.io>
>>> wrote:
>>>>
>>>> Yeah, the specific reason was just to align with the current metrics.
>>>>
>>>> Is it better to conform than to be right? History has a lot to say on
>>> that
>>>> matter
>>>> but I'm not sure how much of it applies to the fine details of metrics
>>>> naming :P
>>>>
>>>> More seriously, I figured if people are looking at this metric they're
>>>> likely to
>>>> be looking at all the others. Then naming this one "-mean" would probably
>>>> lead some to conclude that the "-avg" suffix in the other metrics has a
>>>> different meaning.
>>>>
>>>> As for the percentiles, I actually like p99 (and p75) better. I'll swap
>>>> that out
>>>>
>>>> On Wed, May 13, 2020 at 7:07 PM John Roesler <vvcep...@apache.org>
>>> wrote:
>>>>
>>>>> Thanks Sophie,
>>>>>
>>>>> I hope this isn't too nit-picky, but is there a reason to choose "avg"
>>>>> instead
>>>>> of "mean"? Maybe this is too paranoid, and I might be oversensitive
>>> because
>>>>> of the mistake I just made earlier, but it strikes me that "avg" is
>>>>> actually
>>>>> ambiguous, as it refers to a family of statistics, whereas "mean" is
>>>>> specific.
>>>>> I see other Kafka metrics with "avg", but none with "mean"; was that
>>> the
>>>>> reason? If so, I'm +1.
>>>>>
>>>>> Regarding the names of the percentile, I actually couldn't find _any_
>>> other
>>>>> metrics that use percentile. Was there a reason to choose "99th" as
>>> opposed
>>>>> to "p99" or any other scheme? This is not a criticism, I'm just
>>> primarily
>>>>> asking
>>>>> for consistency's sake.
>>>>>
>>>>> Thanks again,
>>>>> -John
>>>>>
>>>>> On Wed, May 13, 2020, at 19:19, Sophie Blee-Goldman wrote:
>>>>>> Alright, I can get behind adding the min metric for the sake of
>>> pretty
>>>>>> graphs
>>>>>> (and trivial computation).
>>>>>>
>>>>>> I'm still on the fence regarding the mean (or 50th percentile) but I
>>> can
>>>>> see
>>>>>> how users might expect it and find it a bit disorienting not to
>>> have. So
>>>>> the
>>>>>> updated proposed metrics are
>>>>>>
>>>>>>
>>>>>>    - record-staleness-max [ms]
>>>>>>    - record-staleness-99th [ms] *(99th percentile)*
>>>>>>    - record-staleness-75th [ms] *(75th percentile)*
>>>>>>    - record-staleness-avg [ms] *(mean)*
>>>>>>    - record-staleness-min [ms]
>>>>>>
>>>>>>
>>>>>> On Wed, May 13, 2020 at 4:42 PM John Roesler <vvcep...@apache.org>
>>>>> wrote:
>>>>>>
>>>>>>> Oh boy, I never miss an opportunity to embarrass myself. I guess
>>> the
>>>>> mean
>>>>>>> seems more interesting to me than the median, but neither are as
>>>>>>> interesting as the higher percentiles (99th and max).
>>>>>>>
>>>>>>> Min isn’t really important for any SLAs, but it does round out the
>>>>> mental
>>>>>>> picture of the distribution. I’ve always graphed min along with the
>>>>> other
>>>>>>> metrics to help me understand how fast the system can be, which
>>> helps
>>>>> in
>>>>>>> optimization decisions. It’s also a relatively inexpensive metric
>>> to
>>>>>>> compute, so it might be nice to just throw it in.
>>>>>>>
>>>>>>> On Wed, May 13, 2020, at 18:18, Sophie Blee-Goldman wrote:
>>>>>>>> G1:
>>>>>>>> I was considering it as the "end-to-end latency *up* to the
>>> specific
>>>>>>> task"
>>>>>>>> but
>>>>>>>> I'm happy with "record-staleness" if that drives the point home
>>>>> better.
>>>>>>> So
>>>>>>>> it's the
>>>>>>>> "staleness of the record when it is received by that task" --
>>> will
>>>>> update
>>>>>>>> the KIP
>>>>>>>>
>>>>>>>> B1/J:
>>>>>>>> I'm struggling to imagine a case where the min would actually be
>>>>> useful,
>>>>>>>> rather than
>>>>>>>> just intellectually interesting. I don't feel strongly that we
>>>>> shouldn't
>>>>>>>> add it, but that's
>>>>>>>> why I didn't include it from the start. Can you enlighten me
>>> with an
>>>>>>>> example?
>>>>>>>>
>>>>>>>> I was also vaguely concerned about the overhead of adding
>>> multiple
>>>>>>>> percentile
>>>>>>>> metrics. Do we have any data to indicate what kind of performance
>>>>> hit we
>>>>>>>> take on
>>>>>>>> metrics computation?
>>>>>>>>
>>>>>>>> Also, not to be too pedantic but the 50th percentile would be the
>>>>> median
>>>>>>>> not the
>>>>>>>> mean. Would you propose to add the mean *and* the 50th
>>> percentile, or
>>>>>>> just
>>>>>>>> one
>>>>>>>> of the two?
>>>>>>>>
>>>>>>>> Thanks all!
>>>>>>>> Sophie
>>>>>>>>
>>>>>>>> On Wed, May 13, 2020 at 3:34 PM John Roesler <
>>> vvcep...@apache.org>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello all, and thanks for the KIP, Sophie,
>>>>>>>>>
>>>>>>>>> Just some comments on the discussion so far:
>>>>>>>>>
>>>>>>>>> B2/G1:
>>>>>>>>> In principle, it shouldn't matter whether we report "spans" or
>>>>>>>>> "end-to-end" latency. But in practice, some of the spans are
>>> pretty
>>>>>>>>> difficult to really measure (like time spent waiting in the
>>>>> topics, or
>>>>>>>>> time from the event happening to the ETL producer choosing to
>>> send
>>>>> it,
>>>>>>>>> or time spent in send/receive buffers, etc., etc.).
>>>>>>>>>
>>>>>>>>> In other words, it's practically easier to compute spans by
>>>>> subtracting
>>>>>>>>> e2e latencies than it is to compute e2e latencies by adding
>>> spans.
>>>>> You
>>>>>>>>> can even just consider that the span computation from e2e
>>> always
>>>>> just
>>>>>>>>> involves subtracting two numbers, whereas computing e2e latency
>>>>> from
>>>>>>>>> spans involves adding _all_ the spans leading up to the end you
>>>>> care
>>>>>>> about.
>>>>>>>>>
>>>>>>>>> It seems like people really prefer to have spans when they are
>>>>>>> debugging
>>>>>>>>> latency problems, whereas e2e latency is a more general
>>> measurement
>>>>>>>>> that basically every person/application cares about and should
>>> be
>>>>>>>>> monitoring.
>>>>>>>>>
>>>>>>>>> Altogether, it really seems to provide more value to more
>>> people if
>>>>> we
>>>>>>>>> report
>>>>>>>>> e2e latencies. Regarding "record-staleness" as a name, I think
>>> I
>>>>> have
>>>>>>> no
>>>>>>>>> preference, I'd defer to other peoples' intuition.
>>>>>>>>>
>>>>>>>>> G2:
>>>>>>>>> I think the processor-node metric is nice, since the inside of
>>> a
>>>>> task
>>>>>>> can
>>>>>>>>> introduce a significant amount of latency in some cases. Plus,
>>>>> it's a
>>>>>>> more
>>>>>>>>> direct measurement, if you really wanted to know (for the
>>> purposes
>>>>> of
>>>>>>> IQ
>>>>>>>>> or something) how long it takes source events to "show up" at
>>> the
>>>>>>> store.
>>>>>>>>>
>>>>>>>>> I think actually recording it at every processor could be
>>>>> expensive,
>>>>>>> but we
>>>>>>>>> already record a bunch of metrics at the node level.
>>>>>>>>>
>>>>>>>>> B1:
>>>>>>>>> I think 50% could be reasonable to record also. Even if it's a
>>> poor
>>>>>>> metric
>>>>>>>>> for operational purposes, a lot of people might expect to see
>>>>> "mean".
>>>>>>>>> Actually,
>>>>>>>>> I was surprised not to see "min". Is there a reason to leave it
>>>>> off?
>>>>>>>>>
>>>>>>>>> I might suggest:
>>>>>>>>> min, mean (50th), 75th, 99th, max
>>>>>>>>>
>>>>>>>>> B3:
>>>>>>>>> I agree we should include late records (though not the ones we
>>>>> drop).
>>>>>>>>> It may be spiky, but only when there are legitimately some
>>> records
>>>>>>> with a
>>>>>>>>> high end-to-end latency, which is the whole point of these
>>> metrics.
>>>>>>>>>
>>>>>>>>> That's it! I don't think I have any other feedback, other than
>>> a
>>>>>>> request to
>>>>>>>>> also report "min".
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>> On Wed, May 13, 2020, at 16:58, Guozhang Wang wrote:
>>>>>>>>>> Thanks Sophie for the KIP, a few quick thoughts:
>>>>>>>>>>
>>>>>>>>>> 1) The end-to-end latency includes both the processing
>>> latency
>>>>> of the
>>>>>>>>> task
>>>>>>>>>> and the latency spent sitting in intermediate topics. I have
>>> a
>>>>>>> similar
>>>>>>>>>> feeling as Boyang mentioned above that the latency metric of
>>> a
>>>>> task A
>>>>>>>>>> actually measures the latency of the sub-topology up-to but
>>> not
>>>>>>> including
>>>>>>>>>> the processing of A, which is a bit weird.
>>>>>>>>>>
>>>>>>>>>> Maybe my feeling comes from the name "latency" itself, since
>>> since
>>>>>>> today we
>>>>>>>>>> already have several "latency" metrics which are
>>>>> measuring
>>>>>>>>> elapsed
>>>>>>>>>> system-time for processing a record / etc, while here we are
>>>>>>> comparing
>>>>>>>>> the
>>>>>>>>>> system wallclock time with the record timestamp.
>>>>>>>>>>
>>>>>>>>>> Maybe we can consider renaming it as "record-staleness"
>>> (note we
>>>>>>> already
>>>>>>>>>> have a "record-lateness" metric), in which case recording at
>>> the
>>>>>>>>>> system-time before we start processing the record sounds more
>>>>>>> natural.
>>>>>>>>>>
>>>>>>>>>> 2) With that in mind, I'm wondering if the
>>> processor-node-level
>>>>> DEBUG
>>>>>>>>>> metric is worth adding, given that we already have a
>>> task-level
>>>>>>>>> processing
>>>>>>>>>> latency metric. Basically, a specific node's e2e latency is
>>>>> similar
>>>>>>> to
>>>>>>>>> the
>>>>>>>>>> task-level e2e latency + task-level processing latency.
>>>>> Personally I
>>>>>>>>> think
>>>>>>>>>> having a task-level record-staleness metric is sufficient.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 13, 2020 at 11:46 AM Sophie Blee-Goldman <
>>>>>>>>> sop...@confluent.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> 1. I felt that 50% was not a particularly useful gauge for
>>> this
>>>>>>>>> specific
>>>>>>>>>>> metric, as
>>>>>>>>>>> it's presumably most useful at putting an *upper* bound on
>>> the
>>>>>>> latency
>>>>>>>>> you
>>>>>>>>>>> can
>>>>>>>>>>> reasonably expect to see. I chose percentiles that would
>>>>> hopefully
>>>>>>>>> give a
>>>>>>>>>>> good
>>>>>>>>>>> sense of what *most* records will experience, and what
>>> *close
>>>>> to
>>>>>>> all*
>>>>>>>>>>> records
>>>>>>>>>>> will.
>>>>>>>>>>>
>>>>>>>>>>> However I'm not married to these specific numbers and
>>> could be
>>>>>>>>> convinced.
>>>>>>>>>>> Would be especially interested in hearing from users on
>>> this.
>>>>>>>>>>>
>>>>>>>>>>> 2. I'm inclined to not include the "hop-to-hop latency" in
>>>>> this KIP
>>>>>>>>> since
>>>>>>>>>>> users
>>>>>>>>>>> can always compute it themselves by subtracting the
>>> previous
>>>>> node's
>>>>>>>>>>> end-to-end latency. I guess we could do it either way since
>>>>> you can
>>>>>>>>> always
>>>>>>>>>>> compute one from the other, but I think the end-to-end
>>> latency
>>>>>>> feels
>>>>>>>>> more
>>>>>>>>>>> valuable as its main motivation is not to debug
>>> bottlenecks
>>>>> in the
>>>>>>>>>>> topology but
>>>>>>>>>>> to give users a sense of how long it takes a record to be
>>>>> reflected
>>>>>>> in
>>>>>>>>>>> certain parts
>>>>>>>>>>> of the topology. For example this might be useful for users
>>>>> who are
>>>>>>>>>>> wondering
>>>>>>>>>>> roughly when a record that was just produced will be
>>> included
>>>>> in
>>>>>>> their
>>>>>>>>> IQ
>>>>>>>>>>> results.
>>>>>>>>>>> Debugging is just a nice side effect -- but maybe I didn't
>>> make
>>>>>>> that
>>>>>>>>> clear
>>>>>>>>>>> enough
>>>>>>>>>>> in the KIP's motivation.
>>>>>>>>>>>
>>>>>>>>>>> 3. Good question, I should address this in the KIP. The
>>> short
>>>>>>> answer is
>>>>>>>>>>> "yes",
>>>>>>>>>>> we will include late records. I added a paragraph to the
>>> end
>>>>> of the
>>>>>>>>>>> Proposed
>>>>>>>>>>> Changes section explaining the reasoning here, please let
>>> me
>>>>> know
>>>>>>> if
>>>>>>>>> you
>>>>>>>>>>> have
>>>>>>>>>>> any concerns.
>>>>>>>>>>>
>>>>>>>>>>> 4. Assuming you're referring to the existing metric
>>>>>>> "process-latency",
>>>>>>>>> that
>>>>>>>>>>> metric
>>>>>>>>>>> reflects the time for the literal Node#process method to
>>> run
>>>>>>> whereas
>>>>>>>>> this
>>>>>>>>>>> metric
>>>>>>>>>>> would always be measured relative to the event timestamp.
>>>>>>>>>>>
>>>>>>>>>>> That said, the naming collision there is pretty confusing
>>> so
>>>>> I've
>>>>>>>>> renamed
>>>>>>>>>>> the
>>>>>>>>>>> metrics in this KIP to "end-to-end-latency" which I feel
>>> better
>>>>>>>>> reflects
>>>>>>>>>>> the nature
>>>>>>>>>>> of the metric anyway.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the feedback!
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 13, 2020 at 10:21 AM Boyang Chen <
>>>>>>>>> reluctanthero...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the KIP Sophie. Getting the E2E latency is
>>>>> important
>>>>>>> for
>>>>>>>>>>>> understanding the bottleneck of the application.
>>>>>>>>>>>>
>>>>>>>>>>>> A couple of questions and ideas:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Could you clarify the rationale of picking 75, 99 and
>>> max
>>>>>>>>> percentiles?
>>>>>>>>>>>> Normally I see cases where we use 50, 90 percentile as
>>> well
>>>>> in
>>>>>>>>> production
>>>>>>>>>>>> systems.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. The current latency being computed is cumulative, i.e.
>>> if a
>>>>>>> record
>>>>>>>>> goes
>>>>>>>>>>>> through A -> B -> C, then P(C) = T(B->C) + P(B) =
>>> T(B->C) +
>>>>>>> T(A->B) +
>>>>>>>>>>> T(A)
>>>>>>>>>>>> and so on, where P() represents the captured latency,
>>> and T()
>>>>>>>>> represents
>>>>>>>>>>>> the time for transiting the records between two nodes,
>>>>> including
>>>>>>>>>>> processing
>>>>>>>>>>>> time. For monitoring purpose, maybe having T(B->C) and
>>>>> T(A->B)
>>>>>>> are
>>>>>>>>> more
>>>>>>>>>>>> natural to view as "hop-to-hop latency", otherwise if
>>> there
>>>>> is a
>>>>>>>>> spike in
>>>>>>>>>>>> T(A->B), both P(B) and P(C) are affected at the same
>>> time.
>>>>> In the
>>>>>>>>> same
>>>>>>>>>>>> spirit, the E2E latency is meaningful only when the
>>> record
>>>>> exits
>>>>>>>>> from the
>>>>>>>>>>>> sink as this marks the whole time this record spent
>>> inside
>>>>> the
>>>>>>>>> funnel. Do
>>>>>>>>>>>> you think we could have separate treatment for sink
>>> nodes and
>>>>>>> other
>>>>>>>>>>>> nodes, so that other nodes only count the time receiving
>>> the
>>>>>>> record
>>>>>>>>> from
>>>>>>>>>>>> last hop? I'm not proposing a solution here, just want to
>>>>> discuss
>>>>>>>>> this
>>>>>>>>>>>> alternative to see if it is reasonable.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. As we are going to monitor late arrival records as
>>> well,
>>>>> they
>>>>>>>>> would
>>>>>>>>>>>> create some really spiky graphs when the out-of-order
>>>>> records are
>>>>>>>>>>>> interleaving with on time records. Should we also supply
>>> a
>>>>> smooth
>>>>>>>>> version
>>>>>>>>>>>> of the latency metrics, or should users just take care of
>>> it
>>>>> by
>>>>>>> themselves?
>>>>>>>>>>>>
>>>>>>>>>>>> 4. Regarding this new metrics, we haven't discussed its
>>>>> relation
>>>>>>>>> with our
>>>>>>>>>>>> existing processing latency metrics, could you add some
>>>>> context
>>>>>>> on
>>>>>>>>>>>> comparison and a simple `when to use which` tutorial for
>>> the
>>>>>>> best?
>>>>>>>>>>>>
>>>>>>>>>>>> Boyang
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 12, 2020 at 7:28 PM Sophie Blee-Goldman <
>>>>>>>>> sop...@confluent.io
>>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to kick off discussion on KIP-613 which aims
>>> to
>>>>> add
>>>>>>>>> end-to-end
>>>>>>>>>>>>> latency metrics to Streams. Please take a look:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
