Thanks for the KIP, Sophie. I think it's not useful to record the
avg/mean; it is sensitive to outliers. We should rather track the
median (50th percentile).
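(Quick numeric example: for per-record latencies of 5, 5, 5, 5, and
1000 ms, the mean is 204 ms while the median is 5 ms -- a single
outlier dominates the mean.)

Tracking a median or 90th percentile should also be cheap, since the
metrics library already ships a `Percentiles` compound stat that
maintains one histogram and serves several percentiles from it. Just a
rough sketch of what I mean -- not a proposal for how the PR should
wire it up; the metric names, group, and histogram sizing below are
made up:

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Percentile;
import org.apache.kafka.common.metrics.stats.Percentiles;
import org.apache.kafka.common.metrics.stats.Percentiles.BucketSizing;

public class E2eLatencyPercentilesSketch {
    public static void main(String[] args) {
        try (Metrics metrics = new Metrics()) {
            Sensor sensor = metrics.sensor("record-e2e-latency");
            // one compound stat maintains a single histogram and exposes
            // several percentiles from it; 1000 bytes of buckets and a
            // 10-minute upper bound are arbitrary choices for illustration
            sensor.add(new Percentiles(1000, 10 * 60 * 1000d, BucketSizing.LINEAR,
                new Percentile(metrics.metricName("record-e2e-latency-p50",
                    "stream-metrics", "median end-to-end latency in ms"), 50),
                new Percentile(metrics.metricName("record-e2e-latency-p90",
                    "stream-metrics", "90th percentile end-to-end latency in ms"), 90)));

            // the metric value itself: wallclock time minus the record's
            // event timestamp (stand-in value here)
            long eventTimestamp = System.currentTimeMillis() - 42;
            sensor.record(System.currentTimeMillis() - eventTimestamp);
        }
    }
}

Since all percentiles are served from the same histogram, adding a
second or third percentile to the sensor costs almost nothing extra.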
Not sure if tracking min is useful, but I am also ok to track it.
However, I find it odd to track the 75th percentile. Standard measures
would be the 90th or 95th -- I guess we don't need both, so maybe
picking the 90th might be more useful?

About the name: "staleness" sounds really odd, and in fact the metric
does capture "latency", so we should call it "latency". I understand
the issue that we already have a latency metric. So maybe we could
call it `record-e2e-latency-*`?

While I agree that we should include out-of-order data (the KIP should
talk about `out-of-order` data, not `late` data; data is only `late`
if it's out-of-order and if it's dropped), I don't really understand
why the new metric would help to configure grace period or retention
time. As you mention in the KIP, both are defined as the max
difference of `event-time - stream-time`, and thus the new metric,
which takes system-/wallclock-time into account, does not seem to help
at all.

Btw: there is a great talk, "How NOT to Measure Latency" by Gil Tene:
https://www.youtube.com/watch?v=lJ8ydIuPFeU

-Matthias

On 5/14/20 7:17 PM, John Roesler wrote:
> Hi Sophie,
>
> It seems like there would still be plenty of use cases for recording
> this metric at all processors and not just stateful ones, but I'm happy
> to suspend my arguments for now. Since you're proposing to keep
> them at the processor-node level, it will be seamless later to add
> in the stateless processors if we want. As a wise man once said,
> "Adding is always easier than removing."
>
> Regarding the time measurement, it's an implementation detail
> we don't need to consider in the KIP. Nevertheless, I'd greatly
> prefer to measure the system time again when recording the
> metric. I don't think we've seen any evidence that proves this
> would harm performance, and the amount of inaccuracy that using
> the cached system time could incur is actually substantial. But,
> if you want to just "not mention this" in the KIP, we can defer to
> the actual PR discussion, at which time we're in a better position
> to use benchmarks, etc., to make the call.
>
> Along the lines of the measurement accuracy discussion, one
> minor thought I had is that maybe we should consider measuring
> the task staleness metric at the sink, rather than the source, so that
> it includes the processing latency of the task itself, not just the latency
> of everything up to, but not including, the task (which seems confusing
> for users). I guess this could also be an implementation detail, though.
>
> Thanks for the update,
> -John
>
> On Thu, May 14, 2020, at 13:31, Sophie Blee-Goldman wrote:
>> Hey all,
>>
>> After discussing with Bruno I'd like to propose a small amendment,
>> which is to record the processor-node-level metrics only for *stateful*
>> *operators*. They would still be considered a "processor-node-level"
>> metric and not a "state-store-level" metric, as the staleness is still
>> a property of the node rather than of the state itself. However, it seems
>> that this information is primarily useful for stateful operators that might
>> be exposing state via IQ or otherwise dependent on the record time,
>> unlike a stateless operator.
>>
>> It's worth calling out that recent performance improvements to the metrics
>> framework mean that we no longer fetch the system time at the operator
>> level, but only once per task.
>> In other words, the system time is not updated
>> between each process as a record flows through the subtopology, so
>> debugging the processor-level latency via the staleness will not be
>> possible. Note that this doesn't mean the operator-level metrics are not
>> *useful* relative to the task-level metric. Upstream caching and/or
>> suppression can still cause a record's staleness at some downstream
>> stateful operator to deviate from the task-level staleness (recorded
>> at the source node).
>>
>> Please let me know if you have any concerns about this change. The
>> KIP has been updated with the new proposal.
>>
>> On Thu, May 14, 2020 at 3:04 AM Bruno Cadonna <br...@confluent.io> wrote:
>>
>>> Hi Sophie,
>>>
>>> Thank you for the KIP.
>>>
>>> The KIP looks good to me.
>>>
>>> 50th percentile:
>>> I think we do not need it now. If we need it, we can add it. Here the
>>> old truism applies: Adding is always easier than removing.
>>>
>>> processor-node-level metrics:
>>> I think it is good to have the staleness metrics also on the
>>> processor-node level. If we do not want to record them on all
>>> processor nodes, you could restrict the recording to stateful
>>> processor nodes, since those are the ones that would benefit most from
>>> the staleness metrics.
>>>
>>> Best,
>>> Bruno
>>>
>>> On Thu, May 14, 2020 at 4:15 AM Sophie Blee-Goldman <sop...@confluent.io>
>>> wrote:
>>>>
>>>> Yeah, the specific reason was just to align with the current metrics.
>>>>
>>>> Is it better to conform than to be right? History has a lot to say on
>>>> that matter, but I'm not sure how much of it applies to the fine
>>>> details of metrics naming :P
>>>>
>>>> More seriously, I figured if people are looking at this metric they're
>>>> likely to be looking at all the others. Then naming this one "-mean"
>>>> would probably lead some to conclude that the "-avg" suffix in the
>>>> other metrics has a different meaning.
>>>>
>>>> As for the percentiles, I actually like p99 (and p75) better. I'll
>>>> swap that out.
>>>>
>>>> On Wed, May 13, 2020 at 7:07 PM John Roesler <vvcep...@apache.org>
>>>> wrote:
>>>>
>>>>> Thanks Sophie,
>>>>>
>>>>> I hope this isn't too nit-picky, but is there a reason to choose "avg"
>>>>> instead of "mean"? Maybe this is too paranoid, and I might be
>>>>> oversensitive because of the mistake I just made earlier, but it
>>>>> strikes me that "avg" is actually ambiguous, as it refers to a family
>>>>> of statistics, whereas "mean" is specific. I see other Kafka metrics
>>>>> with "avg", but none with "mean"; was that the reason? If so, I'm +1.
>>>>>
>>>>> Regarding the names of the percentiles, I actually couldn't find
>>>>> _any_ other metrics that use percentiles. Was there a reason to
>>>>> choose "99th" as opposed to "p99" or any other scheme? This is not a
>>>>> criticism, I'm just primarily asking for consistency's sake.
>>>>>
>>>>> Thanks again,
>>>>> -John
>>>>>
>>>>> On Wed, May 13, 2020, at 19:19, Sophie Blee-Goldman wrote:
>>>>>> Alright, I can get behind adding the min metric for the sake of
>>>>>> pretty graphs (and trivial computation).
>>>>>>
>>>>>> I'm still on the fence regarding the mean (or 50th percentile) but I
>>>>>> can see how users might expect it and find it a bit disorienting not
>>>>>> to have it.
>>>>>> So the updated proposed metrics are:
>>>>>>
>>>>>> - record-staleness-max [ms]
>>>>>> - record-staleness-99th [ms] *(99th percentile)*
>>>>>> - record-staleness-75th [ms] *(75th percentile)*
>>>>>> - record-staleness-avg [ms] *(mean)*
>>>>>> - record-staleness-min [ms]
>>>>>>
>>>>>> On Wed, May 13, 2020 at 4:42 PM John Roesler <vvcep...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Oh boy, I never miss an opportunity to embarrass myself. I guess
>>>>>>> the mean seems more interesting to me than the median, but neither
>>>>>>> is as interesting as the higher percentiles (99th and max).
>>>>>>>
>>>>>>> Min isn’t really important for any SLAs, but it does round out the
>>>>>>> mental picture of the distribution. I’ve always graphed min along
>>>>>>> with the other metrics to help me understand how fast the system
>>>>>>> can be, which helps in optimization decisions. It’s also a
>>>>>>> relatively inexpensive metric to compute, so it might be nice to
>>>>>>> just throw it in.
>>>>>>>
>>>>>>> On Wed, May 13, 2020, at 18:18, Sophie Blee-Goldman wrote:
>>>>>>>> G1:
>>>>>>>> I was considering it as the "end-to-end latency *up* to the
>>>>>>>> specific task", but I'm happy with "record-staleness" if that
>>>>>>>> drives the point home better. So it's the "staleness of the
>>>>>>>> record when it is received by that task" -- will update the KIP.
>>>>>>>>
>>>>>>>> B1/J:
>>>>>>>> I'm struggling to imagine a case where the min would actually be
>>>>>>>> useful, rather than just intellectually interesting. I don't feel
>>>>>>>> strongly that we shouldn't add it, but that's why I didn't
>>>>>>>> include it from the start. Can you enlighten me with an example?
>>>>>>>>
>>>>>>>> I was also vaguely concerned about the overhead of adding multiple
>>>>>>>> percentile metrics. Do we have any data to indicate what kind of
>>>>>>>> performance hit we take on metrics computation?
>>>>>>>>
>>>>>>>> Also, not to be too pedantic, but the 50th percentile would be the
>>>>>>>> median, not the mean. Would you propose to add the mean *and* the
>>>>>>>> 50th percentile, or just one of the two?
>>>>>>>>
>>>>>>>> Thanks all!
>>>>>>>> Sophie
>>>>>>>>
>>>>>>>> On Wed, May 13, 2020 at 3:34 PM John Roesler <vvcep...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello all, and thanks for the KIP, Sophie.
>>>>>>>>>
>>>>>>>>> Just some comments on the discussion so far:
>>>>>>>>>
>>>>>>>>> B2/G1:
>>>>>>>>> In principle, it shouldn't matter whether we report "spans" or
>>>>>>>>> "end-to-end" latency. But in practice, some of the spans are
>>>>>>>>> pretty difficult to really measure (like time spent waiting in
>>>>>>>>> the topics, or time from the event happening to the ETL producer
>>>>>>>>> choosing to send it, or time spent in send/receive buffers,
>>>>>>>>> etc., etc.).
>>>>>>>>>
>>>>>>>>> In other words, it's practically easier to compute spans by
>>>>>>>>> subtracting e2e latencies than it is to compute e2e latencies by
>>>>>>>>> adding spans. You can even just consider that the span
>>>>>>>>> computation from e2e always just involves subtracting two
>>>>>>>>> numbers, whereas computing e2e latency from spans involves
>>>>>>>>> adding _all_ the spans leading up to the end you care about.
>>>>>>>>>
>>>>>>>>> It seems like people really prefer to have spans when they are
>>>>>>>>> debugging latency problems, whereas e2e latency is a more general
>>>>>>>>> measurement that basically every person/application cares about
>>>>>>>>> and should be monitoring.
>>>>>>>>>
>>>>>>>>> Altogether, it really seems to provide more value to more people
>>>>>>>>> if we report e2e latencies. Regarding "record-staleness" as a
>>>>>>>>> name, I think I have no preference; I'd defer to other peoples'
>>>>>>>>> intuition.
>>>>>>>>>
>>>>>>>>> G2:
>>>>>>>>> I think the processor-node metric is nice, since the inside of a
>>>>>>>>> task can introduce a significant amount of latency in some
>>>>>>>>> cases. Plus, it's a more direct measurement, if you really
>>>>>>>>> wanted to know (for the purposes of IQ or something) how long it
>>>>>>>>> takes source events to "show up" at the store.
>>>>>>>>>
>>>>>>>>> I think actually recording it at every processor could be
>>>>>>>>> expensive, but we already record a bunch of metrics at the node
>>>>>>>>> level.
>>>>>>>>>
>>>>>>>>> B1:
>>>>>>>>> I think 50% could be reasonable to record also. Even if it's a
>>>>>>>>> poor metric for operational purposes, a lot of people might
>>>>>>>>> expect to see "mean". Actually, I was surprised not to see
>>>>>>>>> "min". Is there a reason to leave it off?
>>>>>>>>>
>>>>>>>>> I might suggest:
>>>>>>>>> min, mean (50th), 75th, 99th, max
>>>>>>>>>
>>>>>>>>> B3:
>>>>>>>>> I agree we should include late records (though not the ones we
>>>>>>>>> drop). It may be spiky, but only when there are legitimately
>>>>>>>>> some records with a high end-to-end latency, which is the whole
>>>>>>>>> point of these metrics.
>>>>>>>>>
>>>>>>>>> That's it! I don't think I have any other feedback, other than a
>>>>>>>>> request to also report "min".
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>> On Wed, May 13, 2020, at 16:58, Guozhang Wang wrote:
>>>>>>>>>> Thanks Sophie for the KIP, a few quick thoughts:
>>>>>>>>>>
>>>>>>>>>> 1) The end-to-end latency includes both the processing latency
>>>>>>>>>> of the task and the latency spent sitting in intermediate
>>>>>>>>>> topics. I have a similar feeling as Boyang mentioned above that
>>>>>>>>>> the latency metric of a task A actually measures the latency of
>>>>>>>>>> the sub-topology up to, but not including, the processing of A,
>>>>>>>>>> which is a bit weird.
>>>>>>>>>>
>>>>>>>>>> Maybe my feeling comes from the name "latency" itself, since
>>>>>>>>>> today we already have several "latency" metrics which are
>>>>>>>>>> measuring elapsed system-time for processing a record / etc.,
>>>>>>>>>> while here we are comparing the system wallclock time with the
>>>>>>>>>> record timestamp.
>>>>>>>>>>
>>>>>>>>>> Maybe we can consider renaming it as "record-staleness" (note
>>>>>>>>>> we already have a "record-lateness" metric), in which case
>>>>>>>>>> recording at the system-time before we start processing the
>>>>>>>>>> record sounds more natural.
>>>>>>>>>>
>>>>>>>>>> 2) With that in mind, I'm wondering if the processor-node-level
>>>>>>>>>> DEBUG metric is worth adding, given that we already have a
>>>>>>>>>> task-level processing latency metric.
>>>>>>>>>> Basically, a specific node's e2e latency is similar to the
>>>>>>>>>> task-level e2e latency + task-level processing latency.
>>>>>>>>>> Personally I think having a task-level record-staleness metric
>>>>>>>>>> is sufficient.
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>> On Wed, May 13, 2020 at 11:46 AM Sophie Blee-Goldman
>>>>>>>>>> <sop...@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> 1. I felt that 50% was not a particularly useful gauge for
>>>>>>>>>>> this specific metric, as it's presumably most useful at
>>>>>>>>>>> putting an *upper* bound on the latency you can reasonably
>>>>>>>>>>> expect to see. I chose percentiles that would hopefully give a
>>>>>>>>>>> good sense of what *most* records will experience, and what
>>>>>>>>>>> *close to all* records will.
>>>>>>>>>>>
>>>>>>>>>>> However I'm not married to these specific numbers and could be
>>>>>>>>>>> convinced. Would be especially interested in hearing from
>>>>>>>>>>> users on this.
>>>>>>>>>>>
>>>>>>>>>>> 2. I'm inclined to not include the "hop-to-hop latency" in
>>>>>>>>>>> this KIP since users can always compute it themselves by
>>>>>>>>>>> subtracting the previous node's end-to-end latency. I guess we
>>>>>>>>>>> could do it either way since you can always compute one from
>>>>>>>>>>> the other, but I think the end-to-end latency feels more
>>>>>>>>>>> valuable, as its main motivation is not to debug bottlenecks
>>>>>>>>>>> in the topology but to give users a sense of how long it takes
>>>>>>>>>>> a record to be reflected in certain parts of the topology. For
>>>>>>>>>>> example this might be useful for users who are wondering
>>>>>>>>>>> roughly when a record that was just produced will be included
>>>>>>>>>>> in their IQ results. Debugging is just a nice side effect --
>>>>>>>>>>> but maybe I didn't make that clear enough in the KIP's
>>>>>>>>>>> motivation.
>>>>>>>>>>>
>>>>>>>>>>> 3. Good question, I should address this in the KIP. The short
>>>>>>>>>>> answer is "yes", we will include late records. I added a
>>>>>>>>>>> paragraph to the end of the Proposed Changes section
>>>>>>>>>>> explaining the reasoning here, please let me know if you have
>>>>>>>>>>> any concerns.
>>>>>>>>>>>
>>>>>>>>>>> 4. Assuming you're referring to the existing metric
>>>>>>>>>>> "process-latency", that metric reflects the time for the
>>>>>>>>>>> literal Node#process method to run, whereas this metric would
>>>>>>>>>>> always be measured relative to the event timestamp.
>>>>>>>>>>>
>>>>>>>>>>> That said, the naming collision there is pretty confusing, so
>>>>>>>>>>> I've renamed the metrics in this KIP to "end-to-end-latency",
>>>>>>>>>>> which I feel better reflects the nature of the metric anyway.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the feedback!
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 13, 2020 at 10:21 AM Boyang Chen
>>>>>>>>>>> <reluctanthero...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the KIP, Sophie.
>>>>>>>>>>>> Getting the E2E latency is important for understanding the
>>>>>>>>>>>> bottleneck of the application.
>>>>>>>>>>>>
>>>>>>>>>>>> A couple of questions and ideas:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Could you clarify the rationale of picking the 75th, 99th,
>>>>>>>>>>>> and max percentiles? Normally I see cases where we use the
>>>>>>>>>>>> 50th and 90th percentiles as well in production systems.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. The current latency being computed is cumulative, i.e. if
>>>>>>>>>>>> a record goes through A -> B -> C, then
>>>>>>>>>>>> P(C) = T(B->C) + P(B) = T(B->C) + T(A->B) + P(A) and so on,
>>>>>>>>>>>> where P() represents the captured latency, and T() represents
>>>>>>>>>>>> the time for transiting the records between two nodes,
>>>>>>>>>>>> including processing time. For monitoring purposes, maybe
>>>>>>>>>>>> having T(B->C) and T(A->B) is more natural to view as
>>>>>>>>>>>> "hop-to-hop latency"; otherwise, if there is a spike in
>>>>>>>>>>>> T(A->B), both P(B) and P(C) are affected at the same time. In
>>>>>>>>>>>> the same spirit, the E2E latency is meaningful only when the
>>>>>>>>>>>> record exits from the sink, as this marks the whole time this
>>>>>>>>>>>> record spent inside the funnel. Do you think we could have
>>>>>>>>>>>> separate treatment for sink nodes and other nodes, so that
>>>>>>>>>>>> other nodes only count the time receiving the record from the
>>>>>>>>>>>> last hop? I'm not proposing a solution here, just want to
>>>>>>>>>>>> discuss this alternative to see if it is reasonable.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. As we are going to monitor late arrival records as well,
>>>>>>>>>>>> they would create some really spiky graphs when the
>>>>>>>>>>>> out-of-order records are interleaving with on-time records.
>>>>>>>>>>>> Should we also supply a smoothed version of the latency
>>>>>>>>>>>> metrics, or should users just take care of it by themselves?
>>>>>>>>>>>>
>>>>>>>>>>>> 4. Regarding these new metrics, we haven't discussed their
>>>>>>>>>>>> relation to our existing processing latency metrics; could
>>>>>>>>>>>> you add some context on the comparison and a simple `when to
>>>>>>>>>>>> use which` guide?
>>>>>>>>>>>>
>>>>>>>>>>>> Boyang
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 12, 2020 at 7:28 PM Sophie Blee-Goldman
>>>>>>>>>>>> <sop...@confluent.io> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to kick off discussion on KIP-613, which aims to
>>>>>>>>>>>>> add end-to-end latency metrics to Streams. Please take a
>>>>>>>>>>>>> look:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
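P.S.: A worked example for the spans-vs-e2e point in the thread above,
with made-up numbers: if a record is created at t=0ms, received by
task A at t=50ms, and received by downstream task B at t=120ms, the
proposed metric would report P(A)=50ms and P(B)=120ms, and the hop
span T(A->B) falls out as a simple difference, P(B) - P(A) = 70ms.
Computing P(B) from spans instead would require summing every span
upstream of B, which is why reporting the e2e numbers and letting
users subtract seems preferable.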