ableegoldman commented on pull request #9094: URL: https://github.com/apache/kafka/pull/9094#issuecomment-670281342
Hey @guozhangwang Thanks for the review. You have some high-level questions, so I'll try to answer them here, but let me know if you want to sync offline.

> the only scenario that task-level e2e latency be different from store level is would be suppression itself

Yes, since the task-level metrics are at INFO level and use the cached system time for performance reasons.

> So what about we only define it at the suppression node as a node-level metric still, and we record it whenever a record is suppressed for emitting at the moment

Well, we'd want to record the latency at the time the record was suppressed, but also at the time it was actually emitted, right? The former represents the e2e latency upstream of the suppression, and the latter represents the e2e latency downstream of it; both are useful if you're querying state stores on both ends. But isn't this exactly the same thing as recording it at the task level? Especially if we use the cached time for all INFO-level metrics, it should be literally the same. Technically that's only true if you have a single suppression per subtopology, but IIUC that's where we were headed anyway (and presumably few people have multiple suppressions per subtopology to begin with).

> I'm a bit leaning towards adding the task level metrics first

The task-level metrics have already been added. They're in 2.6, so that ship has sailed 🙂. Just to clarify, are you proposing the suppression-level metrics instead of the task-level metrics, or instead of the store-level metrics? As noted above, they seem equivalent to the task-level metrics. On the other hand, the store-level metrics go beyond both of these: we introduced TRACE-level metrics so we could measure e2e latency relative to the actual (not cached) system time, without feeling guilty about the performance hit. So the store-level metrics will give a more fine-grained view of the actual e2e latency and reveal intra-subtopology latencies, not just latencies at the input/output.
For example, a user might make multiple accesses to different stores within a single processor, or make remote API calls (not that we support this, but it seems to be unfortunately common).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
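To make the cached-vs-actual distinction concrete, here's a minimal sketch (hypothetical names, not Kafka's actual internals) showing that the two flavors of e2e latency differ only in which clock reading is subtracted from: INFO-level task metrics reuse a cached wall-clock reading, while TRACE-level store metrics re-read the clock at each measurement point.

```java
// Hedged illustration only: class and method names are hypothetical,
// not Kafka Streams' real API.
public class E2eLatencySketch {

    // e2e latency = wall-clock time when the record is observed
    //               minus the record's event timestamp
    static long e2eLatencyMs(long nowMs, long recordTimestampMs) {
        return nowMs - recordTimestampMs;
    }

    public static void main(String[] args) {
        long recordTs = 1_000L;

        // INFO-level (task) metrics: "now" is a cached system time,
        // fetched once and reused, so it's cheap but slightly stale.
        long cachedNow = 1_250L;

        // TRACE-level (store) metrics: "now" is re-read from the actual
        // clock at the store access, so intra-subtopology latency
        // (e.g. multiple store accesses in one processor) is visible.
        long actualNow = 1_260L;

        System.out.println(e2eLatencyMs(cachedNow, recordTs)); // 250
        System.out.println(e2eLatencyMs(actualNow, recordTs)); // 260
    }
}
```

With a single measurement point per subtopology the two readings are effectively interchangeable; the store-level TRACE metrics only add information when there are multiple clock reads between the input and output.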