ableegoldman commented on pull request #9094:
URL: https://github.com/apache/kafka/pull/9094#issuecomment-670281342


   Hey @guozhangwang,
   Thanks for the review. You have some high-level questions, so I'll try to 
answer them here, but let me know if you want to sync offline.
   
   > the only scenario where task-level e2e latency would be different from 
store-level would be suppression itself
   
   Yes, since the task-level metrics are at INFO level and use the cached 
system time for performance reasons.
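   To make the cached-time point concrete, here is a minimal sketch of the 
two measurement styles. All names are illustrative, not the actual Kafka 
Streams internals: the idea is just that e2e latency is "now" minus the 
record's event timestamp, and the INFO-level variant reuses a clock value 
cached once per processing loop instead of reading the wall clock per record.

   ```java
   // Hypothetical sketch (names are illustrative, not the real Kafka Streams API).
   class E2eLatencySketch {
       private long cachedNowMs; // refreshed once per poll/process iteration

       // Called once per loop iteration, so per-record recording is free.
       void advanceCachedTime(long nowMs) {
           this.cachedNowMs = nowMs;
       }

       // Task-level (INFO): cheap, uses the cached wall-clock time.
       long taskLevelE2eLatencyMs(long recordTimestampMs) {
           return cachedNowMs - recordTimestampMs;
       }

       // Store-level (TRACE): precise, reads the real clock on every record.
       long storeLevelE2eLatencyMs(long recordTimestampMs) {
           return System.currentTimeMillis() - recordTimestampMs;
       }
   }
   ```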
   
   > So what about we only define it at the suppression node as a node-level 
metric still, and we record it whenever a record is suppressed for emitting at 
the moment
   
   Well, we'd want to record the latency at the time it was suppressed, but 
also at the time it was actually emitted, right? The former represents the e2e 
latency upstream of the suppression and the latter represents the e2e latency 
downstream of it, both of which are useful if you're querying state stores on 
both ends. But isn't this exactly the same thing as recording it at the task 
level? Especially if we use the cached time for all INFO level metrics, it 
should be literally the same.
   Technically that's only true if you have a single suppression per 
subtopology, but IIUC that's where we were headed anyway (and presumably few 
people have multiple suppressions per subtopology to begin with).
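   The "record it twice" idea above can be sketched as follows. This is a 
hypothetical illustration, not the real suppression-buffer code: the upstream 
latency is recorded when a record enters the buffer, and the downstream 
latency when it is finally emitted, so the emit-time measurement includes the 
time spent suppressed.

   ```java
   import java.util.ArrayDeque;
   import java.util.Deque;

   // Hypothetical suppression-node sketch; all names are illustrative.
   class SuppressionLatencySketch {
       record Buffered(long recordTimestampMs) {}

       private final Deque<Buffered> buffer = new ArrayDeque<>();
       long lastUpstreamLatencyMs;   // stand-in for a metrics sensor
       long lastDownstreamLatencyMs; // stand-in for a metrics sensor

       void suppress(long recordTimestampMs, long nowMs) {
           // Upstream e2e latency: record age when it reaches the suppression node.
           lastUpstreamLatencyMs = nowMs - recordTimestampMs;
           buffer.addLast(new Buffered(recordTimestampMs));
       }

       void emitOldest(long nowMs) {
           Buffered b = buffer.removeFirst();
           // Downstream e2e latency: record age at emit time, which includes
           // however long the record sat in the suppression buffer.
           lastDownstreamLatencyMs = nowMs - b.recordTimestampMs();
       }
   }
   ```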
   
   > I'm a bit leaning towards adding the task level metrics first
   
   The task-level metrics have already been added. They're in 2.6, so that 
ship has sailed 🙂.
   
   Just to clarify, are you proposing the suppression-level metrics instead of 
the task-level metrics or instead of the store-level metrics? As noted above, 
they seem equivalent to the task-level metrics. On the other hand, the 
store-level metrics go beyond both of these: we introduced TRACE metrics so we 
could measure e2e latency relative to the actual (not cached) system time, 
without feeling guilty about the performance hit. So the store level metrics 
will give a more fine-grained view of the actual e2e latency and reveal 
intra-subtopology latencies, not just at the input/output. For example, a 
user might make multiple accesses to different stores within a processor, or 
make remote API calls (not that we support this, but it seems to be 
unfortunately common).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

