+1. This sounds great, but in real business scenarios data anomalies can
make the event time inaccurate.
Wouldn't that affect how this metric is monitored?
Best,
liujinhui
------------------ Original Message ------------------
From: "dev" <[email protected]>;
Sent: Wednesday, February 3, 2021, 9:55
To: "dev" <[email protected]>;
Subject: [DISCUSS] Measure latency by storing event time in WriteStatus
Hi all,
Measuring data latency is a common requirement for Hudi tables, but
HoodieMetrics does not report latency directly. I propose measuring the
latency of each commit with this formula:
latency = commitTime + commitDuration - earliest event time of the incoming
records
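
As a quick worked instance of the formula, here is a minimal sketch that
assumes all three quantities are available as epoch milliseconds. The class
and method names are made up for illustration and are not part of Hudi.

```java
public class CommitLatencyExample {

    // latency = commitTime + commitDuration - earliest event time.
    // All arguments are assumed to be epoch milliseconds (an assumption of
    // this sketch, not something the proposal fixes).
    static long latencyMillis(long commitTimeMs, long commitDurationMs, long earliestEventTimeMs) {
        return commitTimeMs + commitDurationMs - earliestEventTimeMs;
    }

    public static void main(String[] args) {
        long commitTimeMs = 100_000L;       // when the commit started
        long commitDurationMs = 30_000L;    // how long the commit took
        long earliestEventTimeMs = 40_000L; // oldest event in the incoming batch

        // 100000 + 30000 - 40000 = 90000
        System.out.println(latencyMillis(commitTimeMs, commitDurationMs, earliestEventTimeMs));
    }
}
```

In other words, the metric is the age of the oldest incoming record at the
moment the commit completes.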
There are 4 major parts to making this available (thanks to Vinoth's hints):
- To store the earliest event time, we need to extract the event times from
Hoodie payloads. We can make it available in
org.apache.hudi.common.model.DefaultHoodieRecordPayload#getMetadata()
- then org.apache.hudi.client.WriteStatus#markSuccess() can perform the
comparison and store the min value
in org.apache.hudi.common.model.HoodieWriteStat
- org.apache.hudi.common.model.HoodieCommitMetadata can then aggregate all
the min values and return a global min across all the partitions.
- lastly, in org.apache.hudi.metrics.HoodieMetrics#updateCommitMetrics we
can compute the latency using the formula above
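
The four steps above can be sketched roughly as follows. This is a
standalone illustration, not the draft implementation: WriteStatusSketch,
the "eventTimeMs" metadata key, and the helper methods are hypothetical
stand-ins for the Hudi classes named above, and event times are assumed to
be epoch milliseconds carried as strings in the record metadata.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

public class EventTimeLatencySketch {

    // Hypothetical metadata key; the real key name is up to the implementation.
    static final String EVENT_TIME_KEY = "eventTimeMs";

    // Stand-in for a per-file write status that tracks the min event time seen.
    static class WriteStatusSketch {
        private long minEventTimeMs = Long.MAX_VALUE;

        // Called once per successfully written record with its payload metadata.
        // Records without an event time are skipped.
        void markSuccess(Map<String, String> recordMetadata) {
            String raw = recordMetadata.get(EVENT_TIME_KEY);
            if (raw != null) {
                minEventTimeMs = Math.min(minEventTimeMs, Long.parseLong(raw));
            }
        }

        OptionalLong minEventTime() {
            return minEventTimeMs == Long.MAX_VALUE
                    ? OptionalLong.empty()
                    : OptionalLong.of(minEventTimeMs);
        }
    }

    // Stand-in for the commit-level aggregation: global min over all write statuses.
    static OptionalLong globalMinEventTime(List<WriteStatusSketch> statuses) {
        return statuses.stream()
                .map(WriteStatusSketch::minEventTime)
                .filter(OptionalLong::isPresent)
                .mapToLong(OptionalLong::getAsLong)
                .min();
    }

    // Stand-in for the metrics step: applies the latency formula, or returns
    // empty when no record in the commit carried an event time.
    static OptionalLong commitLatencyMs(long commitTimeMs, long commitDurationMs,
                                        List<WriteStatusSketch> statuses) {
        OptionalLong min = globalMinEventTime(statuses);
        return min.isPresent()
                ? OptionalLong.of(commitTimeMs + commitDurationMs - min.getAsLong())
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        WriteStatusSketch first = new WriteStatusSketch();
        first.markSuccess(Collections.singletonMap(EVENT_TIME_KEY, "40000"));
        first.markSuccess(Collections.singletonMap(EVENT_TIME_KEY, "55000"));

        WriteStatusSketch second = new WriteStatusSketch();
        second.markSuccess(Collections.singletonMap(EVENT_TIME_KEY, "35000"));
        second.markSuccess(Collections.<String, String>emptyMap()); // no event time

        OptionalLong latency = commitLatencyMs(100_000L, 30_000L, Arrays.asList(first, second));
        // Global min is 35000, so latency = 100000 + 30000 - 35000 = 95000.
        System.out.println(latency.getAsLong());
    }
}
```

Returning an empty value when no record carries an event time is one
possible design choice here; it avoids reporting a misleading latency for
such commits.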
I have a draft implementation, shown in this diff:
https://github.com/apache/hudi/compare/master...xushiyan:measure-latency
I expect this metric to be commonly used, so I made the changes in the
default classes such as DefaultHoodieRecordPayload and HoodieWriteStat. I
hope to get some early feedback on the implementation. Thank you.
Best,
Raymond