Hi all,

Measuring data latency in Hudi tables is a common requirement, but
HoodieMetrics does not report latency directly. I'm proposing to measure
the latency of each commit using this formula:

latency = commitTime + commitDuration - earliest event time of the incoming
records
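
As a quick illustration of the formula, here is a minimal sketch. It
assumes all three values are already expressed in milliseconds, with the
commit time and the event time as epoch timestamps (a commit instant
string would need to be parsed first). The class and method names are
hypothetical and are not part of Hudi.

```java
// Hypothetical helper, not a Hudi class: it only illustrates the arithmetic.
final class CommitLatencySketch {
    private CommitLatencySketch() {}

    /**
     * latency = commitTime + commitDuration - earliest event time.
     * Assumption: commitTimeMs and earliestEventTimeMs are epoch milliseconds,
     * and commitDurationMs is a duration in milliseconds.
     */
    static long latencyMs(long commitTimeMs, long commitDurationMs, long earliestEventTimeMs) {
        return commitTimeMs + commitDurationMs - earliestEventTimeMs;
    }
}
```

For example, a commit that starts at t = 1,000,000 ms, takes 5,000 ms, and
contains a record whose event time is 940,000 ms would report a latency of
65,000 ms.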

Making this available takes four main changes (thanks to Vinoth's hints):

- To store the earliest event time, we need to extract the event times from
Hoodie payloads. We can expose them through
org.apache.hudi.common.model.DefaultHoodieRecordPayload#getMetadata()

- then org.apache.hudi.client.WriteStatus#markSuccess() can perform the
comparison and store the min value
in org.apache.hudi.common.model.HoodieWriteStat

- org.apache.hudi.common.model.HoodieCommitMetadata can then aggregate all
the min values and return a global min across all the partitions.

- lastly, in org.apache.hudi.metrics.HoodieMetrics#updateCommitMetrics we
can compute the latency using the formula above
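
The min-tracking and aggregation steps above could look roughly like the
sketch below. It is only an illustration under stated assumptions: the
classes are hypothetical stand-ins (not HoodieWriteStat or
HoodieCommitMetadata), event times are assumed to be epoch milliseconds,
and a record without an event time is simply skipped. The draft diff linked
below is the actual proposal.

```java
import java.util.List;
import java.util.Objects;
import java.util.OptionalLong;

// Hypothetical stand-ins for illustration only; none of these are Hudi classes.
final class MinEventTimeSketch {
    private MinEventTimeSketch() {}

    /** Plays the role of a per-file write stat that remembers the smallest event time seen. */
    static final class WriteStatSketch {
        private Long minEventTimeMs; // null until a record with an event time is observed

        /** Similar in spirit to markSuccess(): keep the smaller of the stored and incoming value. */
        void observe(Long eventTimeMs) {
            if (eventTimeMs == null) {
                return; // assumption: records without an event time are ignored
            }
            if (minEventTimeMs == null || eventTimeMs < minEventTimeMs) {
                minEventTimeMs = eventTimeMs;
            }
        }

        Long getMinEventTimeMs() {
            return minEventTimeMs;
        }
    }

    /** Plays the role of the commit-level aggregation: global min over all write stats. */
    static OptionalLong globalMin(List<WriteStatSketch> stats) {
        return stats.stream()
                .map(WriteStatSketch::getMinEventTimeMs)
                .filter(Objects::nonNull)
                .mapToLong(Long::longValue)
                .min();
    }
}
```

Returning an empty OptionalLong when no event time was recorded lets the
metrics step skip the latency metric for that commit instead of emitting a
misleading value.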

I have a draft implementation shown in the diff
https://github.com/apache/hudi/compare/master...xushiyan:measure-latency

I think this metric will be commonly used, so I made those changes in
default classes like DefaultHoodieRecordPayload and HoodieWriteStat. I hope
to get some early feedback on the implementation. Thank you.

Best,
Raymond
