shangxinli commented on issue #17512:
URL: https://github.com/apache/hudi/issues/17512#issuecomment-4488889808
## Phased implementation plan
**Phase 1 — Per-partition exposure (low risk, additive only).**
- Add `HoodieCommitMetadata.getMinAndMaxEventTimePerPartition()` as a pure
aggregation over `partitionToWriteStats`.
- Decouple `isTrackingEventTimeWatermark` from `EVENT_TIME_ORDERING` in
`HoodieWriteHandle` and the Flink `HoodieRowDataCreateHandle` equivalent.
- No new config, no Avro schema change, and no behavior change for tables that
don't set the flag.
- Tests: unit tests for the rollup; an integration test confirming that a COW
table with `COMMIT_TIME_ORDERING` and an event-time field now populates
min/max on each write stat.
- Ship target: 1.1.x patch / 1.2.0.
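
A minimal sketch of what the Phase 1 rollup could look like, assuming each write stat carries a nullable min and max event time. `WriteStat`, `Range`, and `EventTimeRollup` are hypothetical stand-ins written for this example, not Hudi classes; only the `partitionToWriteStats` shape and the method name come from the plan above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a per-partition event-time rollup; not Hudi's actual code. */
class EventTimeRollup {

  /** Stand-in for the event-time fields on a write stat (null means "not tracked"). */
  static final class WriteStat {
    final Long minEventTime;
    final Long maxEventTime;

    WriteStat(Long minEventTime, Long maxEventTime) {
      this.minEventTime = minEventTime;
      this.maxEventTime = maxEventTime;
    }
  }

  /** Closed [min, max] event-time range for one partition. */
  static final class Range {
    final long min;
    final long max;

    Range(long min, long max) {
      this.min = min;
      this.max = max;
    }

    @Override
    public String toString() {
      return "[" + min + ", " + max + "]";
    }
  }

  /**
   * Pure aggregation: for each partition, fold min/max over its write stats.
   * Partitions with no usable event time are omitted rather than given a made-up range.
   */
  static Map<String, Range> minAndMaxEventTimePerPartition(
      Map<String, List<WriteStat>> partitionToWriteStats) {
    Map<String, Range> result = new TreeMap<>();
    for (Map.Entry<String, List<WriteStat>> entry : partitionToWriteStats.entrySet()) {
      Long min = null;
      Long max = null;
      for (WriteStat stat : entry.getValue()) {
        if (stat.minEventTime != null) {
          min = (min == null) ? stat.minEventTime : Math.min(min, stat.minEventTime);
        }
        if (stat.maxEventTime != null) {
          max = (max == null) ? stat.maxEventTime : Math.max(max, stat.maxEventTime);
        }
      }
      if (min != null && max != null) {
        result.put(entry.getKey(), new Range(min, max));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<WriteStat>> stats = new LinkedHashMap<>();
    stats.put("2024/01/01", Arrays.asList(new WriteStat(100L, 250L), new WriteStat(90L, 180L)));
    stats.put("2024/01/02", Arrays.asList(new WriteStat(null, null)));
    // prints {2024/01/01=[90, 250]}
    System.out.println(minAndMaxEventTimePerPartition(stats));
  }
}
```

Leaving untracked partitions out of the result, instead of emitting a placeholder range, is a choice made for this sketch; the real method could equally return an empty optional per partition.
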
**Phase 2 — Upstream propagation (the substantive new piece).**
- New config: `hoodie.write.track.event.time.propagate.from.upstream`
(default `false`, advanced).
- Wire into `HoodieIncrSource` first (most common derived-table pattern),
then Spark SQL Hudi source, then Flink source.
- Inheritance rule: for each destination partition, take the min/max across
all upstream partitions that contributed to it; if multiple upstream commits
feed one downstream commit, fold the min/max across those commits as well.
- Edge cases to settle before coding:
  - Deletes and clean operations: do they count as a freshness signal, or
    should they be skipped?
  - Inserts with no upstream lineage: no-op (don't fabricate a watermark).
  - Back-fills: propagation must not regress the watermark; this needs either
    explicit max-only semantics or a per-write skip flag.
- Tests: end-to-end two-hop pipeline (Kafka → Hudi raw → Hudi derived)
asserting the derived commit inherits raw watermarks per-partition.
- Ship target: 1.2.0 or 1.3.0 depending on review pace.
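
The inheritance rule in Phase 2 can be sketched as a fold over upstream contributions. This is an illustration under assumptions: `Contribution`, `Range`, and `WatermarkPropagation` are hypothetical names, and how a source would actually discover which upstream partitions fed a destination partition is not specified in the plan and is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the proposed inheritance rule; none of these types exist in Hudi. */
class WatermarkPropagation {

  /** Closed [min, max] event-time range. */
  static final class Range {
    final long min;
    final long max;

    Range(long min, long max) {
      this.min = min;
      this.max = max;
    }

    /** Smallest range that covers both this range and the other one. */
    Range union(Range other) {
      return new Range(Math.min(min, other.min), Math.max(max, other.max));
    }

    @Override
    public String toString() {
      return "[" + min + ", " + max + "]";
    }
  }

  /** One upstream partition's range, from one upstream commit, that fed a destination partition. */
  static final class Contribution {
    final String destinationPartition;
    final Range upstreamRange;

    Contribution(String destinationPartition, Range upstreamRange) {
      this.destinationPartition = destinationPartition;
      this.upstreamRange = upstreamRange;
    }
  }

  /**
   * Per destination partition, fold min/max across every contribution.
   * Contributions from several upstream commits are folded in the same pass.
   * No contributions means no entry: a watermark is never fabricated.
   */
  static Map<String, Range> inherit(List<Contribution> contributions) {
    Map<String, Range> inherited = new TreeMap<>();
    for (Contribution c : contributions) {
      inherited.merge(c.destinationPartition, c.upstreamRange, Range::union);
    }
    return inherited;
  }

  public static void main(String[] args) {
    List<Contribution> contributions = Arrays.asList(
        new Contribution("dt=2024-01-01", new Range(100L, 200L)),  // upstream commit 1
        new Contribution("dt=2024-01-01", new Range(150L, 320L)),  // upstream commit 2
        new Contribution("dt=2024-01-02", new Range(400L, 450L)));
    // prints {dt=2024-01-01=[100, 320], dt=2024-01-02=[400, 450]}
    System.out.println(inherit(contributions));
    // prints {} (inserts with no upstream lineage are a no-op)
    System.out.println(inherit(new ArrayList<Contribution>()));
  }
}
```

The sketch deliberately leaves out deletes and back-fills, since the plan lists their semantics as still open.
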
**Phase 3 — Naming and docs (polish).**
- Alias `hoodie.write.track.event.time.watermark` →
`hoodie.write.track.freshness.enable` if a more user-facing name is wanted;
deprecate the old key over two minor versions, but don't remove it.
- Public doc page covering the per-partition API, propagation behavior, and
known limits (no event-time field → no watermark; merge-mode behavior;
back-fill semantics).
- Ship alongside whichever release carries Phase 2.
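
One way the alias could behave during the deprecation window, sketched with plain `java.util.Properties` rather than Hudi's own config classes. `FreshnessConfigAlias` and `isTrackingEnabled` are hypothetical names, and the precedence (new key wins) and the `false` default are assumptions of this sketch; only the two key strings come from the plan above.

```java
import java.util.Properties;

/** Hypothetical sketch of alias resolution for a renamed config key; not Hudi's actual code. */
class FreshnessConfigAlias {

  static final String NEW_KEY = "hoodie.write.track.freshness.enable";
  static final String OLD_KEY = "hoodie.write.track.event.time.watermark";

  /**
   * Resolve the flag, preferring the new key and falling back to the old one.
   * The old key keeps working (deprecated, not removed) but logs a warning.
   */
  static boolean isTrackingEnabled(Properties props) {
    String value = props.getProperty(NEW_KEY);
    if (value == null) {
      value = props.getProperty(OLD_KEY);
      if (value != null) {
        System.err.println("WARN: '" + OLD_KEY + "' is deprecated; use '" + NEW_KEY + "'.");
      }
    }
    // Boolean.parseBoolean(null) is false, so an unset flag means "off" (assumed default).
    return Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(isTrackingEnabled(props));  // neither key set

    props.setProperty(OLD_KEY, "true");
    System.out.println(isTrackingEnabled(props));  // old key still honored, with a warning

    props.setProperty(NEW_KEY, "false");
    System.out.println(isTrackingEnabled(props));  // new key takes precedence
  }
}
```
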
**Sequencing rationale.** Phase 1 alone solves the original RFC's "expose
per-partition freshness" ask with effectively zero risk and can land
independently. If reviewers push back on propagation, Phases 1 + 3 still close
most of the gap and Phase 2 can be revisited as its own RFC without blocking
the observability win.