Re: [I] use commit_time in the WHERE STATEMENT to optimize the incremental query [hudi]

via GitHub Sat, 23 May 2026 01:26:16 -0700


codope commented on issue #14869:
URL: https://github.com/apache/hudi/issues/14869#issuecomment-4524807700

@praneethkaturi thanks for digging into this issue. I think Option 2 makes
the most sense, and it would also be a very useful contribution:

> 2. Just make the existing query faster. Same SQL stays a snapshot query,
but Hudi uses the _hoodie_commit_time predicate to skip files that can't
possibly match.

The JIRA issue is old, and at the time I don't think we had data skipping
based on meta columns. Option 2 is a natural extension of what Hudi already
does. `_hoodie_commit_time` is already in
[META_COLS_TO_ALWAYS_INDEX](https://github.com/apache/hudi/blob/e299b84b10b5ccd4ee5c75e541f0109a85549d7a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java#L1613),
so the column stats partition of the metadata table already tracks min/max for
it per file. The work is wiring `HoodieFileIndex` / `ColumnStatsIndexSupport` (and
`PartitionStatsIndexSupport`) to honor a `_hoodie_commit_time` predicate the way
they honor any other range predicate, and then verifying the same applies to
log files for MOR. This helps Hudi's two core workloads simultaneously:
snapshot reads with a "since" filter (very common in batch ETL backfills and
audit queries) become file-pruned, and incremental reads with additional
commit-time filters get tighter pruning on top of the timeline-driven file
selection.
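
To make the pruning idea concrete, here is a minimal sketch of min/max file skipping on a commit-time range. It is not Hudi code: `FileStats` and `prune_files` are hypothetical names used only for illustration, and the sketch assumes commit times are fixed-width timestamp strings that sort lexicographically. The real change would live in the index support classes named above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file min/max record for _hoodie_commit_time."""
    path: str
    min_commit_time: str
    max_commit_time: str


def prune_files(
    files: List[FileStats],
    lower: Optional[str] = None,   # predicate: _hoodie_commit_time > lower
    upper: Optional[str] = None,   # predicate: _hoodie_commit_time <= upper
) -> List[FileStats]:
    """Keep only files whose [min, max] range could contain a matching row.

    Assumes fixed-width timestamp strings, so plain string comparison
    agrees with chronological order.
    """
    kept = []
    for f in files:
        # Every row in the file is at or before `lower`: nothing can match.
        if lower is not None and f.max_commit_time <= lower:
            continue
        # Every row in the file is after `upper`: nothing can match.
        if upper is not None and f.min_commit_time > upper:
            continue
        kept.append(f)
    return kept


files = [
    FileStats("a.parquet", "20260101000000000", "20260110000000000"),
    FileStats("b.parquet", "20260105000000000", "20260201000000000"),
    FileStats("c.parquet", "20260301000000000", "20260302000000000"),
]

# "Since" filter: WHERE _hoodie_commit_time > '20260115000000000'
since = prune_files(files, lower="20260115000000000")

# Bounded range: adds AND _hoodie_commit_time <= '20260215000000000'
bounded = prune_files(files, lower="20260115000000000", upper="20260215000000000")
```

With these inputs, `a.parquet` is skipped by the "since" filter because its max commit time is below the bound, and `c.parquet` is additionally skipped once the upper bound is added. Pruning is conservative: a file that is kept may still contain no matching rows, so the predicate still has to be evaluated on the rows that are read.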

Regarding engine scope, let's start with Spark. The metadata table
column_stats path and `HoodieFileIndex` are already very mature there, and
that's where the ticket originated (DeltaStreamer). Flink should follow as a
separate PR; the underlying metadata is engine-independent.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
