codope commented on issue #14869: URL: https://github.com/apache/hudi/issues/14869#issuecomment-4524807700
@praneethkaturi thanks for digging into this issue. Option 2 makes the most sense to me, and it would be a very useful contribution:

> 2. Just make the existing query faster. Same SQL stays a snapshot query, but Hudi uses the _hoodie_commit_time predicate to skip files that can't possibly match.

The JIRA issue is old, and I don't think we had data skipping based on meta columns at the time. Option 2 is a natural extension of what Hudi already does. `_hoodie_commit_time` is already in [META_COLS_TO_ALWAYS_INDEX](https://github.com/apache/hudi/blob/e299b84b10b5ccd4ee5c75e541f0109a85549d7a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java#L1613), so the column stats partition of the metadata table already tracks min/max for it per file. The work is wiring `HoodieFileIndex` / `ColumnStatsIndexSupport` (and `PartitionStatsIndexSupport`) to honor a `_hoodie_commit_time` predicate the way they honor any other range predicate, and then verifying that the same applies to log files for MOR.

This helps two of Hudi's core workloads at once:

- Snapshot reads with a "since" filter (very common in batch ETL backfills and audit queries) become file-pruned.
- Incremental reads with additional commit-time filters get tighter pruning on top of the timeline-driven file selection.

Regarding engine scope, let's start with Spark. The metadata table column_stats path and `HoodieFileIndex` are already mature there, and that's where the ticket originated (DeltaStreamer). Flink should follow as a separate PR; the underlying metadata is engine-independent.
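To make the pruning idea concrete, here is a minimal, engine-agnostic sketch of min/max skipping on a commit-time range. It is not Hudi code: `FileStats`, `prune_files`, and the file names are hypothetical, and it assumes commit times are fixed-width timestamp strings that can be compared lexicographically. It only models base files; log files for MOR would need the same stats to be covered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file min/max of _hoodie_commit_time (not a Hudi class)."""
    path: str
    min_commit_time: str
    max_commit_time: str


def prune_files(
    files: List[FileStats],
    lower: Optional[str] = None,
    upper: Optional[str] = None,
) -> List[FileStats]:
    """Keep files whose [min, max] range can overlap the inclusive [lower, upper] range.

    Assumes commit times are fixed-width strings, so string order equals time order.
    """
    kept = []
    for f in files:
        # Every row in the file is older than the lower bound: cannot match.
        if lower is not None and f.max_commit_time < lower:
            continue
        # Every row in the file is newer than the upper bound: cannot match.
        if upper is not None and f.min_commit_time > upper:
            continue
        kept.append(f)
    return kept


# Illustrative stats only; the values are made up for this sketch.
files = [
    FileStats("f1.parquet", "20240101000000000", "20240105000000000"),
    FileStats("f2.parquet", "20240103000000000", "20240110000000000"),
    FileStats("f3.parquet", "20240111000000000", "20240112000000000"),
]

# Equivalent of: WHERE _hoodie_commit_time >= '20240106000000000'
candidates = prune_files(files, lower="20240106000000000")
print([f.path for f in candidates])  # ['f2.parquet', 'f3.parquet']
```

Note that pruning is conservative: `f2.parquet` is kept because its range overlaps the predicate, even though some of its rows are older than the bound, so the row-level filter still has to run on the surviving files.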
