codope commented on issue #14869:
URL: https://github.com/apache/hudi/issues/14869#issuecomment-4524807700

   @praneethkaturi thanks for digging into this issue. Option 2 makes the 
most sense to me, and it would also be a very useful contribution:
   
   > 2. Just make the existing query faster. Same SQL stays a snapshot query, 
but Hudi uses the _hoodie_commit_time predicate to skip files that can't 
possibly match.
   
   The JIRA issue is old, and at the time I don't think we had data skipping 
based on meta columns. Option 2 is a natural extension of what Hudi already 
does. `_hoodie_commit_time` is already in 
[META_COLS_TO_ALWAYS_INDEX](https://github.com/apache/hudi/blob/e299b84b10b5ccd4ee5c75e541f0109a85549d7a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java#L1613),
 so the column stats partition of the metadata table already tracks min/max for 
it per file. The work is wiring HoodieFileIndex / ColumnStatsIndexSupport (and 
PartitionStatsIndexSupport) to honor a `_hoodie_commit_time` predicate the way 
they honor any other range predicate, and then verifying the same applies to 
log files for MOR. This helps Hudi's two core workloads simultaneously: 
snapshot reads with a "since" filter (very common in batch ETL backfills and 
audit queries) become file-pruned, and incremental reads with additional 
commit-time filters get tighter pruning on top of the timeline-driven file 
selection.
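
   To illustrate the pruning idea only, here is a minimal Python sketch. It 
is not Hudi code: `FileCommitTimeStats`, `may_match`, and `prune_files` are 
hypothetical names, and it assumes commit times are fixed-width instant 
strings that compare correctly as plain strings. Given the per-file min/max 
of `_hoodie_commit_time`, a file can be skipped when its range cannot overlap 
the predicate's range.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileCommitTimeStats:
    """Hypothetical stand-in for one per-file min/max entry (not a Hudi class)."""
    file_name: str
    min_commit_time: str
    max_commit_time: str


def may_match(stats: FileCommitTimeStats,
              lower: Optional[str] = None,
              upper: Optional[str] = None) -> bool:
    """Return True if the file may hold rows with lower <= commit time <= upper.

    Assumes fixed-width instant strings, so string comparison matches
    chronological order. A bound of None means "unbounded on that side".
    """
    if lower is not None and stats.max_commit_time < lower:
        return False  # every row in the file is older than the lower bound
    if upper is not None and stats.min_commit_time > upper:
        return False  # every row in the file is newer than the upper bound
    return True


def prune_files(all_stats: Iterable[FileCommitTimeStats],
                lower: Optional[str] = None,
                upper: Optional[str] = None) -> List[str]:
    """Keep only the files whose min/max range overlaps the predicate range."""
    return [s.file_name for s in all_stats if may_match(s, lower, upper)]


# Example input: three files with made-up commit-time ranges.
example_stats = [
    FileCommitTimeStats("f1", "20240101090000000", "20240101120000000"),
    FileCommitTimeStats("f2", "20240101130000000", "20240103080000000"),
    FileCommitTimeStats("f3", "20240104000000000", "20240104100000000"),
]
since_only = prune_files(example_stats, lower="20240102000000000")
bounded = prune_files(example_stats,
                      lower="20240102000000000",
                      upper="20240103235959999")
```

   Pruning like this is conservative: a kept file may still contain no 
matching rows, so the row-level filter has to run on whatever files survive.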
   
   Regarding engine scope, let's start with Spark. The metadata table 
column_stats path and `HoodieFileIndex` are already very mature there, and 
that's where the ticket originated (DeltaStreamer). Flink should follow as a 
separate PR; the underlying metadata is engine-independent.

