danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410089268


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
     //       and candidate files are obtained from these file slices.
 
     lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
-
+    // bucket query index
+    var bucketIds = Option.empty[BitSet]
+    if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+      bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+    }
+    // record index
     lazy val (_, recordKeys) = recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-    if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+    // index chose
+    if (bucketIndex.isIndexAvailable && bucketIds.isDefined && bucketIds.get.cardinality() > 0) {
+      Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   We are just doing two levels of pruning/skipping here:
   
   1. file group skipping with the bucket index (so that the overall set of candidates is pruned before the next step);
   2. file skipping within a file group.
   
   These two steps should be orthogonal, and we could have both. Maybe RLI does not make sense when the hash keys equal the primary keys, but when the hash keys are a subset of the record keys, we can still get the gains.
   
   And if there are other predicates, like max/min from the column stats, we can narrow the candidates down even further, to very specific files.
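
   To illustrate the two levels composing rather than being an either/or choice, here is a minimal sketch. It is not the actual `HoodieFileIndex` code: `BaseFile`, `pruneByBucket`, `pruneByRange` and `candidateFiles` are hypothetical names, the min/max fields stand in for column stats, and it uses Scala's immutable `BitSet` for brevity.

   ```scala
   import scala.collection.immutable.BitSet

   // Hypothetical, simplified stand-in for a base file; not a Hudi class.
   // minTs/maxTs play the role of per-file column stats for one column.
   final case class BaseFile(name: String, bucketId: Int, minTs: Long, maxTs: Long)

   object TwoLevelPruning {
     // Level 1: file group skipping. Keep only files whose bucket is a candidate.
     def pruneByBucket(files: Seq[BaseFile], bucketIds: Option[BitSet]): Seq[BaseFile] =
       bucketIds match {
         case Some(ids) if ids.nonEmpty => files.filter(f => ids.contains(f.bucketId))
         case _                         => files // no usable bucket predicate, nothing pruned
       }

     // Level 2: file skipping within the surviving file groups.
     // Drop files whose [minTs, maxTs] range cannot contain the queried value.
     def pruneByRange(files: Seq[BaseFile], ts: Option[Long]): Seq[BaseFile] =
       ts match {
         case Some(v) => files.filter(f => f.minTs <= v && v <= f.maxTs)
         case None    => files
       }

     // The two levels are independent, so they simply chain.
     def candidateFiles(files: Seq[BaseFile],
                        bucketIds: Option[BitSet],
                        ts: Option[Long]): Seq[BaseFile] =
       pruneByRange(pruneByBucket(files, bucketIds), ts)
   }
   ```

   With this shape, the bucket step only shrinks the input of the next step, so a record-level or column-stats lookup would still apply whenever its own predicate is present.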



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
