danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410089268
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
     // and candidate files are obtained from these file slices.
     lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
-
+    // bucket query index
+    var bucketIds = Option.empty[BitSet]
+    if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+      bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+    }
+    // record index
     lazy val (_, recordKeys) = recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-    if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+    // index chose
+    if (bucketIndex.isIndexAvailable && bucketIds.isDefined && bucketIds.get.cardinality() > 0) {
+      Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   We are just doing two levels of pruning/skipping here:

   1. file group skipping with the bucket index (so that the overall set of candidates is pruned before the next step);
   2. file skipping within a file group.

   These two steps should be orthogonal, and we could have both. Maybe RLI does not make sense when the hash keys equal the primary keys, but when the hash keys are a subset of the record keys, we can still have the gains. And if there are other predicates, such as max/min from the column stats, we can then skip even individual files.
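To illustrate the point about the two levels being composable rather than mutually exclusive, here is a minimal, self-contained sketch. It does not use any Hudi classes: `BaseFile`, `pruneByBucket`, `pruneByRange`, and `candidates` are hypothetical names, the `minTs`/`maxTs` fields stand in for column-stats min/max values, and Scala's immutable `BitSet` is used in place of whatever bitset type the PR uses.

```scala
import scala.collection.immutable.BitSet

// Hypothetical stand-in for a base file; not a Hudi type.
// minTs/maxTs play the role of column-stats min/max for one column.
final case class BaseFile(name: String, bucketId: Int, minTs: Long, maxTs: Long)

object TwoLevelPruning {

  // Level 1: file group skipping. Keep only files whose bucket id was
  // selected by the hash-key predicates. With no usable bucket ids,
  // nothing is pruned at this level.
  def pruneByBucket(files: Seq[BaseFile], bucketIds: Option[BitSet]): Seq[BaseFile] =
    bucketIds match {
      case Some(ids) if ids.nonEmpty => files.filter(f => ids.contains(f.bucketId))
      case _                         => files
    }

  // Level 2: file skipping within the surviving file groups. Keep only
  // files whose [minTs, maxTs] range overlaps the queried [lower, upper].
  def pruneByRange(files: Seq[BaseFile], lower: Long, upper: Long): Seq[BaseFile] =
    files.filter(f => f.maxTs >= lower && f.minTs <= upper)

  // The two levels compose: level 2 runs on the output of level 1
  // instead of being skipped whenever level 1 applies.
  def candidates(files: Seq[BaseFile],
                 bucketIds: Option[BitSet],
                 lower: Long,
                 upper: Long): Seq[BaseFile] =
    pruneByRange(pruneByBucket(files, bucketIds), lower, upper)
}
```

In this sketch the bucket filter only narrows the input of the second step, so a query with both a hash-key predicate and a range predicate benefits from both, while a query with only one of them still gets that one.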