Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

via GitHub Tue, 28 Nov 2023 21:51:19 -0800


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1408777462



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
     //       and candidate files are obtained from these file slices.
 
     lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
-
+    // bucket query index
+    var bucketIds = Option.empty[BitSet]
+    if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+      bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+    }
+    // record index
     lazy val (_, recordKeys) = recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-    if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+    // index chose
+    if (bucketIndex.isIndexAvailable && bucketIds.isDefined && bucketIds.get.cardinality() > 0) {
+      Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   > And the bucket index file skipping should be orthogonal to the other 
skipping strategies, let's just try the bucket skipping first then continue 
with the other strategies.
   
   Good idea, but I think that when the bucket query index is applicable, it 
already prunes files effectively on its own.
   
   There are two considerations here. First, the record-level index already 
serves as the primary-key index, so I don't think combining it with the bucket 
index would prune any further. Second, the column index needs a suitable data 
layout to give good results, and sorting the layout is not feasible for bucket 
tables. So I don't think combining it with the bucket index would yield much 
benefit either.
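   
   To illustrate why bucket pruning can be effective on its own, here is a 
minimal sketch of the idea, not the code in this PR. `BaseFile`, `bucketIdOf`, 
`matchingBuckets` and `candidateFiles` are hypothetical names, and the hash 
(non-negative `hashCode` modulo the bucket count) is a simplifying assumption 
rather than Hudi's exact bucket hashing.

```scala
import java.util.BitSet

object BucketPruningSketch {
  // Hypothetical stand-in for a base file; only its bucket id matters here.
  final case class BaseFile(name: String, bucketId: Int)

  // Assumed, simplified hash: non-negative hashCode modulo the bucket count.
  def bucketIdOf(key: String, numBuckets: Int): Int =
    (key.hashCode & Int.MaxValue) % numBuckets

  // Mark every bucket that an equality/IN literal on the hash field maps to.
  def matchingBuckets(literals: Seq[String], numBuckets: Int): BitSet = {
    val bits = new BitSet(numBuckets)
    literals.foreach(v => bits.set(bucketIdOf(v, numBuckets)))
    bits
  }

  // Keep only files in a marked bucket; an empty BitSet means "cannot prune",
  // so all files are returned and other skipping strategies can still apply.
  def candidateFiles(files: Seq[BaseFile], buckets: BitSet): Seq[BaseFile] =
    if (buckets.cardinality() == 0) files
    else files.filter(f => buckets.get(f.bucketId))
}
```

   With N buckets, a single equality filter on the hash field leaves roughly 
1/N of the files under this sketch, without needing any sorted layout.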



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
