Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-20 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2119749841

   
   ## CI report:
   
   * d404396954992d5baa2499715705445e3f5d82f1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23997)
 
   * 1528c7b13bfe56dfb5f0a1527e628508915d32d5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24001)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-20 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2119740236

   
   ## CI report:
   
   * d404396954992d5baa2499715705445e3f5d82f1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23997)
 
   * 1528c7b13bfe56dfb5f0a1527e628508915d32d5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2119695952

   
   ## CI report:
   
   * d404396954992d5baa2499715705445e3f5d82f1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23997)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2119689447

   
   ## CI report:
   
   * d404396954992d5baa2499715705445e3f5d82f1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23997)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2119685454

   @hudi-bot run azure





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2119627299

   
   ## CI report:
   
   * d404396954992d5baa2499715705445e3f5d82f1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23997)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2119591370

   
   ## CI report:
   
   * e3223a6ef0dd865dcbd672cca9f5fb979f80ddc5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23984)
 
   * d404396954992d5baa2499715705445e3f5d82f1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23997)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2119586528

   
   ## CI report:
   
   * e3223a6ef0dd865dcbd672cca9f5fb979f80ddc5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23984)
 
   * d404396954992d5baa2499715705445e3f5d82f1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1606187786


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -344,7 +345,7 @@ case class HoodieFileIndex(spark: SparkSession,
 //   and candidate files are obtained from these file slices.
 
 lazy val queryReferencedColumns = collectReferencedColumns(spark, 
queryFilters, schema)
-if (isMetadataTableEnabled && isDataSkippingEnabled) {
+if (isDataSkippingEnabled) {

Review Comment:
   bucket index query pruning doesn't need the MDT; every index already applies this condition in its own branch, so remove it here.
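   To make the intent concrete, here is a minimal sketch (not the actual patch) of the split being described, reusing the `HoodieFileIndex` names quoted elsewhere in this thread; `lookupCandidateFilesInMetadataTable` is a stand-in for the existing MDT-backed lookup path:

```scala
// Sketch only: bucketIndex, allBaseFiles, isDataSkippingEnabled and
// isMetadataTableEnabled are the names used in the HoodieFileIndex snippets
// quoted in this thread; lookupCandidateFilesInMetadataTable is a stand-in.
val bucketIds: Option[BitSet] =
  if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
    bucketIndex.filterQueriesWithBucketHashField(queryFilters)
  } else {
    Option.empty
  }

val candidateFilesOpt: Option[Set[String]] =
  if (bucketIds.exists(_.cardinality() > 0)) {
    // bucket pruning works purely on file names, no metadata table involved
    Some(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))
  } else if (isMetadataTableEnabled && isDataSkippingEnabled) {
    // metadata-table-backed indexes keep their own MDT guard
    lookupCandidateFilesInMetadataTable(queryFilters, queryReferencedColumns)
  } else {
    Option.empty
  }
```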






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1606136424


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/TestDataSkippingQuery.scala:
##
@@ -113,4 +113,82 @@ class TestDataSkippingQuery extends HoodieSparkSqlTestBase 
{
   )
 }
   }
+
+  test("bucket index query") {
+// table bucket prop can not read in query sql now, so need set these conf

Review Comment:
   the table bucket props cannot be read in the query SQL for now, so these configs need to be set
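   As a purely illustrative aside (assuming a record key column `id`; the exact keys and values in the test may differ, and the final name of the pruning flag is still being discussed in this thread), "these configs" would be set at the session level roughly like this:

```scala
// Illustrative only: standard bucket-index write configs plus the query-pruning
// flag added by this PR (a rename of that key is proposed elsewhere in this thread).
spark.sql("set hoodie.index.type=BUCKET")
spark.sql("set hoodie.bucket.index.num.buckets=8")
spark.sql("set hoodie.bucket.index.hash.field=id")
spark.sql("set hoodie.bucket.query.index=true")
```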






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1606136024


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.FileSlice
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+import org.slf4j.LoggerFactory
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(spark: SparkSession,
+ metadataConfig: HoodieMetadataConfig,
+ metaClient: HoodieTableMetaClient)
+  extends SparkBaseIndexSupport (spark, metadataConfig, metaClient){
+
+  private val log = LoggerFactory.getLogger(getClass)
+
+  private val keyGenerator =
+HoodieSparkKeyGeneratorFactory.createKeyGenerator(metadataConfig.getProps)
+
+  private lazy val avroSchema = new 
TableSchemaResolver(metaClient).getTableAvroSchema(false)
+
+  override def getIndexName: String = "BUCKET"
+
+  /**
+   * Return true if table can use bucket index
+   * - has bucket hash field
+   * - table is bucket index writer
+   * - only support simple bucket engine
+   */
+  def isIndexAvailable: Boolean = {
+indexBucketHashFieldsOpt.isDefined &&
+  metadataConfig.getStringOrDefault(HoodieIndexConfig.INDEX_TYPE, 
"").equalsIgnoreCase(IndexType.BUCKET.name()) &&
+  
metadataConfig.getStringOrDefault(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE).equalsIgnoreCase(HoodieIndex.BucketIndexEngineType.SIMPLE.name())
 &&
+  metadataConfig.getBooleanOrDefault(HoodieIndexConfig.BUCKET_QUERY_INDEX)
+  }
+
+  override def invalidateCaches(): Unit = {
+// no caches for this index type, do nothing
+  }
+
+  override def computeCandidateFileNames(fileIndex: HoodieFileIndex,
+ queryFilters: Seq[Expression],
+ queryReferencedColumns: Seq[String],
+ prunedPartitionsAndFileSlices: 
Seq[(Option[BaseHoodieTableFileIndex.PartitionPath], Seq[FileSlice])],
+ shouldPushDownFilesFilter: Boolean): 
Option[Set[String]] = {
+
+val bucketIdsBitMapByFilter = 
filterQueriesWithBucketHashField(queryFilters)
+
+if (bucketIdsBitMapByFilter.isDefined && 
bucketIdsBitMapByFilter.get.cardinality() > 0) {
+  val allFilesName = getPrunedFileNames(prunedPartitionsAndFileSlices)
+  Option.apply(getCandidateFiles(allFilesName, 
bucketIdsBitMapByFilter.get))
+} else {
+  Option.empty
+}
+  }
+
+  def getCandidateFiles(allFilesName: Set[String], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (fileName <- allFilesName) {
+  val fileId = FSUtils.getFileIdFromFileName(fileName)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += fileName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getIntOrDefault(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+if (indexBucketHashFieldsOpt.isEmpty || 

Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-19 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1606135568


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##
@@ -334,6 +334,11 @@ public class HoodieIndexConfig extends HoodieConfig {
   .withDocumentation("Only applies when #recordIndexUseCaching is set. 
Determine what level of persistence is used to cache input RDDs. "
   + "Refer to org.apache.spark.storage.StorageLevel for different 
values");
 
+  public static final ConfigProperty BUCKET_QUERY_INDEX = 
ConfigProperty
+  .key("hoodie.bucket.query.index")

Review Comment:
   `hoodie.bucket.query.index` -> `hoodie.bucket.index.query.pruning` ?






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-17 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2117790278

   
   ## CI report:
   
   * e3223a6ef0dd865dcbd672cca9f5fb979f80ddc5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23984)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-17 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2117577092

   
   ## CI report:
   
   * ef29826c5973ac624100b38717c685d3a1059fe2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23976)
 
   * e3223a6ef0dd865dcbd672cca9f5fb979f80ddc5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23984)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-17 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2117560864

   
   ## CI report:
   
   * ef29826c5973ac624100b38717c685d3a1059fe2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23976)
 
   * e3223a6ef0dd865dcbd672cca9f5fb979f80ddc5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-17 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1604930648


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+import org.slf4j.LoggerFactory
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig, schema: 
StructType) {
+
+  private val log = LoggerFactory.getLogger(getClass)
+
+  private val keyGenerator =
+HoodieSparkKeyGeneratorFactory.createKeyGenerator(metadataConfig.getProps)
+
+  private lazy val avroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(schema, "record", "")

Review Comment:
   good catch, will fix it






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-17 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-211750

   @danny0405 yes, it is ready for review





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-17 Thread via GitHub


danny0405 commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2117101176

   @KnightChess Is this patch ready for review again?





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-16 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2115660930

   
   ## CI report:
   
   * ef29826c5973ac624100b38717c685d3a1059fe2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23976)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-16 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2115510319

   
   ## CI report:
   
   * cd7e0060a34bd5b45db5c5a77e7a807c8b1104ca Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22091)
 
   * ef29826c5973ac624100b38717c685d3a1059fe2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23976)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-05-16 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-2115491834

   
   ## CI report:
   
   * cd7e0060a34bd5b45db5c5a77e7a807c8b1104ca Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22091)
 
   * ef29826c5973ac624100b38717c685d3a1059fe2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-02-01 Thread via GitHub


lookingUpAtTheSky commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1474301587


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+import org.slf4j.LoggerFactory
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig, schema: 
StructType) {
+
+  private val log = LoggerFactory.getLogger(getClass)
+
+  private val keyGenerator =
+HoodieSparkKeyGeneratorFactory.createKeyGenerator(metadataConfig.getProps)
+
+  private lazy val avroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(schema, "record", "")

Review Comment:
   the schema referred to here, as used in HoodieFileIndex, is converted from the Avro schema to a StructType.
   is it necessary to convert the Avro schema to a struct and then convert it back?
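   For reference, a minimal sketch of the alternative implied by this question, which a later revision of `BucketIndexSupport` quoted earlier in this archive ends up using: resolve the table's Avro schema directly instead of round-tripping through `StructType`:

```scala
import org.apache.avro.Schema
import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}

// Read the table's Avro schema straight from the meta client,
// avoiding the StructType -> Avro conversion questioned above.
def tableAvroSchema(metaClient: HoodieTableMetaClient): Schema =
  new TableSchemaResolver(metaClient).getTableAvroSchema(false)
```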






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-22 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1904074088

   @nsivabalan @danny0405 @yihua hi, CI all succeeded, can you help review it?





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-22 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1903995552

   
   ## CI report:
   
   * cd7e0060a34bd5b45db5c5a77e7a807c8b1104ca Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22091)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-22 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1903697800

   
   ## CI report:
   
   * fef6c24e5290817312d2520c8af84223ccac8a08 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21967)
 
   * cd7e0060a34bd5b45db5c5a77e7a807c8b1104ca Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22091)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-22 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1903683175

   
   ## CI report:
   
   * fef6c24e5290817312d2520c8af84223ccac8a08 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21967)
 
   * cd7e0060a34bd5b45db5c5a77e7a807c8b1104ca UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-15 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1892229587

   
   ## CI report:
   
   * fef6c24e5290817312d2520c8af84223ccac8a08 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21967)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-15 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1892040035

   
   ## CI report:
   
   * 47f54ad61bb63148eb32041921ea6eaa725566bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21927)
 
   * fef6c24e5290817312d2520c8af84223ccac8a08 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21967)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-15 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1892026842

   
   ## CI report:
   
   * 47f54ad61bb63148eb32041921ea6eaa725566bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21927)
 
   * fef6c24e5290817312d2520c8af84223ccac8a08 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-15 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1452270793


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig, schema: 
StructType) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  private val keyGenerator =
+HoodieSparkKeyGeneratorFactory.createKeyGenerator(metadataConfig.getProps)
+
+  private lazy val avroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(schema, "record", "")

Review Comment:
   this does not involve serialization, so I think there is no problem in theory






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-15 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1452269274


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig, schema: 
StructType) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  private val keyGenerator =
+HoodieSparkKeyGeneratorFactory.createKeyGenerator(metadataConfig.getProps)
+
+  private lazy val avroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(schema, "record", "")
+
+  /**
+   * the configured bucket field for the table
+   * table may not set bucket fields, it will use record key fields
+   */
+  private val indexBucketHashFieldsOpt: Option[java.util.List[String]] = {
+val bucketHashFields = 
metadataConfig.getStringOrDefault(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD,
+  metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS))
+if (bucketHashFields == null || bucketHashFields.isEmpty) {
+  Option.apply(null)
+} else {
+  
Option.apply(JavaConverters.seqAsJavaListConverter(bucketHashFields.split(",")).asJava)
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+if (indexBucketHashFieldsOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  var matchedBuckets: BitSet = null
+  if (indexBucketHashFieldsOpt.get.size == 1) {
+matchedBuckets = 
getBucketsBySingleHashFields(queryFilters.reduce(And), 
indexBucketHashFieldsOpt.get.get(0), bucketNumber)
+  } else {
+matchedBuckets = getBucketsByMultipleHashFields(queryFilters,
+  
JavaConverters.asScalaBufferConverter(indexBucketHashFieldsOpt.get).asScala.toSet,
 bucketNumber)
+  }
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("the query predicates does not specify equality for all the 
hashing fields, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  // multiple hash fields only support Equality expression by And
+  private def getBucketsByMultipleHashFields(queryFilters: Seq[Expression], 
indexBucketHashFields: Set[String], numBuckets: Int): BitSet = {
+val hashValuePairs = queryFilters.map(expr => getEqualityFieldPair(expr, 
indexBucketHashFields)).filter(pair => pair != null)

Review Comment:
   yes, expression will be split use 

Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-13 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1451646344


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
 //   and candidate files are obtained from these file slices.
 
 lazy val queryReferencedColumns = collectReferencedColumns(spark, 
queryFilters, schema)
-
+// bucket query index
+var bucketIds = Option.empty[BitSet]
+if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+  bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+}
+// record index
 lazy val (_, recordKeys) = 
recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+// index chose
+if (bucketIndex.isIndexAvailable && bucketIds.isDefined && 
bucketIds.get.cardinality() > 0) {
+  Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   I still think the optimization should be orthogonal to the others.
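   Read loosely, "orthogonal" would mean something like the sketch below, where each applicable index contributes its own candidate-file set and the results are intersected, instead of one index short-circuiting the others (placeholder names, not the PR's actual API):

```scala
// Placeholder names; purely an illustration of combining per-index results.
val perIndexCandidates: Seq[Set[String]] = Seq(
  bucketPruningCandidates,  // Option[Set[String]] from bucket pruning, if applicable
  recordIndexCandidates,    // Option[Set[String]] from the record level index, if applicable
  columnStatsCandidates     // Option[Set[String]] from column stats, if applicable
).flatten

// Intersect whatever is available; an empty Seq (reduceOption -> None) means no pruning applied.
val candidateFiles: Option[Set[String]] = perIndexCandidates.reduceOption(_ intersect _)
```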






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-13 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1451646235


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig, schema: 
StructType) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  private val keyGenerator =
+HoodieSparkKeyGeneratorFactory.createKeyGenerator(metadataConfig.getProps)
+
+  private lazy val avroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(schema, "record", "")
+
+  /**
+   * the configured bucket field for the table
+   * table may not set bucket fields, it will use record key fields
+   */
+  private val indexBucketHashFieldsOpt: Option[java.util.List[String]] = {
+val bucketHashFields = 
metadataConfig.getStringOrDefault(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD,
+  metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS))
+if (bucketHashFields == null || bucketHashFields.isEmpty) {
+  Option.apply(null)
+} else {
+  
Option.apply(JavaConverters.seqAsJavaListConverter(bucketHashFields.split(",")).asJava)
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+if (indexBucketHashFieldsOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  var matchedBuckets: BitSet = null
+  if (indexBucketHashFieldsOpt.get.size == 1) {
+matchedBuckets = 
getBucketsBySingleHashFields(queryFilters.reduce(And), 
indexBucketHashFieldsOpt.get.get(0), bucketNumber)
+  } else {
+matchedBuckets = getBucketsByMultipleHashFields(queryFilters,
+  
JavaConverters.asScalaBufferConverter(indexBucketHashFieldsOpt.get).asScala.toSet,
 bucketNumber)
+  }
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("the query predicates does not specify equality for all the 
hashing fields, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  // multiple hash fields only support Equality expression by And
+  private def getBucketsByMultipleHashFields(queryFilters: Seq[Expression], 
indexBucketHashFields: Set[String], numBuckets: Int): BitSet = {
+val hashValuePairs = queryFilters.map(expr => getEqualityFieldPair(expr, 
indexBucketHashFields)).filter(pair => pair != null)

Review Comment:
   Do we validate the formation of the `queryFilters`, should be conjunction 

Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-13 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1451646072


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig, schema: 
StructType) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  private val keyGenerator =
+HoodieSparkKeyGeneratorFactory.createKeyGenerator(metadataConfig.getProps)
+
+  private lazy val avroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(schema, "record", "")
+
+  /**
+   * the configured bucket field for the table
+   * table may not set bucket fields, it will use record key fields

Review Comment:
   `Returns the configured bucket field for the table, will fall back to  
record key fields if the bucket fields are not set up.`






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-13 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1451644597


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig, schema: 
StructType) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  private val keyGenerator =
+HoodieSparkKeyGeneratorFactory.createKeyGenerator(metadataConfig.getProps)
+
+  private lazy val avroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(schema, "record", "")

Review Comment:
   Not sure whether the namespace `record` is compatible with Hudi.






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-13 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1451644472


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig, schema: 
StructType) {
+
+  private val log = LogManager.getLogger(getClass);

Review Comment:
   The log4j logger is forbidden here; use slf4j instead.
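   A minimal sketch of the slf4j-based declaration suggested here (the surrounding class and method are only illustrative; the two slf4j calls are the point):

      import org.slf4j.{Logger, LoggerFactory}

      class BucketIndexSupportLoggingSketch {
        // slf4j replacement for the log4j LogManager call flagged above
        private val log: Logger = LoggerFactory.getLogger(getClass)

        def warnFallback(): Unit = {
          log.info("query predicates do not cover all hashing fields, falling back to another index")
        }
      }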



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-13 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1451644402


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##
@@ -334,6 +334,11 @@ public class HoodieIndexConfig extends HoodieConfig {
   .withDocumentation("Only applies when #recordIndexUseCaching is set. 
Determine what level of persistence is used to cache input RDDs. "
   + "Refer to org.apache.spark.storage.StorageLevel for different 
values");
 
+  public static final ConfigProperty BUCKET_QUERY_INDEX = 
ConfigProperty
+  .key("hoodie.bucket.query.index")

Review Comment:
   I don't think we should support changing the bucket number for a table; the user must do a rewrite.
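   A small illustrative sketch (not from the PR) of why the bucket count must stay fixed: the bucket id is the key hash modulo the configured bucket count, so pruning with a count different from the one used at write time can point at the wrong bucket:

      object BucketCountSketch {
        // illustrative only: stands in for the real hash used by BucketIdentifier
        private def bucketId(hashOfKey: Int, numBuckets: Int): Int =
          math.abs(hashOfKey) % numBuckets

        def main(args: Array[String]): Unit = {
          val writtenWith = bucketId(12345, 16) // bucket the record landed in at write time
          val prunedWith  = bucketId(12345, 20) // bucket a reader with the new count would scan
          // 12345 % 16 = 9, 12345 % 20 = 5, so the matching file would be pruned away
          println(s"written into bucket $writtenWith, reader would scan bucket $prunedWith")
        }
      }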



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1887322570

   
   ## CI report:
   
   * 47f54ad61bb63148eb32041921ea6eaa725566bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21927)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1887214366

   
   ## CI report:
   
   * 47f54ad61bb63148eb32041921ea6eaa725566bf Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21927)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1887202137

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


KnightChess closed pull request #10191: [HUDI-6207] spark support bucket index 
query for table with bucket index
URL: https://github.com/apache/hudi/pull/10191


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1886947715

   
   ## CI report:
   
   * 47f54ad61bb63148eb32041921ea6eaa725566bf Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21927)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1886721128

   
   ## CI report:
   
   * 2bc4ba3eac8d086da0ae5884bb0a536e3ee7957e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21920)
 
   * 47f54ad61bb63148eb32041921ea6eaa725566bf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21927)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1886653140

   
   ## CI report:
   
   * 2bc4ba3eac8d086da0ae5884bb0a536e3ee7957e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21920)
 
   * 47f54ad61bb63148eb32041921ea6eaa725566bf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1886624741

   
   ## CI report:
   
   * 2bc4ba3eac8d086da0ae5884bb0a536e3ee7957e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21920)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1886611275

   
   ## CI report:
   
   * 2bc4ba3eac8d086da0ae5884bb0a536e3ee7957e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21920)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-10 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1886555643

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-10 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1886475580

   
   ## CI report:
   
   * 2bc4ba3eac8d086da0ae5884bb0a536e3ee7957e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21920)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-10 Thread via GitHub


KnightChess closed pull request #10191: [HUDI-6207] spark support bucket index 
query for table with bucket index
URL: https://github.com/apache/hudi/pull/10191


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-10 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1886178722

   
   ## CI report:
   
   * db7a22bac00d2670c5103ca28434fb9a2e1d1256 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21839)
 
   * 2bc4ba3eac8d086da0ae5884bb0a536e3ee7957e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21920)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-10 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1886172344

   
   ## CI report:
   
   * db7a22bac00d2670c5103ca28434fb9a2e1d1256 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21839)
 
   * 2bc4ba3eac8d086da0ae5884bb0a536e3ee7957e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-05 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878999084

   
   ## CI report:
   
   * db7a22bac00d2670c5103ca28434fb9a2e1d1256 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21839)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-05 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878762919

   
   ## CI report:
   
   * 6f3cd06ffdd5f8df8e12120b7f4685b2294d1f7f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21835)
 
   * db7a22bac00d2670c5103ca28434fb9a2e1d1256 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21839)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-05 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878752192

   
   ## CI report:
   
   * 6f3cd06ffdd5f8df8e12120b7f4685b2294d1f7f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21835)
 
   * db7a22bac00d2670c5103ca28434fb9a2e1d1256 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-05 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878727118

   @voonhous hi, I have made some modifications according to your suggestions. 
@danny0405 I have added support for multiple hash fields according to your advice. If you 
have other questions or suggestions, I will be happy to adjust further.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-05 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1442914368


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericData
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.hudi.keygen.KeyGenerator
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig, schema: 
StructType) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  private val keyGenerator =
+HoodieSparkKeyGeneratorFactory.createKeyGenerator(metadataConfig.getProps)
+
+  private lazy val avroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(schema, "record", "")
+
+  /**
+   * the configured bucket field for the table
+   */
+  private val indexBucketHashFieldsOpt: Option[java.util.List[String]] = {
+val bucketHashFields = 
metadataConfig.getStringOrDefault(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD,
+  metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS))
+if (bucketHashFields == null || bucketHashFields.isEmpty) {
+  Option.apply(null)
+} else {
+  
Option.apply(JavaConverters.seqAsJavaListConverter(bucketHashFields.split(",")).asJava)
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+if (indexBucketHashFieldsOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  var matchedBuckets: BitSet = null
+  if (indexBucketHashFieldsOpt.get.size == 1) {
+matchedBuckets = 
getBucketsBySingleHashFields(queryFilters.reduce(And), 
indexBucketHashFieldsOpt.get.get(0), bucketNumber)
+  } else {
+matchedBuckets = getBucketsByMultipleHashFields(queryFilters,
+  
JavaConverters.asScalaBufferConverter(indexBucketHashFieldsOpt.get).asScala.toSet,
 bucketNumber)
+  }
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("the query predicates does not specify equality for all the 
hashing fields, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  // multiple hash fields only support Equality expression by And
+  private def getBucketsByMultipleHashFields(queryFilters: Seq[Expression], 
indexBucketHashFields: Set[String], numBuckets: Int): BitSet = {
+val hashValuePairs = queryFilters.map(expr => getEqualityFieldPair(expr, 
indexBucketHashFields)).filter(pair => pair != null)
+val matchedBuckets = new BitSet(numBuckets)
+if (hashValuePairs.size != indexBucketHashFields.size) {
+  matchedBuckets.setUntil(numBuckets)
+} 

Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-05 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1442911322


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestBucketIndexQuery.scala:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+class TestBucketIndexQuery extends HoodieSparkSqlTestBase {
+
+  test("bucket index query etl") {
+// table bucket prop can not read in query sql now, so need set these conf
+withSQLConf("hoodie.enable.data.skipping" -> "true",
+  "hoodie.bucket.index.hash.field" -> "id",
+  "hoodie.bucket.index.num.buckets" -> "20",
+  "hoodie.index.type" -> "BUCKET") {
+  withTempDir { tmp =>
+val tableName = generateTableName
+// Create a partitioned table
+spark.sql(
+  s"""
+ |create table $tableName (
+ |  id int,
+ |  dt string,
+ |  name string,
+ |  price double,
+ |  ts long
+ |) using hudi
+ | tblproperties (
+ | primaryKey = 'id,name',
+ | preCombineField = 'ts',
+ | hoodie.index.type = 'BUCKET',
+ | hoodie.bucket.index.hash.field = 'id',
+ | hoodie.bucket.index.num.buckets = '20')
+ | partitioned by (dt)
+ | location '${tmp.getCanonicalPath}'
+   """.stripMargin)
+
+spark.sql(
+  s"""
+ | insert into $tableName values
+ | (1, 'a1', 10, 1000, "2021-01-05"),
+ | (2, 'a2', 20, 2000, "2021-01-06"),
+ | (3, 'a3', 30, 3000, "2021-01-07")
+  """.stripMargin)
+
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
= 1")(
+  Seq(1, "a1", 10.0, 1000, "2021-01-05")
+)
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
= 1 and name = 'a1'")(
+  Seq(1, "a1", 10.0, 1000, "2021-01-05")
+)
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
= 2 or id = 5")(
+  Seq(2, "a2", 20.0, 2000, "2021-01-06")
+)
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
in (2, 3)")(
+  Seq(2, "a2", 20.0, 2000, "2021-01-06"),
+  Seq(3, "a3", 30.0, 3000, "2021-01-07")
+)
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
!= 4")(
+  Seq(1, "a1", 10.0, 1000, "2021-01-05"),
+  Seq(2, "a2", 20.0, 2000, "2021-01-06"),
+  Seq(3, "a3", 30.0, 3000, "2021-01-07")
+)
+spark.sql("set hoodie.bucket.query.index = false")
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
= 1")(
+  Seq(1, "a1", 10.0, 1000, "2021-01-05")
+)
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
= 1 and name = 'a1'")(
+  Seq(1, "a1", 10.0, 1000, "2021-01-05")
+)
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
= 2 or id = 5")(
+  Seq(2, "a2", 20.0, 2000, "2021-01-06")
+)
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
in (2, 3)")(
+  Seq(2, "a2", 20.0, 2000, "2021-01-06"),
+  Seq(3, "a3", 30.0, 3000, "2021-01-07")
+)
+checkAnswer(s"select id, name, price, ts, dt from $tableName where id 
!= 4")(
+  Seq(1, "a1", 10.0, 1000, "2021-01-05"),
+  Seq(2, "a2", 20.0, 2000, "2021-01-06"),
+  Seq(3, "a3", 30.0, 3000, "2021-01-07")
+)
+spark.sql("set hoodie.bucket.query.index = true")

Review Comment:
   Right now this test file only contains one test case; if more are added, make sure the 
SparkConf stays constant across tests, e.g. by using the `withSQLConf` method.
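   A sketch of how a second test could keep its configs scoped, reusing the `withSQLConf` helper from `HoodieSparkSqlTestBase` with the same keys as the quoted test (the class and test names below are assumptions):

      package org.apache.spark.sql.hudi

      class TestBucketIndexQueryScopedConf extends HoodieSparkSqlTestBase {

        test("another bucket index query case") {
          withSQLConf(
            "hoodie.enable.data.skipping" -> "true",
            "hoodie.bucket.index.hash.field" -> "id",
            "hoodie.bucket.index.num.buckets" -> "20",
            "hoodie.index.type" -> "BUCKET") {
            // queries issued here see the configs above; the previous values are
            // restored when the block exits, so later tests in this file start
            // from an unchanged SparkConf
          }
        }
      }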



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-05 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878345737

   
   ## CI report:
   
   * 6f3cd06ffdd5f8df8e12120b7f4685b2294d1f7f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21835)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-05 Thread via GitHub


KnightChess closed pull request #10191: [HUDI-6207] spark support bucket index 
query for table with bucket index
URL: https://github.com/apache/hudi/pull/10191


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878224053

   
   ## CI report:
   
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   * 6f3cd06ffdd5f8df8e12120b7f4685b2294d1f7f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21835)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878184333

   
   ## CI report:
   
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   * 6f3cd06ffdd5f8df8e12120b7f4685b2294d1f7f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878178397

   
   ## CI report:
   
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878172345

   
   ## CI report:
   
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878169603

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878068658

   
   ## CI report:
   
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878064114

   
   ## CI report:
   
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1878041193

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1877258909

   
   ## CI report:
   
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1877246188

   
   ## CI report:
   
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1877183957

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1877149471

   
   ## CI report:
   
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1876984521

   
   ## CI report:
   
   * 5fb0402726344bcfee5cd86dd82eccf16fe68b58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21809)
 
   * ca17448d2f0104f93e0e80669ed4156118d29f58 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-04 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1876930389

   
   ## CI report:
   
   * 5fb0402726344bcfee5cd86dd82eccf16fe68b58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21809)
 
   * ca17448d2f0104f93e0e80669ed4156118d29f58 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-03 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1875336727

   
   ## CI report:
   
   * 5fb0402726344bcfee5cd86dd82eccf16fe68b58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21809)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-03 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1875211762

   
   ## CI report:
   
   * 3d696f17a059e11e23c0bfe0a948c8ae993b5f8d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21719)
 
   * 5fb0402726344bcfee5cd86dd82eccf16fe68b58 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21809)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-03 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1875201878

   
   ## CI report:
   
   * 3d696f17a059e11e23c0bfe0a948c8ae993b5f8d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21719)
 
   * 5fb0402726344bcfee5cd86dd82eccf16fe68b58 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-02 Thread via GitHub


KnightChess opened a new pull request, #10191:
URL: https://github.com/apache/hudi/pull/10191

   ### Change Logs
   
   Spark supports pruning query filters on the bucket field when a bucket-indexed table is 
queried with appropriate expressions (=, in, and, or).
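   For illustration only (queries adapted from the test added in this PR; the table and column names are assumptions), predicates of these shapes can be pruned by the bucket index:

      // assuming a bucket-indexed table `t1` hashed on `id`, as in the added test
      spark.sql("select id, name from t1 where id = 1")                 // equality on the hash field
      spark.sql("select id, name from t1 where id = 1 and name = 'a1'") // And containing such an equality
      spark.sql("select id, name from t1 where id = 2 or id = 5")       // Or of equalities
      spark.sql("select id, name from t1 where id in (2, 3)")           // In list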
   
   ### Impact
   
   Improves table query performance when querying with Spark.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-02 Thread via GitHub


KnightChess closed pull request #10191: [HUDI-6207] spark support bucket index 
query for table with bucket index
URL: https://github.com/apache/hudi/pull/10191


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870457354

   
   ## CI report:
   
   * 3d696f17a059e11e23c0bfe0a948c8ae993b5f8d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21719)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870374926

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 3d696f17a059e11e23c0bfe0a948c8ae993b5f8d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21719)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870368101

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 3d696f17a059e11e23c0bfe0a948c8ae993b5f8d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870363178

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870361300

   
   ## CI report:
   
   * 7c6d62fd030999f519b43112cda3ce984b80dfa3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21717)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870312901

   
   ## CI report:
   
   * 53efe25403ee38f26cf91633a4dc6adca17c5380 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21715)
 
   * 7c6d62fd030999f519b43112cda3ce984b80dfa3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21717)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870275566

   
   ## CI report:
   
   * ea30f15a71231cfb40b57bbae1642e731945fa69 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21183)
 
   * 53efe25403ee38f26cf91633a4dc6adca17c5380 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21715)
 
   * 7c6d62fd030999f519b43112cda3ce984b80dfa3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21717)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870269920

   
   ## CI report:
   
   * ea30f15a71231cfb40b57bbae1642e731945fa69 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21183)
 
   * 53efe25403ee38f26cf91633a4dc6adca17c5380 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21715)
 
   * 7c6d62fd030999f519b43112cda3ce984b80dfa3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870232436

   
   ## CI report:
   
   * ea30f15a71231cfb40b57bbae1642e731945fa69 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21183)
 
   * 53efe25403ee38f26cf91633a4dc6adca17c5380 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21715)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-27 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1870227053

   
   ## CI report:
   
   * ea30f15a71231cfb40b57bbae1642e731945fa69 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21183)
 
   * 53efe25403ee38f26cf91633a4dc6adca17c5380 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-12-26 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1408766751


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##
@@ -334,6 +334,11 @@ public class HoodieIndexConfig extends HoodieConfig {
   .withDocumentation("Only applies when #recordIndexUseCaching is set. 
Determine what level of persistence is used to cache input RDDs. "
   + "Refer to org.apache.spark.storage.StorageLevel for different 
values");
 
+  public static final ConfigProperty BUCKET_QUERY_INDEX = 
ConfigProperty
+  .key("hoodie.bucket.query.index")

Review Comment:
   The bucket query index is inferred automatically; this config gives users a way to disable 
the inference when there is a bug or a special business scenario.
   
   Special business scenario: a user changed the bucket number but did not rewrite the 
history partitions (because the history data will never be updated in their scenario); 
in that case the bucket index query cannot be used.
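   A sketch of opting out of the inferred pruning in that scenario, using the same session key toggled in the added test (`spark` and `tableName` are assumed to be in scope):

      // disable bucket-index pruning while old partitions still use the previous bucket count
      spark.sql("set hoodie.bucket.query.index = false")
      spark.sql(s"select id, name, price, ts, dt from $tableName where id = 1").show()
      // turn it back on once all partitions have been rewritten with the new bucket count
      spark.sql("set hoodie.bucket.query.index = true")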



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-30 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1411642781


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  private def getExpressionBuckets(expr: Expression, bucketColumnName: String, 
numBuckets: Int): BitSet = {
+
+def getBucketNumber(attr: Attribute, v: Any): Int = {
+  
BucketIdentifier.getBucketId(JavaConverters.seqAsJavaListConverter(List.apply(String.valueOf(v))).asJava,
 numBuckets)

Review Comment:
   No, a multi-field hashing index is a very basic use case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-30 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410095547


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  private def getExpressionBuckets(expr: Expression, bucketColumnName: String, 
numBuckets: Int): BitSet = {
+
+def getBucketNumber(attr: Attribute, v: Any): Int = {
+  
BucketIdentifier.getBucketId(JavaConverters.seqAsJavaListConverter(List.apply(String.valueOf(v))).asJava,
 numBuckets)

Review Comment:
   In order to support index key pruning, only conjunction predicates that 
concatenate equality conditions on the hash keys are supported, see: 
https://github.com/apache/hudi/blob/d1c4ead8a80bc44731f1b615ba9166041c144948/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java#L315
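   
   As an illustration only (column names are made up and `df` is assumed to be a 
DataFrame over the table): for hash fields `id,name`, only the first filter shape 
below could be pruned by the bucket index; the other two cannot pin down the buckets.
   
   // Scala sketch using the Spark Column API
   import org.apache.spark.sql.functions.col
   
   // prunable: a conjunction of equality conditions on all hash keys
   df.filter(col("id") === 1 && col("name") === "a")
   
   // not prunable: a disjunction, or equality on only part of the hash keys
   df.filter(col("id") === 1 || col("name") === "a")
   df.filter(col("id") === 1)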



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 

Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-30 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1411642096


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
 //   and candidate files are obtained from these file slices.
 
 lazy val queryReferencedColumns = collectReferencedColumns(spark, 
queryFilters, schema)
-
+// bucket query index
+var bucketIds = Option.empty[BitSet]
+if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+  bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+}
+// record index
 lazy val (_, recordKeys) = 
recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+// index chose
+if (bucketIndex.isIndexAvailable && bucketIds.isDefined && 
bucketIds.get.cardinality() > 0) {
+  Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   If the query also specifies predicates that the max/min column stats can evaluate, 
we can still skip files after we do the bucket pruning.
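   
   For example (illustrative names only, with `id` as the hash field and `ts` an 
ordinary column tracked by column stats), a filter like the one below could first 
prune to the buckets matching `id = 1`, and the `ts` min/max stats could then drop 
the remaining base files whose range does not overlap the predicate:
   
   // Scala sketch using the Spark Column API; `df` is a DataFrame over the table
   import org.apache.spark.sql.functions.col
   
   df.filter(col("id") === 1 && col("ts") >= "2023-11-01")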



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410172249


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  private def getExpressionBuckets(expr: Expression, bucketColumnName: String, 
numBuckets: Int): BitSet = {
+
+def getBucketNumber(attr: Attribute, v: Any): Int = {
+  
BucketIdentifier.getBucketId(JavaConverters.seqAsJavaListConverter(List.apply(String.valueOf(v))).asJava,
 numBuckets)

Review Comment:
   > In order to support index key pruning, only conjunction predicates that 
concatenate equality conditions on the hash keys are supported, see
   
   yes, I have also considered this scenario, but its usage is very restricted 
and not very flexible, and the business use cases are also limited. I can apply 
the additional logic for multiple fields independently, decoupling it from the 
processing of a single field.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to 

Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410164643


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")

Review Comment:
   sounds good



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410164092


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
 //   and candidate files are obtained from these file slices.
 
 lazy val queryReferencedColumns = collectReferencedColumns(spark, 
queryFilters, schema)
-
+// bucket query index
+var bucketIds = Option.empty[BitSet]
+if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+  bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+}
+// record index
 lazy val (_, recordKeys) = 
recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+// index chose
+if (bucketIndex.isIndexAvailable && bucketIds.isDefined && 
bucketIds.get.cardinality() > 0) {
+  Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   > file skipping within a file group
   
   sorry, I can't understand the meaning of this sentence, can you explain it 
in detail?
   in my opinion, a file group consists of multiple file slices, and 
`allBaseFiles` will return the latest file slices.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
 //   and candidate files are obtained from these file slices.
 
 lazy val queryReferencedColumns = collectReferencedColumns(spark, 
queryFilters, schema)
-
+// bucket query index
+var bucketIds = Option.empty[BitSet]
+if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+  bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+}
+// record index
 lazy val (_, recordKeys) = 
recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+// index chose
+if (bucketIndex.isIndexAvailable && bucketIds.isDefined && 
bucketIds.get.cardinality() > 0) {
+  Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   > file skipping within a file group
   
   sorry, I can't understand the meaning of this sentence, can you explain it 
in detail?
   in my opinion, a file group consists of multiple file slices, and 
`allBaseFiles` will return the latest file slices.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410095547


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  private def getExpressionBuckets(expr: Expression, bucketColumnName: String, 
numBuckets: Int): BitSet = {
+
+def getBucketNumber(attr: Attribute, v: Any): Int = {
+  
BucketIdentifier.getBucketId(JavaConverters.seqAsJavaListConverter(List.apply(String.valueOf(v))).asJava,
 numBuckets)

Review Comment:
   In order to support index key pruning, only conjunction predicates that 
concatenate equality conditions on the hash keys are supported, see: 
https://github.com/apache/hudi/blob/d1c4ead8a80bc44731f1b615ba9166041c144948/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java#L315



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 

Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410092117


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  private def getExpressionBuckets(expr: Expression, bucketColumnName: String, 
numBuckets: Int): BitSet = {
+
+def getBucketNumber(attr: Attribute, v: Any): Int = {
+  
BucketIdentifier.getBucketId(JavaConverters.seqAsJavaListConverter(List.apply(String.valueOf(v))).asJava,
 numBuckets)

Review Comment:
   Maybe you just take a reference of the Flink implementation: 
https://github.com/apache/hudi/blob/master/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410090727


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")

Review Comment:
   file slice is an internal notion, maybe you should say `The query predicates 
do not specify equality conditions for all the hashing fields, ...`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410089268


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
 //   and candidate files are obtained from these file slices.
 
 lazy val queryReferencedColumns = collectReferencedColumns(spark, 
queryFilters, schema)
-
+// bucket query index
+var bucketIds = Option.empty[BitSet]
+if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+  bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+}
+// record index
 lazy val (_, recordKeys) = 
recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+// index chose
+if (bucketIndex.isIndexAvailable && bucketIds.isDefined && 
bucketIds.get.cardinality() > 0) {
+  Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   We are just doing two levels of pruning/skipping here:
   
   1. file group skipping with the bucket index (so that the overall set of 
candidates is pruned before the next step);
   2. file skipping within a file group.
   
   These two steps should be orthogonal and we could have both: maybe RLI does 
not make sense when the hash keys equal the primary keys, but when the hash keys 
are a sub-set of the record keys we can still have the gains.
   
   And if there are some other predicates covered by the max/min column stats, 
we can even skip specific files on top of that.
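   
   A minimal sketch of how the two candidate sets could be combined (the names below 
are made up for illustration; the real wiring lives in `HoodieFileIndex`):
   
   // illustrative only: intersect the candidate file sets from each pruning level
   def combineCandidates(bucketCandidates: Option[Set[String]],
                         statsCandidates: Option[Set[String]]): Option[Set[String]] = {
     (bucketCandidates, statsCandidates) match {
       case (Some(b), Some(s)) => Some(b.intersect(s)) // both levels pruned something
       case (Some(b), None)    => Some(b)              // only bucket pruning applied
       case (None, Some(s))    => Some(s)              // only column-stats pruning applied
       case (None, None)       => None                 // no pruning at all: scan everything
     }
   }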



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-28 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1408788336


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  private def getExpressionBuckets(expr: Expression, bucketColumnName: String, 
numBuckets: Int): BitSet = {
+
+def getBucketNumber(attr: Attribute, v: Any): Int = {
+  
BucketIdentifier.getBucketId(JavaConverters.seqAsJavaListConverter(List.apply(String.valueOf(v))).asJava,
 numBuckets)

Review Comment:
   only one hash field is supported for now; parsing expressions over multiple hash 
fields is hard to realize. And considering the usage scenarios, queries on a composite 
key usually do not filter on all of its fields, which makes it difficult to hit the 
bucket index; users often filter on only part of the composite key. What do you think?
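   
   As a made-up illustration of that point: with record key `id,name` but a single 
hash field `id`, a filter on just `id` can still hit the bucket index, while with 
hash fields `id,name` the same filter could not:
   
   // Scala sketch; `df` is a DataFrame over the table
   import org.apache.spark.sql.functions.col
   
   df.filter(col("id") === 1)   // prunable when `id` alone is the hash field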



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 
