Re: [PR] [HUDI-7140] [DNM] Trial Patch to test CI run [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10176:
URL: https://github.com/apache/hudi/pull/10176#issuecomment-1833257061

   
   ## CI report:
   
   * a9ac4a84bfe187f9a85815aa0ce7f766f7e0b76e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21247)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-11-29 Thread via GitHub


stream2000 commented on PR #8062:
URL: https://github.com/apache/hudi/pull/8062#issuecomment-1833238919

   For Hudi 1.0.0 and later, which support efficient completion-time queries 
on the timeline (#9565), we can get a partition's `lastModifiedTime` by 
scanning the timeline for the last write commit to that partition. For 
efficiency, we can also store the partitions' last modified times and the 
current completion time in the replace commit metadata. The next time we need 
to calculate the partitions' last modified times, we can build them 
incrementally from the replace commit metadata of the last TTL management run.
   
   @danny0405 Added a new `lastModifiedTime` calculation method for Hudi 1.0.0 
and later. We plan to implement the file-listing-based `lastModifiedTime` 
first and the timeline-based `lastModifiedTime` calculation in a separate PR. 
This will make it easy for users on earlier Hudi versions to pick the 
feature into their code base.
   
   I have addressed all comments from the online/offline discussions. If 
there are no other concerns, we can move forward with this~ 
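
A minimal sketch of the incremental calculation described above, assuming a simplified in-memory timeline. `PartitionLastModifiedSketch`, `Commit`, and `update` are hypothetical names for illustration and are not Hudi classes; completion times are assumed to be fixed-width strings, so that string order matches time order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are Hudi classes.
final class PartitionLastModifiedSketch {

  /** One completed write commit: its completion time and the partitions it wrote to. */
  static final class Commit {
    final String completionTime;
    final List<String> partitions;

    Commit(String completionTime, List<String> partitions) {
      this.completionTime = completionTime;
      this.partitions = partitions;
    }
  }

  /**
   * Returns a copy of {@code previous} updated with every commit that completed
   * after {@code checkpoint}. Assumes completion times are fixed-width strings,
   * so lexicographic order matches chronological order.
   */
  static Map<String, String> update(Map<String, String> previous,
                                    String checkpoint,
                                    List<Commit> timeline) {
    Map<String, String> lastModified = new HashMap<>(previous);
    for (Commit commit : timeline) {
      if (commit.completionTime.compareTo(checkpoint) <= 0) {
        continue; // already reflected in the stored state
      }
      for (String partition : commit.partitions) {
        // Keep the later of the stored and the newly seen completion time.
        lastModified.merge(partition, commit.completionTime,
            (oldTime, newTime) -> oldTime.compareTo(newTime) >= 0 ? oldTime : newTime);
      }
    }
    return lastModified;
  }

  public static void main(String[] args) {
    // State stored by the previous TTL run (assumed shape).
    Map<String, String> stored = new HashMap<>();
    stored.put("dt=2023-11-01", "20231101120000");

    List<Commit> timeline = Arrays.asList(
        new Commit("20231101120000", Arrays.asList("dt=2023-11-01")),
        new Commit("20231129090000", Arrays.asList("dt=2023-11-01", "dt=2023-11-29")));

    System.out.println(update(stored, "20231101120000", timeline));
  }
}
```

Only commits newer than the stored checkpoint are visited, which is the point of keeping the previous result and completion time in the replace commit metadata.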





[I] [SUPPORT] Kafka Avro Confluent Schema Registry version 7 compatibility issues [hudi]

2023-11-29 Thread via GitHub


zachtrong opened a new issue, #10217:
URL: https://github.com/apache/hudi/issues/10217

   **Context**
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   Yes
   
   **Problem description**
   
   The latest Apache Hudi release (0.14.0) uses 
kafka-avro-serializer-5.3.4.jar, which causes deserialization issues when used 
with a Kafka Avro data source and the Confluent REST API version 6/7.
   
   Sample Avro schema:
   ```
   
{"id":36,"subject":"test-value","version":12,"schema":"{\"type\":\"record\",\"name\":\"test\",\"namespace\":\"test\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null}],\"connect.name\":\"mongo.bot.test.test\"}","references":[]}
   ```
   The key problem is the **"references": []** field, which the bundled client does not recognize.
   
   **Suggestion**
   Upgrade the jar to io.confluent:kafka-avro-serializer:7.5.1 from the 
Confluent repository.
   
   Reference: 
[kafka-avro-serializer:7.5.1](https://github.com/confluentinc/schema-registry/blob/v7.5.1/client/src/main/java/io/confluent/kafka/schemaregistry/client/rest/entities/SchemaString.java)
   
   **Expected behavior**
   
   Apache Hudi is able to parse the above Avro schema without error.
   
   **Environment Description**
   
   * Hudi version: 0.14.0
   
   * Spark version: 3.4.1
   
   * Hive version: 3.1.3
   
   * Hadoop version: 3.3.4
   
   * Storage (HDFS/S3/GCS..): S3
   
   * Running on Docker? (yes/no): yes
   
   **Stacktrace**
   
   ```
   Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 25
   Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "references" (class io.confluent.kafka.schemaregistry.client.rest.entities.SchemaString), not marked as ignorable (one known property: "schema"])
    at [Source: (sun.net.www.protocol.http.HttpURLConnection$HttpInputStream); line: 1, column: 2063] (through reference chain: io.confluent.kafka.schemaregistry.client.rest.entities.SchemaString["references"])
   at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:61)
   at com.fasterxml.jackson.databind.DeserializationContext.handleUnknownProperty(DeserializationContext.java:1132)
   at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:2202)
   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1705)
   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownVanilla(BeanDeserializerBase.java:1683)
   at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:320)
   at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:177)
   at com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:323)
   at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4730)
   at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3722)
   at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:221)
   at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:265)
   at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:495)
   at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:488)
   at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaByIdFromRegistry(CachedSchemaRegistryClient.java:177)
   at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getBySubjectAndId(CachedSchemaRegistryClient.java:256)
   at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getById(CachedSchemaRegistryClient.java:235)
   at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:107)
   at org.apache.hudi.utilities.deser.KafkaAvroSchemaDeserializer.deserialize(KafkaAvroSchemaDeserializer.java:79)
   at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:79)
   at io.confluent.kafka.serializers.KafkaAvroDeserializer.deserialize(KafkaAvroDeserializer.java:55)
   at org.apache.kafka.common.serialization.Deserializer.deserialize(Deserializer.java:60)
   at org.apache.kafka.clients.consumer.internals.Fetcher.parseRecord(Fetcher.java:1386)
   at org.apache.kafka.clients.consumer.internals.Fetcher.access$3400(Fetcher.java:133)
   at org.apache.kafka.clients.consumer.

Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10209:
URL: https://github.com/apache/hudi/pull/10209#issuecomment-1833155126

   
   ## CI report:
   
   * a22a697252040fc83d09e2b443942859b0b1d421 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21218)
 
   * a5136d4c7459d2902dc00750c24b5e48820ea619 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21248)
 
   
   



Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10209:
URL: https://github.com/apache/hudi/pull/10209#issuecomment-1833149284

   
   ## CI report:
   
   * a22a697252040fc83d09e2b443942859b0b1d421 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21218)
 
   * a5136d4c7459d2902dc00750c24b5e48820ea619 UNKNOWN
   
   



Re: [PR] [HUDI-7140] [DNM] Trial Patch to test CI run [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10176:
URL: https://github.com/apache/hudi/pull/10176#issuecomment-1833149163

   
   ## CI report:
   
   * 3c894596a90a326707d4aa052e34cf9f09daae75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21224)
 
   * a9ac4a84bfe187f9a85815aa0ce7f766f7e0b76e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21247)
 
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1833149064

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * 993d0dc63f3f552b8bf5b52c113f3ae8ef53304c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21241)
 
   
   



Re: [PR] [HUDI-7077] Fix OOM error for a test [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10216:
URL: https://github.com/apache/hudi/pull/10216#issuecomment-1833143559

   
   ## CI report:
   
   * a015a4f2dce1814e8387a41a6cf9842404a874c1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21245)
 
   
   



Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1833143354

   
   ## CI report:
   
   * fcc90e964c0ee3a12f0f90bf216051e0bc3b7eaa Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21244)
 
   
   



Re: [PR] [HUDI-7140] [DNM] Trial Patch to test CI run [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10176:
URL: https://github.com/apache/hudi/pull/10176#issuecomment-1833143442

   
   ## CI report:
   
   * 3c894596a90a326707d4aa052e34cf9f09daae75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21224)
 
   * a9ac4a84bfe187f9a85815aa0ce7f766f7e0b76e UNKNOWN
   
   



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410172249


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+    val bucketHashFields = metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+    if (bucketHashFields == null) {
+      val recordKeys = metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+      if (recordKeys == null) {
+        Option.apply(null)
+      } else {
+        val recordKeyArray = recordKeys.split(",")
+        if (recordKeyArray.length == 1) {
+          Option.apply(recordKeyArray(0))
+        } else {
+          log.warn("bucket query index only support one bucket field")
+          Option.apply(null)
+        }
+      }
+    } else {
+      val fields = bucketHashFields.split(",")
+      if (fields.length == 1) {
+        Option.apply(fields(0))
+      } else {
+        log.warn("bucket query index only support one bucket field")
+        Option.apply(null)
+      }
+    }
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): Set[String] = {
+    val candidateFiles: mutable.Set[String] = mutable.Set.empty
+    for (file <- allFiles) {
+      val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+      val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+      if (bucketIds.get(fileBucketId)) {
+        candidateFiles += file.getPath.getName
+      }
+    }
+    candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): Option[BitSet] = {
+    val bucketNumber = metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+    val bucketHashFieldOpt = getBucketHashField
+    if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+      None
+    } else {
+      val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), bucketHashFieldOpt.get, bucketNumber)
+
+      val numBucketsSelected = matchedBuckets.cardinality()
+
+      // None means all the buckets need to be scanned
+      if (numBucketsSelected == bucketNumber) {
+        log.info("bucket query match all file slice, fallback other index")
+        None
+      } else {
+        Some(matchedBuckets)
+      }
+    }
+  }
+
+  private def getExpressionBuckets(expr: Expression, bucketColumnName: String, numBuckets: Int): BitSet = {
+
+    def getBucketNumber(attr: Attribute, v: Any): Int = {
+      BucketIdentifier.getBucketId(JavaConverters.seqAsJavaListConverter(List.apply(String.valueOf(v))).asJava, numBuckets)

Review Comment:
   > In order to support index key pruning, only conjunction predicates that 
concatenate equations of the hash keys are supported, see
   
   Yes, I have also considered this scenario, but it is very restrictive and 
not very flexible, and the business use cases for it are limited. I can handle 
multiple fields with separate additional logic, decoupled from the 
single-field processing.
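
For readers following the thread, here is a minimal sketch of single-field bucket pruning under stated assumptions. The hash rule in `bucketOf` and the file-id layout in `bucketOfFileId` (bucket number as the first eight zero-padded digits) are assumptions for illustration, and `BucketPruningSketch` is a hypothetical class, not the PR's `BucketIndexSupport` or Hudi's `BucketIdentifier`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch; the hash rule and file-id layout are assumptions, not Hudi's API.
final class BucketPruningSketch {

  /** Maps a key value to a bucket (assumed rule: non-negative hashCode modulo bucket count). */
  static int bucketOf(String key, int numBuckets) {
    return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  /** Buckets that can hold rows matching `field = k` or `field IN (k1, k2, ...)`. */
  static BitSet bucketsForKeys(List<String> keys, int numBuckets) {
    BitSet buckets = new BitSet(numBuckets);
    for (String key : keys) {
      buckets.set(bucketOf(key, numBuckets));
    }
    return buckets;
  }

  /** Assumes the file id starts with the bucket number as eight zero-padded digits. */
  static int bucketOfFileId(String fileId) {
    return Integer.parseInt(fileId.substring(0, 8));
  }

  /** Keeps only the files whose bucket is set in {@code buckets}. */
  static List<String> candidateFiles(List<String> fileIds, BitSet buckets) {
    List<String> candidates = new ArrayList<>();
    for (String fileId : fileIds) {
      if (buckets.get(bucketOfFileId(fileId))) {
        candidates.add(fileId);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    int numBuckets = 4;
    List<String> files = Arrays.asList("00000000-x", "00000001-x", "00000002-x", "00000003-x");
    // Filter equivalent to: WHERE id = 'a'
    BitSet buckets = bucketsForKeys(Arrays.asList("a"), numBuckets);
    System.out.println(candidateFiles(files, buckets));
  }
}
```

An equality or `IN` filter on the hash field selects a small set of buckets; if every bucket ends up selected, nothing is pruned, which is why the PR falls back to other indexes in that case.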




Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410164643


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+    val bucketHashFields = metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+    if (bucketHashFields == null) {
+      val recordKeys = metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+      if (recordKeys == null) {
+        Option.apply(null)
+      } else {
+        val recordKeyArray = recordKeys.split(",")
+        if (recordKeyArray.length == 1) {
+          Option.apply(recordKeyArray(0))
+        } else {
+          log.warn("bucket query index only support one bucket field")
+          Option.apply(null)
+        }
+      }
+    } else {
+      val fields = bucketHashFields.split(",")
+      if (fields.length == 1) {
+        Option.apply(fields(0))
+      } else {
+        log.warn("bucket query index only support one bucket field")
+        Option.apply(null)
+      }
+    }
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): Set[String] = {
+    val candidateFiles: mutable.Set[String] = mutable.Set.empty
+    for (file <- allFiles) {
+      val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+      val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+      if (bucketIds.get(fileBucketId)) {
+        candidateFiles += file.getPath.getName
+      }
+    }
+    candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): Option[BitSet] = {
+    val bucketNumber = metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+    val bucketHashFieldOpt = getBucketHashField
+    if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+      None
+    } else {
+      val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), bucketHashFieldOpt.get, bucketNumber)
+
+      val numBucketsSelected = matchedBuckets.cardinality()
+
+      // None means all the buckets need to be scanned
+      if (numBucketsSelected == bucketNumber) {
+        log.info("bucket query match all file slice, fallback other index")

Review Comment:
   sounds good






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


KnightChess commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410164092


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
     //   and candidate files are obtained from these file slices.
 
     lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
-
+    // bucket query index
+    var bucketIds = Option.empty[BitSet]
+    if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+      bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+    }
+    // record index
     lazy val (_, recordKeys) = recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-    if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+    // index chose
+    if (bucketIndex.isIndexAvailable && bucketIds.isDefined && bucketIds.get.cardinality() > 0) {
+      Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   > file skipping within a file group
   
   Sorry, I don't understand this sentence. Could you explain it in detail?
   As I understand it, a file group consists of multiple file slices, and 
`allBaseFiles` returns the latest file slice.






(hudi) branch master updated: [HUDI-7160] Copy over schema properties when adding Hudi Metadata fields (#10212)

2023-11-29 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 38db88c8a2b [HUDI-7160] Copy over schema properties when adding Hudi 
Metadata fields (#10212)
38db88c8a2b is described below

commit 38db88c8a2bb0c378295324692c4c0388e60e466
Author: Tim Brown 
AuthorDate: Wed Nov 29 22:54:12 2023 -0600

[HUDI-7160] Copy over schema properties when adding Hudi Metadata fields 
(#10212)
---
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java |  3 +++
 .../org/apache/hudi/avro/TestHoodieAvroUtils.java  | 25 ++
 2 files changed, 28 insertions(+)

diff --git a/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java b/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
index 3800d9c1053..ac7dcd42979 100644
--- a/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
@@ -302,6 +302,9 @@ public class HoodieAvroUtils {
     }
 
     Schema mergedSchema = Schema.createRecord(schema.getName(), schema.getDoc(), schema.getNamespace(), false);
+    for (Map.Entry<String, Object> prop : schema.getObjectProps().entrySet()) {
+      mergedSchema.addProp(prop.getKey(), prop.getValue());
+    }
     mergedSchema.setFields(parentFields);
     return mergedSchema;
   }
diff --git a/hudi-common/src/test/java/org/apache/hudi/avro/TestHoodieAvroUtils.java b/hudi-common/src/test/java/org/apache/hudi/avro/TestHoodieAvroUtils.java
index 28b05435244..eb20081475f 100644
--- a/hudi-common/src/test/java/org/apache/hudi/avro/TestHoodieAvroUtils.java
+++ b/hudi-common/src/test/java/org/apache/hudi/avro/TestHoodieAvroUtils.java
@@ -99,6 +99,12 @@ public class TestHoodieAvroUtils {
       + "{\"name\": \"non_pii_col\", \"type\": \"string\"},"
       + "{\"name\": \"pii_col\", \"type\": \"string\", \"column_category\": \"user_profile\"}]}";
 
+  private static final String EXAMPLE_SCHEMA_WITH_PROPS = "{\"type\": \"record\",\"name\": \"testrec\",\"fields\": [ "
+      + "{\"name\": \"timestamp\",\"type\": \"double\", \"custom_field_property\":\"value\"},{\"name\": \"_row_key\", \"type\": \"string\"},"
+      + "{\"name\": \"non_pii_col\", \"type\": \"string\"},"
+      + "{\"name\": \"pii_col\", \"type\": \"string\", \"column_category\": \"user_profile\"}], "
+      + "\"custom_schema_property\": \"custom_schema_property_value\"}";
+
   private static int NUM_FIELDS_IN_EXAMPLE_SCHEMA = 4;
 
   private static String SCHEMA_WITH_METADATA_FIELD = "{\"type\": \"record\",\"name\": \"testrec2\",\"fields\": [ "
@@ -604,4 +610,23 @@ public class TestHoodieAvroUtils {
           .subtract((BigDecimal) unwrapAvroValueWrapper(wrapperValue)).toPlainString());
     }
   }
+
+  @Test
+  public void testAddMetadataFields() {
+    Schema baseSchema = new Schema.Parser().parse(EXAMPLE_SCHEMA_WITH_PROPS);
+    Schema schemaWithMetadata = HoodieAvroUtils.addMetadataFields(baseSchema);
+    List<Schema.Field> updatedFields = schemaWithMetadata.getFields();
+    // assert fields added in expected order
+    assertEquals(HoodieRecord.COMMIT_TIME_METADATA_FIELD, updatedFields.get(0).name());
+    assertEquals(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, updatedFields.get(1).name());
+    assertEquals(HoodieRecord.RECORD_KEY_METADATA_FIELD, updatedFields.get(2).name());
+    assertEquals(HoodieRecord.PARTITION_PATH_METADATA_FIELD, updatedFields.get(3).name());
+    assertEquals(HoodieRecord.FILENAME_METADATA_FIELD, updatedFields.get(4).name());
+    // assert original fields are copied over
+    List<Schema.Field> originalFieldsInUpdatedSchema = updatedFields.subList(5, updatedFields.size());
+    assertEquals(baseSchema.getFields(), originalFieldsInUpdatedSchema);
+    // validate properties are properly copied over
+    assertEquals("custom_schema_property_value", schemaWithMetadata.getProp("custom_schema_property"));
+    assertEquals("value", originalFieldsInUpdatedSchema.get(0).getProp("custom_field_property"));
+  }
 }



Re: [PR] [HUDI-7160] Copy over schema properties when adding Hudi Metadata fields [hudi]

2023-11-29 Thread via GitHub


nsivabalan merged PR #10212:
URL: https://github.com/apache/hudi/pull/10212





(hudi) branch master updated: [HUDI-7161] Add commit action type and extra metadata to write callback on commit message (#10213)

2023-11-29 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new b244e5a7b7b [HUDI-7161] Add commit action type and extra metadata to 
write callback on commit message (#10213)
b244e5a7b7b is described below

commit b244e5a7b7b4f806d51663d602b39fd724ed5d62
Author: Rajesh Mahindra <76502047+rmahindra...@users.noreply.github.com>
AuthorDate: Wed Nov 29 20:53:34 2023 -0800

[HUDI-7161] Add commit action type and extra metadata to write callback on 
commit message (#10213)


-

Co-authored-by: rmahindra123 
---
 .../common/HoodieWriteCommitCallbackMessage.java   | 36 +-
 .../apache/hudi/client/BaseHoodieWriteClient.java  |  3 +-
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/common/HoodieWriteCommitCallbackMessage.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/common/HoodieWriteCommitCallbackMessage.java
index 8210693a756..808f643da56 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/common/HoodieWriteCommitCallbackMessage.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/common/HoodieWriteCommitCallbackMessage.java
@@ -20,9 +20,11 @@ package org.apache.hudi.callback.common;
 import org.apache.hudi.ApiMaturityLevel;
 import org.apache.hudi.PublicAPIClass;
 import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.util.Option;
 
 import java.io.Serializable;
 import java.util.List;
+import java.util.Map;
 
 /**
  * Base callback message, which contains commitTime and tableName only for now.
@@ -52,11 +54,35 @@ public class HoodieWriteCommitCallbackMessage implements Serializable {
    */
   private final List<HoodieWriteStat> hoodieWriteStat;
 
-  public HoodieWriteCommitCallbackMessage(String commitTime, String tableName, String basePath, List<HoodieWriteStat> hoodieWriteStat) {
+  /**
+   * Action Type of the commit.
+   */
+  private final Option<String> commitActionType;
+
+  /**
+   * Extra metadata in the commit.
+   */
+  private final Option<Map<String, String>> extraMetadata;
+
+  public HoodieWriteCommitCallbackMessage(String commitTime,
+                                          String tableName,
+                                          String basePath,
+                                          List<HoodieWriteStat> hoodieWriteStat) {
+    this(commitTime, tableName, basePath, hoodieWriteStat, Option.empty(), Option.empty());
+  }
+
+  public HoodieWriteCommitCallbackMessage(String commitTime,
+                                          String tableName,
+                                          String basePath,
+                                          List<HoodieWriteStat> hoodieWriteStat,
+                                          Option<String> commitActionType,
+                                          Option<Map<String, String>> extraMetadata) {
     this.commitTime = commitTime;
     this.tableName = tableName;
     this.basePath = basePath;
     this.hoodieWriteStat = hoodieWriteStat;
+    this.commitActionType = commitActionType;
+    this.extraMetadata = extraMetadata;
   }
 
   public String getCommitTime() {
@@ -74,4 +100,12 @@ public class HoodieWriteCommitCallbackMessage implements Serializable {
   public List<HoodieWriteStat> getHoodieWriteStat() {
     return hoodieWriteStat;
   }
+
+  public Option<String> getCommitActionType() {
+    return commitActionType;
+  }
+
+  public Option<Map<String, String>> getExtraMetadata() {
+    return extraMetadata;
+  }
 }
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
index 7dbd07ea1cc..a3aa6699027 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
@@ -265,7 +265,8 @@ public abstract class BaseHoodieWriteClient 
extends BaseHoodieClient
   if (null == commitCallback) {
 commitCallback = HoodieCommitCallbackFactory.create(config);
   }
-  commitCallback.call(new HoodieWriteCommitCallbackMessage(instantTime, 
config.getTableName(), config.getBasePath(), stats));
+  commitCallback.call(new HoodieWriteCommitCallbackMessage(
+  instantTime, config.getTableName(), config.getBasePath(), stats, 
Option.of(commitActionType), extraMetadata));
 }
 return true;
   }
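
For readers of this commit, here is a minimal sketch of how a callback consumer might handle the two new optional fields. It is not Hudi code: `CallbackMessageSketch` and `describe` are hypothetical names, and `java.util.Optional` stands in for Hudi's own `Option` type so the example compiles on its own. It only illustrates the pattern the diff introduces, where the old constructor delegates with empty values and consumers must tolerate absent fields.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified stand-in for the callback message. java.util.Optional
// replaces Hudi's Option so the sketch compiles without Hudi on the classpath.
final class CallbackMessageSketch {
  final String commitTime;
  final String tableName;
  final Optional<String> commitActionType;
  final Optional<Map<String, String>> extraMetadata;

  // Old-style constructor: delegates with empty optionals, as in the diff.
  CallbackMessageSketch(String commitTime, String tableName) {
    this(commitTime, tableName, Optional.empty(), Optional.empty());
  }

  CallbackMessageSketch(String commitTime,
                        String tableName,
                        Optional<String> commitActionType,
                        Optional<Map<String, String>> extraMetadata) {
    this.commitTime = commitTime;
    this.tableName = tableName;
    this.commitActionType = commitActionType;
    this.extraMetadata = extraMetadata;
  }

  // Consumer-side logic: never assume the new fields are present.
  static String describe(CallbackMessageSketch msg) {
    String action = msg.commitActionType.orElse("unknown");
    int extraKeys = msg.extraMetadata.map(Map::size).orElse(0);
    return msg.tableName + "@" + msg.commitTime
        + " action=" + action + " extraKeys=" + extraKeys;
  }

  public static void main(String[] args) {
    CallbackMessageSketch legacy = new CallbackMessageSketch("20231129205334", "trips");
    CallbackMessageSketch enriched = new CallbackMessageSketch(
        "20231129205334", "trips",
        Optional.of("commit"),
        Optional.of(Collections.singletonMap("checkpoint", "42")));
    System.out.println(describe(legacy));
    System.out.println(describe(enriched));
  }
}
```

Because the old constructor remains, callers that build the message without the new fields keep working, and consumers see empty optionals rather than nulls.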



Re: [PR] [HUDI-7161] Add commit action type and extra metadata to write callback on commit message [hudi]

2023-11-29 Thread via GitHub


nsivabalan merged PR #10213:
URL: https://github.com/apache/hudi/pull/10213





Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-11-29 Thread via GitHub


linliu-code commented on code in PR #10167:
URL: https://github.com/apache/hudi/pull/10167#discussion_r1410140356


##
hudi-spark-datasource/hudi-spark3.5.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_5Adapter.scala:
##
@@ -127,4 +126,17 @@ class Spark3_5Adapter extends BaseSpark3Adapter {
 case OFF_HEAP => "OFF_HEAP"
 case _ => throw new IllegalArgumentException(s"Invalid StorageLevel: 
$level")
   }
+
+  override def appendRowIndexColumnForParquetFileReader(requiredSchema: 
StructType, shouldUseRecordPosition: Boolean): StructType = {
+if (shouldUseRecordPosition) StructType(requiredSchema.toArray :+ 
FileSourceGeneratedMetadataStructField(
+  ROW_INDEX_TEMPORARY_COLUMN_NAME, ROW_INDEX_TEMPORARY_COLUMN_NAME, 
LongType, nullable = false)) else requiredSchema
+  }
+
+  override def appendRowIndexColumnForFileGroupReader(requiredSchema: 
StructType, shouldUseRecordPosition: Boolean): StructType = {
+if (shouldUseRecordPosition) StructType(requiredSchema.toArray :+ 
ROW_INDEX_FIELD) else requiredSchema
+  }
+
+  override def getDataFilters(requiredFilters: Seq[Filter], 
recordKeyRelatedFilters: Seq[Filter], shouldUseRecordPosition: Boolean): 
Seq[Filter] = {
+requiredFilters ++ recordKeyRelatedFilters
+  }

Review Comment:
   Do you have any examples of doing that? What are the benefits?






Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-11-29 Thread via GitHub


linliu-code commented on code in PR #10167:
URL: https://github.com/apache/hudi/pull/10167#discussion_r1410140489


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -300,12 +303,13 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
 val baseFileReader = super.buildReaderWithPartitionValues(sparkSession, 
dataSchema, partitionSchema, requiredSchema,
   filters ++ requiredFilters, options, new Configuration(hadoopConf))
 
-//file reader for reading a hudi base file that needs to be merged with 
log files
+// File reader for reading a Hoodie base file that needs to be merged with 
log files
 val preMergeBaseFileReader = if (isMOR) {
   // Add support for reading files using inline file system.
-  super.buildReaderWithPartitionValues(sparkSession, dataSchema, 
partitionSchema, requiredSchemaWithMandatory,
-if (shouldUseRecordPosition) requiredFilters else 
recordKeyRelatedFilters ++ requiredFilters,
-options, new Configuration(hadoopConf))
+  val appliedRequiredSchema = 
sparkAdapter.appendRowIndexColumnForParquetFileReader(requiredSchemaWithMandatory,
 shouldUseRecordPosition)

Review Comment:
   Ok.






Re: [PR] [HUDI-7077] Fix OOM error for a test [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10216:
URL: https://github.com/apache/hudi/pull/10216#issuecomment-1833073697

   
   ## CI report:
   
   * 9206bb059d0f22ee3e1110b3d269ce2f777c358f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21243)
 
   * a015a4f2dce1814e8387a41a6cf9842404a874c1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21245)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1833073505

   
   ## CI report:
   
   * 6de03812710b802b83d8b2efb8c31849d4be0202 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21242)
 
   * fcc90e964c0ee3a12f0f90bf216051e0bc3b7eaa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21244)
 
   
   



Re: [PR] [MINOR] Allow concurrent modification for heartbeat map [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10215:
URL: https://github.com/apache/hudi/pull/10215#issuecomment-1833068070

   
   ## CI report:
   
   * bd5d820f323c66fbcf7492c61d23585a581e76cc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21238)
 
   
   



Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1833063296

   
   ## CI report:
   
   * 3e63db8a1620a25197071d21714d06144f1fbb04 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21237)
 
   * 6de03812710b802b83d8b2efb8c31849d4be0202 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21242)
 
   * fcc90e964c0ee3a12f0f90bf216051e0bc3b7eaa UNKNOWN
   
   



Re: [PR] [HUDI-7077] Fix OOM error for a test [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10216:
URL: https://github.com/apache/hudi/pull/10216#issuecomment-1833058403

   
   ## CI report:
   
   * 9206bb059d0f22ee3e1110b3d269ce2f777c358f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21243)
 
   * a015a4f2dce1814e8387a41a6cf9842404a874c1 UNKNOWN
   
   



Re: [PR] [HUDI-7041] Optimize the memory usage of timeline server for table service [hudi]

2023-11-29 Thread via GitHub


zhuanshenbsj1 commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1833053275

   > Hi, could you please merge this into the 0.14.1 (0.x) branch as well? When 
will version 0.14.1 be released?
   > 
   > @zhuanshenbsj1 @danny0405
   
   Yes, it will be merged into 0.14.1, which will be released soon.





Re: [PR] [HUDI-7041] Optimize the memory usage of timeline server for table service [hudi]

2023-11-29 Thread via GitHub


zyclove commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1833035662

   Hi, could you please merge this into the 0.14.1 (0.x) branch as well? When will 
version 0.14.1 be released?
   
   
   @zhuanshenbsj1 @danny0405 





Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-11-29 Thread via GitHub


yihua commented on code in PR #10167:
URL: https://github.com/apache/hudi/pull/10167#discussion_r1410100073


##
hudi-spark-datasource/hudi-spark3.5.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_5Adapter.scala:
##
@@ -127,4 +126,17 @@ class Spark3_5Adapter extends BaseSpark3Adapter {
 case OFF_HEAP => "OFF_HEAP"
 case _ => throw new IllegalArgumentException(s"Invalid StorageLevel: 
$level")
   }
+
+  override def appendRowIndexColumnForParquetFileReader(requiredSchema: 
StructType, shouldUseRecordPosition: Boolean): StructType = {
+if (shouldUseRecordPosition) StructType(requiredSchema.toArray :+ 
FileSourceGeneratedMetadataStructField(
+  ROW_INDEX_TEMPORARY_COLUMN_NAME, ROW_INDEX_TEMPORARY_COLUMN_NAME, 
LongType, nullable = false)) else requiredSchema
+  }
+
+  override def appendRowIndexColumnForFileGroupReader(requiredSchema: 
StructType, shouldUseRecordPosition: Boolean): StructType = {
+if (shouldUseRecordPosition) StructType(requiredSchema.toArray :+ 
ROW_INDEX_FIELD) else requiredSchema
+  }
+
+  override def getDataFilters(requiredFilters: Seq[Filter], 
recordKeyRelatedFilters: Seq[Filter], shouldUseRecordPosition: Boolean): 
Seq[Filter] = {
+requiredFilters ++ recordKeyRelatedFilters
+  }

Review Comment:
   Can we directly use Spark version as the criteria to fetch the row index 
with the Spark parquet reader instead of adding new APIs here?






Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1833029957

   
   ## CI report:
   
   * 3e63db8a1620a25197071d21714d06144f1fbb04 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21237)
 
   * 6de03812710b802b83d8b2efb8c31849d4be0202 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21242)
 
   
   



Re: [PR] [HUDI-7077] Fix OOM error for a test [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10216:
URL: https://github.com/apache/hudi/pull/10216#issuecomment-1833030144

   
   ## CI report:
   
   * 9206bb059d0f22ee3e1110b3d269ce2f777c358f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21243)
 
   
   



Re: [PR] [HUDI-6953] Adding test for composite keys with bulk insert row writer [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10214:
URL: https://github.com/apache/hudi/pull/10214#issuecomment-1833030107

   
   ## CI report:
   
   * 0ee77f22a2f213a1c581e443a52eb6965832abc4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21233)
 
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1833029928

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * 31fe075c72fb189b9155e48ab3399e9199cc293a Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21231)
 
   * 993d0dc63f3f552b8bf5b52c113f3ae8ef53304c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21241)
 
   
   



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410095547


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  private def getExpressionBuckets(expr: Expression, bucketColumnName: String, 
numBuckets: Int): BitSet = {
+
+def getBucketNumber(attr: Attribute, v: Any): Int = {
+  
BucketIdentifier.getBucketId(JavaConverters.seqAsJavaListConverter(List.apply(String.valueOf(v))).asJava,
 numBuckets)

Review Comment:
   To support index key pruning, only conjunctive predicates that combine 
equality conditions on the hash keys are supported; see: 
https://github.com/apache/hudi/blob/d1c4ead8a80bc44731f1b615ba9166041c144948/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java#L315
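
   The restriction above can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: `BucketPruningSketch`, `Eq`, `prune`, and `bucketId` are illustrative names, and the hash in `bucketId` is a stand-in that is not claimed to match Hudi's `BucketIdentifier`. It only shows the idea that an AND of equality conditions on the hash key narrows the bucket set, while filters on other columns leave it unchanged.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class BucketPruningSketch {

  // Hypothetical predicate model: a single "column = literal" condition.
  static final class Eq {
    final String column;
    final Object value;

    Eq(String column, Object value) {
      this.column = column;
      this.value = value;
    }
  }

  // Stand-in hash for illustration only; the real bucket function may differ.
  static int bucketId(List<String> hashKeyValues, int numBuckets) {
    return (hashKeyValues.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // Treats the list as a conjunction (AND). Returns empty when no condition
  // touches the hash key, meaning every bucket must still be scanned.
  static Optional<BitSet> prune(List<Eq> conjuncts, String hashKey, int numBuckets) {
    BitSet result = null;
    for (Eq eq : conjuncts) {
      if (!eq.column.equals(hashKey)) {
        continue; // conditions on other columns cannot narrow the buckets
      }
      BitSet one = new BitSet(numBuckets);
      one.set(bucketId(Collections.singletonList(String.valueOf(eq.value)), numBuckets));
      if (result == null) {
        result = one;
      } else {
        result.and(one); // AND semantics: intersect the candidate buckets
      }
    }
    return Optional.ofNullable(result);
  }

  public static void main(String[] args) {
    List<Eq> filters = Arrays.asList(new Eq("id", "a"), new Eq("city", "x"));
    System.out.println(prune(filters, "id", 8));
  }
}
```

   Disjunctions and range predicates are left out on purpose: under this model they cannot be mapped to a single bucket, so the sketch would fall back to scanning all buckets.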




Re: [PR] [HUDI-7077] Fix OOM error for a test [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10216:
URL: https://github.com/apache/hudi/pull/10216#issuecomment-1833023982

   
   ## CI report:
   
   * 9206bb059d0f22ee3e1110b3d269ce2f777c358f UNKNOWN
   
   



Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1833023807

   
   ## CI report:
   
   * 3e63db8a1620a25197071d21714d06144f1fbb04 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21237)
 
   * 6de03812710b802b83d8b2efb8c31849d4be0202 UNKNOWN
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1833023754

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * 31fe075c72fb189b9155e48ab3399e9199cc293a Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21231)
 
   * 993d0dc63f3f552b8bf5b52c113f3ae8ef53304c UNKNOWN
   
   



Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-11-29 Thread via GitHub


yihua commented on code in PR #10167:
URL: https://github.com/apache/hudi/pull/10167#discussion_r1410093495


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -300,12 +303,13 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
 val baseFileReader = super.buildReaderWithPartitionValues(sparkSession, 
dataSchema, partitionSchema, requiredSchema,
   filters ++ requiredFilters, options, new Configuration(hadoopConf))
 
-//file reader for reading a hudi base file that needs to be merged with 
log files
+// File reader for reading a Hoodie base file that needs to be merged with 
log files
 val preMergeBaseFileReader = if (isMOR) {
   // Add support for reading files using inline file system.
-  super.buildReaderWithPartitionValues(sparkSession, dataSchema, 
partitionSchema, requiredSchemaWithMandatory,
-if (shouldUseRecordPosition) requiredFilters else 
recordKeyRelatedFilters ++ requiredFilters,
-options, new Configuration(hadoopConf))
+  val appliedRequiredSchema = 
sparkAdapter.appendRowIndexColumnForParquetFileReader(requiredSchemaWithMandatory,
 shouldUseRecordPosition)

Review Comment:
   Could you add a test to validate the logic?






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410092117


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")
+None
+  } else {
+Some(matchedBuckets)
+  }
+}
+  }
+
+  private def getExpressionBuckets(expr: Expression, bucketColumnName: String, 
numBuckets: Int): BitSet = {
+
+def getBucketNumber(attr: Attribute, v: Any): Int = {
+  
BucketIdentifier.getBucketId(JavaConverters.seqAsJavaListConverter(List.apply(String.valueOf(v))).asJava,
 numBuckets)

Review Comment:
   Maybe you can just refer to the Flink implementation: 
https://github.com/apache/hudi/blob/master/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java
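
   For readers following the thread, here is a minimal Java sketch of the pruning idea in the quoted diff: an equality predicate on the single hash field maps to one bucket, and only files belonging to that bucket are kept. The class and method names are hypothetical. The hashing is a stand-in for `BucketIdentifier.getBucketId`, and the 8-digit bucket prefix on file names is an assumption; neither is taken from this PR.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

/** Illustrative only: names here are hypothetical, not Hudi APIs. */
final class BucketPruningSketch {

  // Stand-in for the real bucket hashing (assumption): hash the list of key
  // values, clear the sign bit, then take the modulus.
  static int bucketIdFor(String keyValue, int numBuckets) {
    List<String> hashKeys = Collections.singletonList(keyValue);
    return (hashKeys.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // An equality predicate on the single hash field selects exactly one bucket.
  static BitSet bucketsForEquality(String literal, int numBuckets) {
    BitSet buckets = new BitSet(numBuckets);
    buckets.set(bucketIdFor(literal, numBuckets));
    return buckets;
  }

  // Keeps only files whose bucket is selected. Assumes (for this sketch) that
  // each file name starts with an 8-digit, zero-padded bucket id.
  static List<String> candidateFiles(List<String> fileNames, BitSet buckets) {
    List<String> kept = new ArrayList<>();
    for (String name : fileNames) {
      int bucketId = Integer.parseInt(name.substring(0, 8));
      if (buckets.get(bucketId)) {
        kept.add(name);
      }
    }
    return kept;
  }
}
```

   If the resulting bit set has every bucket set, nothing can be pruned, which corresponds to the `None` fallback in the diff.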



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7161] Add commit action type and extra metadata to write callback on commit message [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10213:
URL: https://github.com/apache/hudi/pull/10213#issuecomment-1833018200

   
   ## CI report:
   
   * 3ac05bdf864a129a74110e1ddacf1f0c8a85 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21234)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1833017794

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * 31fe075c72fb189b9155e48ab3399e9199cc293a Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21231)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410090727


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BucketIndexSupport.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.FileStatus
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieIndexConfig
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.index.HoodieIndex.IndexType
+import org.apache.hudi.index.bucket.BucketIdentifier
+import org.apache.log4j.LogManager
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, EmptyRow, 
Expression, Literal}
+import org.apache.spark.sql.types.{DoubleType, FloatType}
+import org.apache.spark.util.collection.BitSet
+
+import scala.collection.{JavaConverters, mutable}
+
+class BucketIndexSupport(metadataConfig: HoodieMetadataConfig) {
+
+  private val log = LogManager.getLogger(getClass);
+
+  /**
+   * Returns the configured bucket field for the table
+   */
+  private def getBucketHashField: Option[String] = {
+val bucketHashFields = 
metadataConfig.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD)
+if (bucketHashFields == null) {
+  val recordKeys = 
metadataConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
+  if (recordKeys == null) {
+Option.apply(null)
+  } else {
+val recordKeyArray = recordKeys.split(",")
+if (recordKeyArray.length == 1) {
+  Option.apply(recordKeyArray(0))
+} else {
+  log.warn("bucket query index only support one bucket field")
+  Option.apply(null)
+}
+  }
+} else {
+  val fields = bucketHashFields.split(",")
+  if (fields.length == 1) {
+Option.apply(fields(0))
+  } else {
+log.warn("bucket query index only support one bucket field")
+Option.apply(null)
+  }
+}
+  }
+
+  def getCandidateFiles(allFiles: Seq[FileStatus], bucketIds: BitSet): 
Set[String] = {
+val candidateFiles: mutable.Set[String] = mutable.Set.empty
+for (file <- allFiles) {
+  val fileId = FSUtils.getFileIdFromFilePath(file.getPath)
+  val fileBucketId = BucketIdentifier.bucketIdFromFileId(fileId)
+  if (bucketIds.get(fileBucketId)) {
+candidateFiles += file.getPath.getName
+  }
+}
+candidateFiles.toSet
+  }
+
+  def filterQueriesWithBucketHashField(queryFilters: Seq[Expression]): 
Option[BitSet] = {
+val bucketNumber = 
metadataConfig.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS)
+val bucketHashFieldOpt = getBucketHashField
+if (bucketHashFieldOpt.isEmpty || queryFilters.isEmpty) {
+  None
+} else {
+  val matchedBuckets = getExpressionBuckets(queryFilters.reduce(And), 
bucketHashFieldOpt.get, bucketNumber)
+
+  val numBucketsSelected = matchedBuckets.cardinality()
+
+  // None means all the buckets need to be scanned
+  if (numBucketsSelected == bucketNumber) {
+log.info("bucket query match all file slice, fallback other index")

Review Comment:
   File slice is an internal notion; maybe you should say `The query predicates 
do not specify equality for all the hashing fields, ...`






Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2023-11-29 Thread via GitHub


danny0405 commented on code in PR #10191:
URL: https://github.com/apache/hudi/pull/10191#discussion_r1410089268


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -340,9 +347,18 @@ case class HoodieFileIndex(spark: SparkSession,
 //   and candidate files are obtained from these file slices.
 
 lazy val queryReferencedColumns = collectReferencedColumns(spark, 
queryFilters, schema)
-
+// bucket query index
+var bucketIds = Option.empty[BitSet]
+if (bucketIndex.isIndexAvailable && isDataSkippingEnabled) {
+  bucketIds = bucketIndex.filterQueriesWithBucketHashField(queryFilters)
+}
+// record index
 lazy val (_, recordKeys) = 
recordLevelIndex.filterQueriesWithRecordKey(queryFilters)
-if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
+
+// index chose
+if (bucketIndex.isIndexAvailable && bucketIds.isDefined && 
bucketIds.get.cardinality() > 0) {
+  Option.apply(bucketIndex.getCandidateFiles(allBaseFiles, bucketIds.get))

Review Comment:
   We are just doing two levels of pruning/skipping here:
   
   1. file group skipping with the bucket index (so that the overall candidates 
are pruned before the next step);
   2. file skipping within a file group.
   
   These two steps should be orthogonal, and we could have both. Maybe RLI does 
not make sense when the hash keys equal the primary keys, but when the hash keys 
are a subset of the record keys, we can still have the gains.
   
   And if there are some other predicates, like max/min from the column stats, 
we can then skip individual files as well.
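
   The two levels described above can be sketched as two independent filters applied in sequence. This is a hedged illustration in Java, not the `HoodieFileIndex` code: `FileInfo`, `pruneByBucket`, and `pruneByRange` are hypothetical names, and the min/max range check stands in for whatever column-stats predicate applies.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative only: a hypothetical two-stage pruning pipeline. */
final class TwoLevelPruningSketch {

  /** Hypothetical per-file summary: owning bucket plus min/max of one column. */
  static final class FileInfo {
    final String name;
    final int bucketId;
    final long min;
    final long max;

    FileInfo(String name, int bucketId, long min, long max) {
      this.name = name;
      this.bucketId = bucketId;
      this.min = min;
      this.max = max;
    }
  }

  // Level 1: drop every file whose file group (bucket) is not selected.
  static List<FileInfo> pruneByBucket(List<FileInfo> files, BitSet buckets) {
    List<FileInfo> kept = new ArrayList<>();
    for (FileInfo f : files) {
      if (buckets.get(f.bucketId)) {
        kept.add(f);
      }
    }
    return kept;
  }

  // Level 2: among the survivors, drop files whose [min, max] range cannot
  // contain the value being looked up.
  static List<FileInfo> pruneByRange(List<FileInfo> files, long value) {
    List<FileInfo> kept = new ArrayList<>();
    for (FileInfo f : files) {
      if (f.min <= value && value <= f.max) {
        kept.add(f);
      }
    }
    return kept;
  }

  // The stages are orthogonal: the second only sees what the first kept.
  static List<FileInfo> prune(List<FileInfo> files, BitSet buckets, long value) {
    return pruneByRange(pruneByBucket(files, buckets), value);
  }
}
```

   Because each stage only narrows its input, either one can be skipped when its index is unavailable without affecting correctness.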






[jira] [Updated] (HUDI-7077) Re-enable tests in TestSparkDataSource

2023-11-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7077:
-
Labels: pull-request-available  (was: )

> Re-enable tests in TestSparkDataSource
> --
>
> Key: HUDI-7077
> URL: https://issues.apache.org/jira/browse/HUDI-7077
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> In CI, TestSparkDataSource causes the job to fail due to memory issue but 
> locally the tests run fine.  The tests are disabled in TestSparkDataSource 
> temporarily.  We need to re-enable them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7077] Fix OOM error for a test [hudi]

2023-11-29 Thread via GitHub


linliu-code opened a new pull request, #10216:
URL: https://github.com/apache/hudi/pull/10216

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





(hudi) branch master updated (5b5d7465c7b -> d1c4ead8a80)

2023-11-29 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 5b5d7465c7b [Minor] Remove useless ')' for ConfigProperty.toString  
(#10208)
 add d1c4ead8a80 [HUDI-7128][FOLLOW-UP] support metadatadelete with batch 
mode (#10210)

No new revisions were added by this update.

Summary of changes:
 .../procedures/DeleteMetadataTableProcedure.scala  | 22 +---
 .../sql/hudi/procedure/TestMetadataProcedure.scala | 58 ++
 2 files changed, 72 insertions(+), 8 deletions(-)



Re: [PR] [HUDI-7128][FOLLOW-UP] Support metadatadelete with batch mode [hudi]

2023-11-29 Thread via GitHub


danny0405 merged PR #10210:
URL: https://github.com/apache/hudi/pull/10210





Re: [PR] Remove useless ) for ConfigProperty.toString [hudi]

2023-11-29 Thread via GitHub


danny0405 merged PR #10208:
URL: https://github.com/apache/hudi/pull/10208





Re: [PR] Remove useless ) for ConfigProperty.toString [hudi]

2023-11-29 Thread via GitHub


danny0405 commented on PR #10208:
URL: https://github.com/apache/hudi/pull/10208#issuecomment-1833002690

   Test passed: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=21216&view=results





(hudi) branch master updated: [Minor] Remove useless ')' for ConfigProperty.toString (#10208)

2023-11-29 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 5b5d7465c7b [Minor] Remove useless ')' for ConfigProperty.toString  
(#10208)
5b5d7465c7b is described below

commit 5b5d7465c7b8f873fb6aabedd8846221b2709fa1
Author: hehuiyuan <471627...@qq.com>
AuthorDate: Thu Nov 30 10:24:17 2023 +0800

[Minor] Remove useless ')' for ConfigProperty.toString  (#10208)
---
 .../src/main/java/org/apache/hudi/common/config/ConfigProperty.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/config/ConfigProperty.java 
b/hudi-common/src/main/java/org/apache/hudi/common/config/ConfigProperty.java
index d4ed193a041..aa2cf642309 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/config/ConfigProperty.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/config/ConfigProperty.java
@@ -233,7 +233,7 @@ public class ConfigProperty implements Serializable {
   @Override
   public String toString() {
 return String.format(
-"Key: '%s' , default: %s , isAdvanced: %s , description: %s since 
version: %s deprecated after: %s)",
+"Key: '%s' , default: %s , isAdvanced: %s , description: %s since 
version: %s deprecated after: %s",
 key, defaultValue, advanced, doc, sinceVersion.isPresent() ? 
sinceVersion.get() : "version is not defined",
 deprecatedVersion.isPresent() ? deprecatedVersion.get() : "version is 
not defined");
   }



Re: [PR] [HUDI-7128][FOLLOW-UP] Support metadatadelete with batch mode [hudi]

2023-11-29 Thread via GitHub


xuzifu666 commented on PR #10210:
URL: https://github.com/apache/hudi/pull/10210#issuecomment-1832983093

   cc @danny0405 PTAL





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1832972393

   
   ## CI report:
   
   * 3e63db8a1620a25197071d21714d06144f1fbb04 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21237)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1832972346

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * 747894399d3bef0e05c561d9c67db61ab2536cf9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21228)
 
   * 31fe075c72fb189b9155e48ab3399e9199cc293a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21231)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Allow concurrent modification for heartbeat map [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10215:
URL: https://github.com/apache/hudi/pull/10215#issuecomment-1832935708

   
   ## CI report:
   
   * bd5d820f323c66fbcf7492c61d23585a581e76cc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21238)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10185:
URL: https://github.com/apache/hudi/pull/10185#issuecomment-1832935577

   
   ## CI report:
   
   * 72201eb9e3ee19dc3e2cd815bc035af8f435b98f UNKNOWN
   * 66d442d8f652fbd5251dabee5f2c141dbae19821 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21236)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1832935446

   
   ## CI report:
   
   * 847fee8e1ce7b0e2d9af6dadbc802f4d67f06ee7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21171)
 
   * 3e63db8a1620a25197071d21714d06144f1fbb04 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21237)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Allow concurrent modification for heartbeat map [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10215:
URL: https://github.com/apache/hudi/pull/10215#issuecomment-1832930088

   
   ## CI report:
   
   * bd5d820f323c66fbcf7492c61d23585a581e76cc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10185:
URL: https://github.com/apache/hudi/pull/10185#issuecomment-1832929954

   
   ## CI report:
   
   * 72201eb9e3ee19dc3e2cd815bc035af8f435b98f UNKNOWN
   * f7613c8544b014519fe0142a3a42b72fbfc698a3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21235)
 
   * 66d442d8f652fbd5251dabee5f2c141dbae19821 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1832929841

   
   ## CI report:
   
   * 847fee8e1ce7b0e2d9af6dadbc802f4d67f06ee7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21171)
 
   * 3e63db8a1620a25197071d21714d06144f1fbb04 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10185:
URL: https://github.com/apache/hudi/pull/10185#issuecomment-1832923838

   
   ## CI report:
   
   * 72201eb9e3ee19dc3e2cd815bc035af8f435b98f UNKNOWN
   * d054e55f468fcf6ad312f6d4c4100e69f7554715 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21232)
 
   * f7613c8544b014519fe0142a3a42b72fbfc698a3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-11-29 Thread via GitHub


linliu-code commented on PR #10167:
URL: https://github.com/apache/hudi/pull/10167#issuecomment-1832916848

   
   @codope 





[PR] [MINOR] Allow concurrent modification for heartbeat map [hudi]

2023-11-29 Thread via GitHub


linliu-code opened a new pull request, #10215:
URL: https://github.com/apache/hudi/pull/10215

   ### Change Logs
   
   Previously, we saw a `ConcurrentModificationException`.
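   
   The description does not show the fix itself. As a hedged illustration of the failure mode, the Java sketch below assumes the heartbeat map was a plain `HashMap` that was modified while being iterated, and shows that a `ConcurrentHashMap` tolerates the same access pattern. `HeartbeatMapSketch` and its methods are hypothetical names, not code from this PR.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: hypothetical stand-in for a heartbeat registry. */
final class HeartbeatMapSketch {

  // Fills the map with three instants so that iteration has more than one step.
  static Map<String, Long> seed(Map<String, Long> map) {
    map.put("001", 1L);
    map.put("002", 2L);
    map.put("003", 3L);
    return map;
  }

  // Removes entries from the map while iterating over its key set.
  // Returns true if the map's iterator rejected the modification.
  static boolean removalDuringIterationFails(Map<String, Long> heartbeats) {
    try {
      for (String instant : heartbeats.keySet()) {
        heartbeats.remove(instant);
      }
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // HashMap iterators are fail-fast; ConcurrentHashMap iterators are weakly consistent.
    System.out.println(removalDuringIterationFails(seed(new HashMap<>())));
    System.out.println(removalDuringIterationFails(seed(new ConcurrentHashMap<>())));
  }
}
```

   The same exception can surface when a second thread mutates the map during iteration, which is the more likely scenario in a flaky test.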
   
   ### Impact
   
   1. Make the test less flaky.
   2. More robust in prod.
   
   ### Risk level (write none, low medium or high below)
   
   Low.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-6379] [DO NOT MERGE] Pulsar version change to fix snakeyaml CVE [hudi]

2023-11-29 Thread via GitHub


CTTY commented on PR #8973:
URL: https://github.com/apache/hudi/pull/8973#issuecomment-1832888434

   I assume this is no longer needed since we have #9670 





[jira] [Created] (HUDI-7162) RDD's Don't cache in some situations with new filegroup reader + new parquet file format

2023-11-29 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7162:
-

 Summary: RDD's Don't cache in some situations with new filegroup 
reader + new parquet file format
 Key: HUDI-7162
 URL: https://issues.apache.org/jira/browse/HUDI-7162
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark, spark-sql
Reporter: Jonathan Vexler


"Test Call rollback_to_instant Procedure with refreshTable" 

This test fails if a projection is added to the query plan. It does not currently 
fail, because we don't add the projection for non-partitioned tables. Adding the 
projection prevents the RDD from being cached.

Query plans:

without projection, caching works:
{code:java}
== Parsed Logical Plan ==
'Project ['id]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#547,_hoodie_commit_seqno#548,_hoodie_record_key#549,_hoodie_partition_path#550,_hoodie_file_name#551,id#552,name#553,price#554,ts#555L] parquet

== Analyzed Logical Plan ==
id: int
Project [id#552]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#547,_hoodie_commit_seqno#548,_hoodie_record_key#549,_hoodie_partition_path#550,_hoodie_file_name#551,id#552,name#553,price#554,ts#555L] parquet

== Optimized Logical Plan ==
InMemoryRelation [id#552], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) ColumnarToRow
      +- FileScan parquet default.h0[id#552] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/spark-87b3..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct

== Physical Plan ==
InMemoryTableScan [id#552]
   +- InMemoryRelation [id#552], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) ColumnarToRow
            +- FileScan parquet default.h0[id#552] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/spark-87b3..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{code}
With projection, no caching:
{code:java}
== Parsed Logical Plan ==
'Project ['id]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

== Analyzed Logical Plan ==
id: int
Project [id#544]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

== Optimized Logical Plan ==
Project [id#544]
+- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet default.h0[id#544] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/spark-8c60..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct

{code}





Re: [PR] [HUDI-7161] Add commit action type and extra metadata to write callback on commit message [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10213:
URL: https://github.com/apache/hudi/pull/10213#issuecomment-1832885296

   
   ## CI report:
   
   * 3ac05bdf864a129a74110e1ddacf1f0c8a85 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21234)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6953] Adding test for composite keys with bulk insert row writer [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10214:
URL: https://github.com/apache/hudi/pull/10214#issuecomment-1832885343

   
   ## CI report:
   
   * 0ee77f22a2f213a1c581e443a52eb6965832abc4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21233)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10185:
URL: https://github.com/apache/hudi/pull/10185#issuecomment-1832885165

   
   ## CI report:
   
   * 72201eb9e3ee19dc3e2cd815bc035af8f435b98f UNKNOWN
   * d054e55f468fcf6ad312f6d4c4100e69f7554715 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21232)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1832884937

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * dfa3bdee07f850efbacb55ecc84637339a953423 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21225)
 
   * 747894399d3bef0e05c561d9c67db61ab2536cf9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21228)
 
   * 31fe075c72fb189b9155e48ab3399e9199cc293a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21231)
 
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1832880118

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * dfa3bdee07f850efbacb55ecc84637339a953423 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21225)
 
   * 747894399d3bef0e05c561d9c67db61ab2536cf9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21228)
 
   * 31fe075c72fb189b9155e48ab3399e9199cc293a UNKNOWN
   
   



Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10185:
URL: https://github.com/apache/hudi/pull/10185#issuecomment-1832880229

   
   ## CI report:
   
   * 72201eb9e3ee19dc3e2cd815bc035af8f435b98f UNKNOWN
   * 9a3b347de974d626fdc52a5aafb06c5d2ec45cbd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21229)
 
   * d054e55f468fcf6ad312f6d4c4100e69f7554715 UNKNOWN
   
   



Re: [PR] [HUDI-6953] Adding test for composite keys with bulk insert row writer [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10214:
URL: https://github.com/apache/hudi/pull/10214#issuecomment-1832880348

   
   ## CI report:
   
   * 0ee77f22a2f213a1c581e443a52eb6965832abc4 UNKNOWN
   
   



Re: [PR] [HUDI-7161] Add commit action type and extra metadata to write callback on commit message [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10213:
URL: https://github.com/apache/hudi/pull/10213#issuecomment-1832880320

   
   ## CI report:
   
   * 3ac05bdf864a129a74110e1ddacf1f0c8a85 UNKNOWN
   
   



Re: [PR] [HUDI-7160] Copy over schema properties when adding Hudi Metadata fields [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10212:
URL: https://github.com/apache/hudi/pull/10212#issuecomment-1832874118

   
   ## CI report:
   
   * cfdccba5615427da35d9cba25a3867345f46265d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21226)
 
   
   



[jira] [Commented] (HUDI-6933) bulk_insert Fails if one of the composite key contains null

2023-11-29 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791365#comment-17791365
 ] 

sivabalan narayanan commented on HUDI-6933:
---

https://github.com/apache/hudi/pull/10214

> bulk_insert Fails if one of the composite key contains null
> ---
>
> Key: HUDI-6933
> URL: https://issues.apache.org/jira/browse/HUDI-6933
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Aditya Goenka
> Assignee: sivabalan narayanan
> Priority: Critical
> Fix For: 0.14.1
>
>
> Github Issue- [https://github.com/apache/hudi/issues/9799]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-6953] Adding test for composite keys with bulk insert row writer [hudi]

2023-11-29 Thread via GitHub


nsivabalan opened a new pull request, #10214:
URL: https://github.com/apache/hudi/pull/10214

   ### Change Logs
   
   Adding test for composite keys with bulk insert row writer
   
   ### Impact
   
   Improve test coverage
   
   ### Risk level (write none, low, medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Assigned] (HUDI-6933) bulk_insert Fails if one of the composite key contains null

2023-11-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-6933:
-

Assignee: sivabalan narayanan

> bulk_insert Fails if one of the composite key contains null
> ---
>
> Key: HUDI-6933
> URL: https://issues.apache.org/jira/browse/HUDI-6933
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Aditya Goenka
> Assignee: sivabalan narayanan
> Priority: Critical
> Fix For: 0.14.1
>
>
> Github Issue- [https://github.com/apache/hudi/issues/9799]





[jira] [Updated] (HUDI-7161) Add commit action type and extra metadata to write callback on commit message

2023-11-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7161:
-
Labels: pull-request-available  (was: )

> Add commit action type and extra metadata to write callback on commit message
> --
>
> Key: HUDI-7161
> URL: https://issues.apache.org/jira/browse/HUDI-7161
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Rajesh Mahindra
> Assignee: Rajesh Mahindra
> Priority: Major
> Labels: pull-request-available
>
> Add commit action type and extra metadata to write callback on commit message





[PR] [HUDI-7161] Add commit action type and extra metadata to write callback on commit message [hudi]

2023-11-29 Thread via GitHub


rmahindra123 opened a new pull request, #10213:
URL: https://github.com/apache/hudi/pull/10213

   ### Change Logs
   
   Add commit action type and extra metadata to write callback on commit message
   
   ### Impact
   
   No impact on the commit callback API
   
   ### Risk level (write none, low, medium or high below)
   
   low to medium
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7161) Add commit action type and extra metadata to write callback on commit message

2023-11-29 Thread Rajesh Mahindra (Jira)
Rajesh Mahindra created HUDI-7161:
-

Summary: Add commit action type and extra metadata to write callback on commit message
Key: HUDI-7161
URL: https://issues.apache.org/jira/browse/HUDI-7161
Project: Apache Hudi
Issue Type: Improvement
Reporter: Rajesh Mahindra


Add commit action type and extra metadata to write callback on commit message





[jira] [Assigned] (HUDI-7161) Add commit action type and extra metadata to write callback on commit message

2023-11-29 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra reassigned HUDI-7161:
-

Assignee: Rajesh Mahindra

> Add commit action type and extra metadata to write callback on commit message
> --
>
> Key: HUDI-7161
> URL: https://issues.apache.org/jira/browse/HUDI-7161
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Rajesh Mahindra
> Assignee: Rajesh Mahindra
> Priority: Major
>
> Add commit action type and extra metadata to write callback on commit message





Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10185:
URL: https://github.com/apache/hudi/pull/10185#issuecomment-1832836648

   
   ## CI report:
   
   * 72201eb9e3ee19dc3e2cd815bc035af8f435b98f UNKNOWN
   * 9a3b347de974d626fdc52a5aafb06c5d2ec45cbd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21229)
 
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1832836486

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * dfa3bdee07f850efbacb55ecc84637339a953423 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21225)
 
   * 747894399d3bef0e05c561d9c67db61ab2536cf9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21228)
 
   
   



Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10185:
URL: https://github.com/apache/hudi/pull/10185#issuecomment-1832827368

   
   ## CI report:
   
   * 72201eb9e3ee19dc3e2cd815bc035af8f435b98f UNKNOWN
   * dd88a687f7a95799cc4da6e71809c679cdf91673 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21200)
 
   * 9a3b347de974d626fdc52a5aafb06c5d2ec45cbd UNKNOWN
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1832827213

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * dfa3bdee07f850efbacb55ecc84637339a953423 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21225)
 
   * 747894399d3bef0e05c561d9c67db61ab2536cf9 UNKNOWN
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


linliu-code commented on code in PR #10137:
URL: https://github.com/apache/hudi/pull/10137#discussion_r1409955152


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -77,14 +78,18 @@ class SparkFileFormatInternalRowReaderContext(baseFileReader: Option[Partitioned
   }
 }).asInstanceOf[ClosableIterator[InternalRow]]
 } else {
-  if (baseFileReader.isEmpty) {
-throw new IllegalArgumentException("Base file reader is missing when instantiating "
-  + "SparkFileFormatInternalRowReaderContext.");
+  val key = generateKey(dataSchema, requiredSchema)
+  if (!readerMaps.contains(key)) {
+throw new IllegalStateException("schemas don't hash to a known reader")
   }
-  new CloseableInternalRowIterator(baseFileReader.get.apply(fileInfo))
+  new CloseableInternalRowIterator(readerMaps(key).apply(fileInfo))
 }
   }
 
+  private def generateKey(dataSchema: Schema, requestedSchema: Schema): Long = {

Review Comment:
   Hi Jon, I really feel that if you could split this PR into smaller PRs, it 
would be much easier for reviewers to understand and easier for the CI to pass.






Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10063:
URL: https://github.com/apache/hudi/pull/10063#issuecomment-1832819218

   
   ## CI report:
   
   * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN
   * e1999c6bb70849aa29723415791abac9879eff12 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21227)
 
   
   



[jira] [Closed] (HUDI-7139) Fix operation type for bulk insert with row writer in Hudi Streamer

2023-11-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7139.
-
Resolution: Fixed

> Fix operation type for bulk insert with row writer in Hudi Streamer
> ---
>
> Key: HUDI-7139
> URL: https://issues.apache.org/jira/browse/HUDI-7139
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.14.1
>
>
> {code:java}
> "operationType" : null {code}
> The operationType is null in the commit metadata of bulk insert operation 
> with row writer enabled in Hudi Streamer 
> (hoodie.streamer.write.row.writer.enable=true).





[jira] [Commented] (HUDI-7139) Fix operation type for bulk insert with row writer in Hudi Streamer

2023-11-29 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791352#comment-17791352
 ] 

sivabalan narayanan commented on HUDI-7139:
---

fixed in master 
[https://github.com/apache/hudi/commit/4f875edaecd495eaa8996fa8d81c102a971c599f]
 

> Fix operation type for bulk insert with row writer in Hudi Streamer
> ---
>
> Key: HUDI-7139
> URL: https://issues.apache.org/jira/browse/HUDI-7139
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.14.1
>
>
> {code:java}
> "operationType" : null {code}
> The operationType is null in the commit metadata of bulk insert operation 
> with row writer enabled in Hudi Streamer 
> (hoodie.streamer.write.row.writer.enable=true).





[jira] [Closed] (HUDI-7155) Add log to print wrong number of instant metadata files

2023-11-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7155.
-
Resolution: Fixed

> Add log to print wrong number of instant metadata files
> ---
>
> Key: HUDI-7155
> URL: https://issues.apache.org/jira/browse/HUDI-7155
> Project: Apache Hudi
> Issue Type: Improvement
> Components: archiving
> Reporter: zhuanshenbsj1
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.14.1
>
>






[jira] [Commented] (HUDI-7155) Add log to print wrong number of instant metadata files

2023-11-29 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791350#comment-17791350
 ] 

sivabalan narayanan commented on HUDI-7155:
---

fixed in master 
[https://github.com/apache/hudi/commit/817d81ad14f930c4744ff229640003fe7715b20c]
 

> Add log to print wrong number of instant metadata files
> ---
>
> Key: HUDI-7155
> URL: https://issues.apache.org/jira/browse/HUDI-7155
> Project: Apache Hudi
> Issue Type: Improvement
> Components: archiving
> Reporter: zhuanshenbsj1
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.14.1
>
>






Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1832765012

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * dfa3bdee07f850efbacb55ecc84637339a953423 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21225)
 
   
   



Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10063:
URL: https://github.com/apache/hudi/pull/10063#issuecomment-1832764871

   
   ## CI report:
   
   * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN
   * 34efaac278dde7fd73515e6d54418a6ff8815326 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20939)
 
   * e1999c6bb70849aa29723415791abac9879eff12 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21227)
 
   
   



Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10063:
URL: https://github.com/apache/hudi/pull/10063#issuecomment-1832754560

   
   ## CI report:
   
   * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN
   * 34efaac278dde7fd73515e6d54418a6ff8815326 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20939)
 
   * e1999c6bb70849aa29723415791abac9879eff12 UNKNOWN
   
   



Re: [PR] [HUDI-7140] [DNM] Trial Patch to test CI run [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10176:
URL: https://github.com/apache/hudi/pull/10176#issuecomment-183274

   
   ## CI report:
   
   * 3c894596a90a326707d4aa052e34cf9f09daae75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21224)
 
   
   



Re: [PR] [HUDI-7160] Copy over schema properties when adding Hudi Metadata fields [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10212:
URL: https://github.com/apache/hudi/pull/10212#issuecomment-1832693020

   
   ## CI report:
   
   * c106dd446a9ea4ec82cc00285c6b099c50555bfd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21222)
 
   * cfdccba5615427da35d9cba25a3867345f46265d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21226)
 
   
   



Re: [PR] [HUDI-7160] Copy over schema properties when adding Hudi Metadata fields [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10212:
URL: https://github.com/apache/hudi/pull/10212#issuecomment-1832682300

   
   ## CI report:
   
   * c106dd446a9ea4ec82cc00285c6b099c50555bfd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21222)
 
   * cfdccba5615427da35d9cba25a3867345f46265d UNKNOWN
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1832681988

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * bfc0a855cadb4f6329bd38a470ade931797c53ab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21223)
 
   * dfa3bdee07f850efbacb55ecc84637339a953423 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21225)
 
   
   



Re: [PR] [HUDI-7160] Copy over schema properties when adding Hudi Metadata fields [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10212:
URL: https://github.com/apache/hudi/pull/10212#issuecomment-1832672521

   
   ## CI report:
   
   * c106dd446a9ea4ec82cc00285c6b099c50555bfd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21222)
 
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1832672177

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * bfc0a855cadb4f6329bd38a470ade931797c53ab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21223)
 
   * dfa3bdee07f850efbacb55ecc84637339a953423 UNKNOWN
   
   



Re: [PR] [HUDI-7137] Implement Bootstrap for new FG reader [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10137:
URL: https://github.com/apache/hudi/pull/10137#issuecomment-1832622066

   
   ## CI report:
   
   * 77205b47c45501a0d9de1ebc74d5bb8c960cd95a UNKNOWN
   * b0b711e0c355320da652fa7f2d8669539873d4d6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21196)
 
   * bfc0a855cadb4f6329bd38a470ade931797c53ab UNKNOWN
   
   
   





Re: [PR] [HUDI-7140] [DNM] Trial Patch to test CI run [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10176:
URL: https://github.com/apache/hudi/pull/10176#issuecomment-1832622253

   
   ## CI report:
   
   * 7d8ce155ad5b95f8a26150554a6008cec0ef0653 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21221)
   * 3c894596a90a326707d4aa052e34cf9f09daae75 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21224)
   
   
   





Re: [PR] [HUDI-7160] Copy over schema properties when adding Hudi Metadata fields [hudi]

2023-11-29 Thread via GitHub


hudi-bot commented on PR #10212:
URL: https://github.com/apache/hudi/pull/10212#issuecomment-1832611516

   
   ## CI report:
   
   * c106dd446a9ea4ec82cc00285c6b099c50555bfd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21222)
   
   
   




