Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11008:
URL: https://github.com/apache/hudi/pull/11008#issuecomment-2051884870

   
   ## CI report:
   
   * e7dde68f9c2bda3e1045d3bcda6c2472072395a0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23218)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2051884556

   
   ## CI report:
   
   * 8f1ba6d46d8777f39c522d8bcac545ba3d4fd544 UNKNOWN
   
   



Re: [PR] [HUDI-7378] Fix Spark SQL DML with custom key generator [hudi]

2024-04-12 Thread via GitHub


jonvex commented on code in PR #10615:
URL: https://github.com/apache/hudi/pull/10615#discussion_r1562569055


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala:
##
@@ -201,8 +201,26 @@ object HoodieWriterUtils {
   
       diffConfigs.append(s"KeyGenerator:\t$datasourceKeyGen\t$tableConfigKeyGen\n")
     }
 
+    // Please note that the validation of partition path fields needs the key generator class
+    // for the table, since the custom key generator expects a different format of
+    // the value of the write config "hoodie.datasource.write.partitionpath.field"
+    // e.g., "col:simple,ts:timestamp", whereas the table config "hoodie.table.partition.fields"
+    // in hoodie.properties stores "col,ts".
+    // The "params" here may only contain the write config of partition path field,
+    // so we need to pass in the validated key generator class name.
+    val validatedKeyGenClassName = if (tableConfigKeyGen != null) {

Review Comment:
   So when `hoodie.datasource.write.partitionpath.field` is set, we don't set `hoodie.table.partition.fields`?
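
To make the two formats mentioned in the diff comment concrete, here is a minimal Scala sketch. The object `PartitionFieldFormats` and the helper `stripKeyGenTypes` are hypothetical names introduced only for illustration; they are not part of Hudi or of this PR. The sketch assumes the write config value is a comma-separated list of `field:type` pairs, as in the example "col:simple,ts:timestamp" from the comment.

```scala
object PartitionFieldFormats {
  // Hypothetical helper, not a Hudi API: drops the ":<type>" suffix from each
  // comma-separated partition field, e.g. "col:simple" becomes "col".
  // Fields that carry no type suffix are passed through unchanged.
  def stripKeyGenTypes(writeConfigValue: String): String =
    writeConfigValue
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(field => field.takeWhile(_ != ':'))
      .mkString(",")

  def main(args: Array[String]): Unit = {
    // Write config form (custom key generator) on the left,
    // table config form (hoodie.properties) on the right.
    println(stripKeyGenTypes("col:simple,ts:timestamp"))
  }
}
```

The conversion is lossy in one direction: the table config form no longer carries the per-field key generator types, which is presumably why the validation needs the key generator class name passed in separately.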



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -530,6 +539,40 @@ object ProvidesHoodieConfig {
   filterNullValues(overridingOpts)
   }
 
+  /**
+   * @param tableConfigKeyGeneratorClassName     key generator class name in the table config.
+   * @param partitionFieldNamesWithoutKeyGenType partition field names without key generator types
+   *                                             from the table config.
+   * @param catalogTable                         HoodieCatalogTable instance to fetch table properties.
+   * @return the write config value to set for "hoodie.datasource.write.partitionpath.field".
+   */
+  def getPartitionPathFieldWriteConfig(tableConfigKeyGeneratorClassName: String,
+                                       partitionFieldNamesWithoutKeyGenType: String,
+                                       catalogTable: HoodieCatalogTable): String = {
+    if (StringUtils.isNullOrEmpty(tableConfigKeyGeneratorClassName)) {
+      partitionFieldNamesWithoutKeyGenType
+    } else {
+      val writeConfigPartitionField = catalogTable.catalogProperties.get(PARTITIONPATH_FIELD.key())
+      val keyGenClass = ReflectionUtils.getClass(tableConfigKeyGeneratorClassName)
+      if (classOf[CustomKeyGenerator].equals(keyGenClass)

Review Comment:
   Do we want to make this cover any classes that extend `CustomKeyGenerator` as well?
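
As a rough illustration of the question above, the sketch below contrasts an exact class comparison with a subclass-aware check. The three classes are hypothetical stand-ins defined locally, not the real Hudi key generators; only `Class.equals` and `Class.isAssignableFrom` are real JDK calls.

```scala
object KeyGenClassCheck {
  // Hypothetical stand-ins for the real key generator classes.
  class CustomKeyGenerator
  class MyCustomKeyGenerator extends CustomKeyGenerator
  class SimpleKeyGenerator

  // Exact match only: a subclass of CustomKeyGenerator is NOT recognized.
  def isExactlyCustom(cls: Class[_]): Boolean =
    classOf[CustomKeyGenerator].equals(cls)

  // Matches CustomKeyGenerator itself and any class that extends it.
  def isCustomOrSubclass(cls: Class[_]): Boolean =
    classOf[CustomKeyGenerator].isAssignableFrom(cls)
}
```

With the `equals` form shown in the diff, a user-defined subclass would fall through to the non-custom branch; whether that is desirable is exactly what the review comment asks.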



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -528,6 +536,40 @@ object ProvidesHoodieConfig {
   filterNullValues(overridingOpts)
   }
 
+  /**
+   * @param tableConfigKeyGeneratorClassName     key generator class name in the table config.
+   * @param partitionFieldNamesWithoutKeyGenType partition field names without key generator types
+   *                                             from the table config.
+   * @param catalogTable                         HoodieCatalogTable instance to fetch table properties.
+   * @return the write config value to set for "hoodie.datasource.write.partitionpath.field".
+   */
+  def getPartitionPathFieldWriteConfig(tableConfigKeyGeneratorClassName: String,
+                                       partitionFieldNamesWithoutKeyGenType: String,
+                                       catalogTable: HoodieCatalogTable): String = {
+    if (StringUtils.isNullOrEmpty(tableConfigKeyGeneratorClassName)) {
+      partitionFieldNamesWithoutKeyGenType
+    } else {

Review Comment:
   So does this mean that it's still an issue for Flink, Hive, etc.?



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlWithCustomKeyGenerator.scala:
##
@@ -0,0 +1,571 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.HoodieSparkUtils
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.table.HoodieTableMetaClien

Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11008:
URL: https://github.com/apache/hudi/pull/11008#issuecomment-2051806863

   
   ## CI report:
   
   * e7dde68f9c2bda3e1045d3bcda6c2472072395a0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23218)
 
   
   



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11008:
URL: https://github.com/apache/hudi/pull/11008#issuecomment-2051793682

   
   ## CI report:
   
   * e7dde68f9c2bda3e1045d3bcda6c2472072395a0 UNKNOWN
   
   



Re: [PR] [HUDI-7609] Support array field type whose element type can be nullable [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11006:
URL: https://github.com/apache/hudi/pull/11006#issuecomment-2051780035

   
   ## CI report:
   
   * 33451d51be0e7999695483b980aba6d57052bf1b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23217)
 
   
   



Re: [I] [SUPPORT] Rollback failed clustering 0.12.2 [hudi]

2024-04-12 Thread via GitHub


VitoMakarevich commented on issue #10964:
URL: https://github.com/apache/hudi/issues/10964#issuecomment-2051734341

   I managed to do it with these two settings (both non-default):
   
   - [hoodie.clustering.updates.strategy](https://hudi.apache.org/docs/configurations/#hoodieclusteringupdatesstrategy) -> `org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy`
   - [hoodie.clustering.rollback.pending.replacecommit.on.conflict](https://hudi.apache.org/docs/configurations/#hoodieclusteringrollbackpendingreplacecommitonconflict) -> `true`
   
   The precondition is that your write must touch the clustered partitions; otherwise nothing will happen.
   
   Unfortunately, I don't see any other way to do it (short of copy-pasting some Hudi internals, which looks risky for many users).
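
For reference, the two settings named in this comment can be collected into an options map, as in the Scala sketch below. The object name `ClusteringRollbackOptions` is hypothetical; the keys and values are copied from the comment, and nothing here has been verified against Hudi 0.12.2 beyond what the comment states.

```scala
object ClusteringRollbackOptions {
  // Both values are non-default, as noted in the comment.
  val options: Map[String, String] = Map(
    // Allow updates to file groups that are part of a pending clustering plan.
    "hoodie.clustering.updates.strategy" ->
      "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy",
    // Roll back the pending replacecommit when the write conflicts with it.
    "hoodie.clustering.rollback.pending.replacecommit.on.conflict" -> "true"
  )
}
```

With a Spark DataFrame writer, a map like this would typically be passed through `.options(ClusteringRollbackOptions.options)` alongside the usual Hudi write options. As the comment notes, the rollback is only triggered if the write actually touches the clustered partitions.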





Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-12 Thread via GitHub


the-other-tim-brown commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2051712705

   @danny0405 https://github.com/apache/hudi/pull/11008





[PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-12 Thread via GitHub


the-other-tim-brown opened a new pull request, #11008:
URL: https://github.com/apache/hudi/pull/11008

   ### Change Logs
   
   Makes the ordering deterministic to get consistent results and avoid any 
issues in tests
   
   ### Impact
   
   Small file selection is consistent (mostly helps tests be consistent)
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-12 Thread via GitHub


the-other-tim-brown commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2051701476

   @danny0405 the error is:
   ```
   TestUpsertPartitioner.testUpsertPartitionerWithSmallFileHandlingPickingMultipleCandidates:470
   expected: <[BucketInfo {bucketType=UPDATE, fileIdPrefix=fg-1, partitionPath=2016/03/15}, BucketInfo {bucketType=UPDATE, fileIdPrefix=fg-2, partitionPath=2016/03/15}, BucketInfo {bucketType=UPDATE, fileIdPrefix=fg-3, partitionPath=2016/03/15}]>
   but was: <[BucketInfo {bucketType=UPDATE, fileIdPrefix=fg-3, partitionPath=2016/03/15}, BucketInfo {bucketType=UPDATE, fileIdPrefix=fg-2, partitionPath=2016/03/15}, BucketInfo {bucketType=UPDATE, fileIdPrefix=fg-1, partitionPath=2016/03/15}]>
   ```
   I'll put up a separate minor PR to make the ordering deterministic for small file handling.
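
One common way to get a stable order in a situation like this is to sort with an explicit tie-breaker. The Scala sketch below shows the general idea only; it is an assumption about the technique, not the change made in the follow-up PR. `SmallFileCandidate` is a hypothetical model and does not mirror Hudi's own small-file classes, and the sizes used are invented for illustration.

```scala
object DeterministicSmallFiles {
  // Hypothetical model of a small-file candidate; not a Hudi class.
  final case class SmallFileCandidate(fileIdPrefix: String, sizeBytes: Long)

  // Sort by size first, then by file id as a tie-breaker, so that candidates
  // with equal sizes come out in the same order regardless of input order.
  def ordered(candidates: Iterable[SmallFileCandidate]): List[SmallFileCandidate] =
    candidates.toList.sortBy(c => (c.sizeBytes, c.fileIdPrefix))

  def main(args: Array[String]): Unit = {
    val files = List(
      SmallFileCandidate("fg-3", 100L),
      SmallFileCandidate("fg-1", 100L),
      SmallFileCandidate("fg-2", 100L)
    )
    // The reversed input yields the same sequence as the original input.
    println(ordered(files) == ordered(files.reverse))
  }
}
```

Without the secondary key, the order of equal-sized candidates would depend on the iteration order of the source collection, which is the kind of variation the failing assertion above exposes.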





Re: [PR] [HUDI-7608] Fix Flink table creation configuration not taking effect when writing… [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11005:
URL: https://github.com/apache/hudi/pull/11005#issuecomment-2051683072

   
   ## CI report:
   
   * c0ca195bf69614784e60bd51d300df04a61fdf21 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23216)
 
   
   



[I] [SUPPORT]After compacting, there are a large number of logs with size 0, and they can never be cleared. [hudi]

2024-04-12 Thread via GitHub


MrAladdin opened a new issue, #11007:
URL: https://github.com/apache/hudi/issues/11007

   **Describe the problem you faced**
   1. Spark Structured Streaming: upsert into a MOR table (record_index)
   2. After compaction, there are a large number of log files of size 0, and they are never cleaned up.
   
   
   **Environment Description**
   
   * Hudi version :0.14.1
   
   * Spark version :3.4.1
   
   * Hive version :3.1.2
   
   * Hadoop version :3.1.3
   
   * Storage (HDFS/S3/GCS..) :hdfs
   
   * Running on Docker? (yes/no) :no
   
   
   **Additional context**
   .writeStream
 .format("hudi")
 .option("hoodie.table.base.file.format", "PARQUET")
 .option("hoodie.allow.empty.commit", "true")
 .option("hoodie.datasource.write.drop.partition.columns","false")
 .option("hoodie.table.services.enabled", "true")
 .option("hoodie.datasource.write.streaming.checkpoint.identifier", "lakehouse-dwd-social-kbi-beauty-v1-writer-1")
 .option(PRECOMBINE_FIELD.key(), "date_kbiUdate")
 .option(RECORDKEY_FIELD.key(), "records_key")
 .option(PARTITIONPATH_FIELD.key(), "partition_index_date")
 .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
 .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
 .option("hoodie.combine.before.upsert", "true")
 .option("hoodie.datasource.write.payload.class","org.apache.hudi.common.model.OverwriteWithLatestAvroPayload")
   
 //markers
 .option("hoodie.write.markers.type", "DIRECT")
   
 //timeline server
 .option("hoodie.embed.timeline.server", "true")
   
 //File System View Storage Configurations
 .option("hoodie.filesystem.view.remote.timeout.secs", "1200")
 .option("hoodie.filesystem.view.remote.retry.enable", "true")
 .option("hoodie.filesystem.view.remote.retry.initial_interval_ms", "500")
 .option("hoodie.filesystem.view.remote.retry.max_numbers", "15")
 .option("hoodie.filesystem.view.remote.retry.max_interval_ms", "8000")
   
 //schema cache
 .option("hoodie.schema.cache.enable", "true")
   
 //spark write
 .option("hoodie.datasource.write.streaming.ignore.failed.batch", "false")
 .option("hoodie.datasource.write.streaming.retry.count", "6")
 .option("hoodie.datasource.write.streaming.retry.interval.ms", "3000")
   
 //metadata
 .option("hoodie.metadata.enable", "true")
 .option("hoodie.metadata.index.async", "false")
 .option("hoodie.metadata.index.check.timeout.seconds", "900")
 .option("hoodie.auto.adjust.lock.configs", "true")
 .option("hoodie.metadata.optimized.log.blocks.scan.enable", "true")
 .option("hoodie.metadata.index.column.stats.enable", "false")
 .option("hoodie.metadata.index.column.stats.parallelism", "100")
 .option("hoodie.metadata.index.column.stats.file.group.count", "4")
 .option("hoodie.metadata.index.column.stats.column.list","date_udate,date_publishedAt")
 .option("hoodie.metadata.compact.max.delta.commits", "10")
   
   
 //metadata
 .option("hoodie.metadata.record.index.enable", "true")
 .option("hoodie.index.type", "RECORD_INDEX")
 .option("hoodie.metadata.max.init.parallelism", "10")
 .option("hoodie.metadata.record.index.min.filegroup.count", "10")
 .option("hoodie.metadata.record.index.max.filegroup.count", "1")
 .option("hoodie.metadata.record.index.max.filegroup.size", "1073741824")
 .option("hoodie.metadata.auto.initialize", "true")
 .option("hoodie.metadata.record.index.growth.factor", "2.0")
 .option("hoodie.metadata.max.logfile.size", "2147483648")
 .option("hoodie.metadata.log.compaction.enable", "false")
 .option("hoodie.metadata.log.compaction.blocks.threshold", "5")
 .option("hoodie.metadata.max.deltacommits.when_pending", "1000")
   
 //file size
 .option("hoodie.parquet.field_id.write.enabled", "true")
 .option("hoodie.copyonwrite.insert.auto.split", "true")
 .option("hoodie.record.size.estimation.threshold", "1.0")
 .option("hoodie.parquet.block.size", "536870912")
 .option("hoodie.parquet.max.file.size", "536870912")
 .option("hoodie.parquet.small.file.limit", "314572800")
 .option("hoodie.logfile.max.size", "536870912")
 .option("hoodie.logfile.data.block.max.size", "536870912")
 .option("hoodie.logfile.to.parquet.compression.ratio", "0.35")
   
 //archive
 .option("hoodie.

Re: [PR] [HUDI-7609] Support array field type whose element type can be nullable [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11006:
URL: https://github.com/apache/hudi/pull/11006#issuecomment-2051614975

   
   ## CI report:
   
   * 33451d51be0e7999695483b980aba6d57052bf1b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23217)
 
   
   



[jira] [Updated] (HUDI-7609) Spark cannot write the hudi table containing array type created by flink

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7609:
-
Labels: pull-request-available  (was: )

> Spark cannot write the hudi table containing array type created by flink
> 
>
> Key: HUDI-7609
> URL: https://issues.apache.org/jira/browse/HUDI-7609
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
>
> When Flink creates a Hudi table containing an array field, the array's 
> elements are non-nullable by default. When Spark SQL is then used to load data 
> from a Hive table into the Hudi table, a field validation exception occurs.
> {code:java}
> 2024-03-27 12:47:51 INFO org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'decentral_level1
> 2024-03-27 12:47:51 INFO at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:138)
> 2024-03-27 12:47:51 INFO at org.apache.spark.sql.types.StructType$.$anonfun$fromAttributes$1(StructType.scala:549)
> 2024-03-27 12:47:51 INFO at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> 2024-03-27 12:47:51 INFO at scala.collection.immutable.List.foreach(List.scala:392)
> 2024-03-27 12:47:51 INFO at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> 2024-03-27 12:47:51 INFO at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> 2024-03-27 12:47:51 INFO at scala.collection.immutable.List.map(List.scala:298)
> 2024-03-27 12:47:51 INFO at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:549)
> 2024-03-27 12:47:51 INFO at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:281)
> 2024-03-27 12:47:51 INFO at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:281)
> 2024-03-27 12:47:51 INFO at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignQueryOutput(InsertIntoHoodieTableCommand.scala:153)
> 2024-03-27 12:47:51 INFO at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:105)
> 2024-03-27 12:47:51 INFO at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7609] Support array field type whose element type can be nullable [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11006:
URL: https://github.com/apache/hudi/pull/11006#issuecomment-2051604069

   
   ## CI report:
   
   * 33451d51be0e7999695483b980aba6d57052bf1b UNKNOWN
   
   



Re: [I] [SUPPORT] Issue with Repartition on Kafka Input DataFrame and Same Precombine Value Rows In One Batch [hudi]

2024-04-12 Thread via GitHub


ad1happy2go commented on issue #10995:
URL: https://github.com/apache/hudi/issues/10995#issuecomment-2051561910

   @brightwon Yes, changing the precombine key is not allowed. I understand that 
you are repartitioning to scale the tagging stage. You can try repartitioning 
on the record key and see whether that gives consistent results.





[jira] [Created] (HUDI-7609) Spark cannot write the hudi table containing array type created by flink

2024-04-12 Thread Jira
陈磊 created HUDI-7609:


 Summary: Spark cannot write the hudi table containing array type 
created by flink
 Key: HUDI-7609
 URL: https://issues.apache.org/jira/browse/HUDI-7609
 Project: Apache Hudi
  Issue Type: Bug
Reporter: 陈磊


When Flink creates a Hudi table containing an array field, the array's elements 
are non-nullable by default. When Spark SQL is then used to load data from a 
Hive table into the Hudi table, a field validation exception occurs.
{code:java}
2024-03-27 12:47:51 INFO org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'decentral_level1
2024-03-27 12:47:51 INFO at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:138)
2024-03-27 12:47:51 INFO at org.apache.spark.sql.types.StructType$.$anonfun$fromAttributes$1(StructType.scala:549)
2024-03-27 12:47:51 INFO at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
2024-03-27 12:47:51 INFO at scala.collection.immutable.List.foreach(List.scala:392)
2024-03-27 12:47:51 INFO at scala.collection.TraversableLike.map(TraversableLike.scala:238)
2024-03-27 12:47:51 INFO at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
2024-03-27 12:47:51 INFO at scala.collection.immutable.List.map(List.scala:298)
2024-03-27 12:47:51 INFO at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:549)
2024-03-27 12:47:51 INFO at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:281)
2024-03-27 12:47:51 INFO at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:281)
2024-03-27 12:47:51 INFO at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignQueryOutput(InsertIntoHoodieTableCommand.scala:153)
2024-03-27 12:47:51 INFO at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:105)
2024-03-27 12:47:51 INFO at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
 {code}





[PR] Support array field type whose element type can be nullable [hudi]

2024-04-12 Thread via GitHub


empcl opened a new pull request, #11006:
URL: https://github.com/apache/hudi/pull/11006

   ### Change Logs
   
   _Support array field type whose element type can be nullable._
   
   ### Impact
   
   _none._
   
   ### Risk level (write none, low medium or high below)
   
   _none._
   





Re: [PR] [HUDI-7608] Fix Flink table creation configuration not taking effect when writing… [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11005:
URL: https://github.com/apache/hudi/pull/11005#issuecomment-2051533385

   
   ## CI report:
   
   * c0ca195bf69614784e60bd51d300df04a61fdf21 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23216)
 
   
   



Re: [PR] [HUDI-7608] Fix Flink table creation configuration not taking effect when writing… [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11005:
URL: https://github.com/apache/hudi/pull/11005#issuecomment-2051522148

   
   ## CI report:
   
   * c0ca195bf69614784e60bd51d300df04a61fdf21 UNKNOWN
   
   



[PR] [HUDI-7608] Fix Flink table creation configuration not taking effect when writing… [hudi]

2024-04-12 Thread via GitHub


empcl opened a new pull request, #11005:
URL: https://github.com/apache/hudi/pull/11005

   … to Spark
   
   ### Change Logs
   
   Fix Flink table creation configuration not taking effect when writing to 
Spark
   
   ### Impact
   
   _none._
   
   ### Risk level (write none, low medium or high below)
   
   _none._
   





[jira] [Updated] (HUDI-7608) Flink table creation configuration not taking effect when writing to Spark

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7608:
-
Labels: pull-request-available  (was: )

> Flink table creation configuration not taking effect when writing to Spark
> --
>
> Key: HUDI-7608
> URL: https://issues.apache.org/jira/browse/HUDI-7608
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
>
> When Spark writes data, it uses the default index type instead of the 
> index type specified when the table was created.
> flink create table:
> {code:java}
> create table if not exists hudi_catalog.source1.tb_1
> (
>   ...
> ) partitioned by (`f1`) with (
> ...
>   'index.type' = 'BUCKET',
>   'hoodie.bucket.index.num.buckets' = '4'
> ); {code}
>  





[jira] [Created] (HUDI-7608) Flink table creation configuration not taking effect when writing to Spark

2024-04-12 Thread Jira
陈磊 created HUDI-7608:


 Summary: Flink table creation configuration not taking effect when 
writing to Spark
 Key: HUDI-7608
 URL: https://issues.apache.org/jira/browse/HUDI-7608
 Project: Apache Hudi
  Issue Type: Bug
Reporter: 陈磊


When Spark writes data, it uses the default index type instead of the 
index type specified when the table was created.

flink create table:
{code:java}
create table if not exists hudi_catalog.source1.tb_1
(
  ...
) partitioned by (`f1`) with (
...
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '4'
); {code}
 





Re: [I] [SUPPORT]insert_overwrite_table table slow [hudi]

2024-04-12 Thread via GitHub


wkhappy1 commented on issue #10979:
URL: https://github.com/apache/hudi/issues/10979#issuecomment-2051443367

   @ad1happy2go I tried insert_overwrite_table append with test data and found 
that it keeps two RDDs cached in memory:
   
   https://github.com/apache/hudi/assets/54095696/279b3ebe-9334-4668-9ace-ca9159b5587a
   
   With bulk_insert overwrite on test data, there are also two RDDs cached in 
memory, but their size is small:
   
   https://github.com/apache/hudi/assets/54095696/39c0920b-eee5-460a-b0cb-9454fb9de8b3
   
   The test data has 1345493 rows and 178 columns.
   





Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11000:
URL: https://github.com/apache/hudi/pull/11000#issuecomment-2051442199

   
   ## CI report:
   
   * 6c81f312f55df6d28363cb836202aa8ec7173a3e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23215)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[I] [SUPPORT] Does StreamWriteFunction support Exactly-Once in Flink? [hudi]

2024-04-12 Thread via GitHub


seekforshell opened a new issue, #11004:
URL: https://github.com/apache/hudi/issues/11004

   **Describe the problem you faced**
   
   Flink 1.14.3 + Hudi 0.12.1
   When I use org.apache.hudi.sink.StreamWriteFunction in a Flink streaming 
job with jobmanager.execution.failover-strategy set to region, will data be 
lost, given that this function has no state to restore?
   
   **To Reproduce**
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :
   0.12.1
   * Hadoop version :
   3.1.1
   * Storage (HDFS/S3/GCS..) :
   HDFS
   * Running on Docker? (yes/no) :
   no
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   





Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-12 Thread via GitHub


nsivabalan commented on code in PR #11000:
URL: https://github.com/apache/hudi/pull/11000#discussion_r1562262882


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -268,6 +268,7 @@ public boolean commitStats(String instantTime, 
HoodieData writeStat
   commitCallback.call(new HoodieWriteCommitCallbackMessage(
   instantTime, config.getTableName(), config.getBasePath(), stats, 
Option.of(commitActionType), extraMetadata));
 }
+releaseResources(instantTime);

Review Comment:
   This covers all writers, correct? Spark DS writer and Deltastreamer?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -268,6 +268,7 @@ public boolean commitStats(String instantTime, 
HoodieData writeStat
   commitCallback.call(new HoodieWriteCommitCallbackMessage(
   instantTime, config.getTableName(), config.getBasePath(), stats, 
Option.of(commitActionType), extraMetadata));
 }
+releaseResources(instantTime);

Review Comment:
   This code path covers all writers, correct? Spark DS writer and 
Deltastreamer?






Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11000:
URL: https://github.com/apache/hudi/pull/11000#issuecomment-2051329377

   
   ## CI report:
   
   * 12cf06d732847bf9ca925bf2bb4e2e0eb39b8855 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23205)
   * 6c81f312f55df6d28363cb836202aa8ec7173a3e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23215)
   
   



Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #11000:
URL: https://github.com/apache/hudi/pull/11000#issuecomment-2051316642

   
   ## CI report:
   
   * 12cf06d732847bf9ca925bf2bb4e2e0eb39b8855 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23205)
   * 6c81f312f55df6d28363cb836202aa8ec7173a3e UNKNOWN
   
   



Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-12 Thread via GitHub


rmahindra123 commented on code in PR #11000:
URL: https://github.com/apache/hudi/pull/11000#discussion_r1562199875


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -236,8 +236,8 @@ public boolean commitStats(String instantTime, 
HoodieData writeStat
   commit(table, commitActionType, instantTime, metadata, stats, 
writeStatuses);
   postCommit(table, metadata, instantTime, extraMetadata);
   LOG.info("Committed " + instantTime);
-  releaseResources(instantTime);
 } catch (IOException e) {
+  releaseResources(instantTime);

Review Comment:
   makes sense






Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-12 Thread via GitHub


nsivabalan commented on code in PR #11000:
URL: https://github.com/apache/hudi/pull/11000#discussion_r1562194553


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -236,8 +236,8 @@ public boolean commitStats(String instantTime, 
HoodieData writeStat
   commit(table, commitActionType, instantTime, metadata, stats, 
writeStatuses);
   postCommit(table, metadata, instantTime, extraMetadata);
   LOG.info("Committed " + instantTime);
-  releaseResources(instantTime);
 } catch (IOException e) {
+  releaseResources(instantTime);

Review Comment:
   Gotcha. Then shouldn't we do the same at L255?






Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-12 Thread via GitHub


rmahindra123 commented on code in PR #11000:
URL: https://github.com/apache/hudi/pull/11000#discussion_r1562176259


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -236,8 +236,8 @@ public boolean commitStats(String instantTime, 
HoodieData writeStat
   commit(table, commitActionType, instantTime, metadata, stats, 
writeStatuses);
   postCommit(table, metadata, instantTime, extraMetadata);
   LOG.info("Committed " + instantTime);
-  releaseResources(instantTime);
 } catch (IOException e) {
+  releaseResources(instantTime);

Review Comment:
   we want to release resources after table services. I added it to the catch 
block in case of early exit. Makes sense?






Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-12 Thread via GitHub


rmahindra123 commented on code in PR #11000:
URL: https://github.com/apache/hudi/pull/11000#discussion_r1562176259


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -236,8 +236,8 @@ public boolean commitStats(String instantTime, 
HoodieData writeStat
   commit(table, commitActionType, instantTime, metadata, stats, 
writeStatuses);
   postCommit(table, metadata, instantTime, extraMetadata);
   LOG.info("Committed " + instantTime);
-  releaseResources(instantTime);
 } catch (IOException e) {
+  releaseResources(instantTime);

Review Comment:
   we want to release resources after table services






Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-12 Thread via GitHub


nsivabalan commented on code in PR #11000:
URL: https://github.com/apache/hudi/pull/11000#discussion_r1562161402


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -236,8 +236,8 @@ public boolean commitStats(String instantTime, 
HoodieData writeStat
   commit(table, commitActionType, instantTime, metadata, stats, 
writeStatuses);
   postCommit(table, metadata, instantTime, extraMetadata);
   LOG.info("Committed " + instantTime);
-  releaseResources(instantTime);
 } catch (IOException e) {
+  releaseResources(instantTime);

Review Comment:
   Why not move releaseResources to a finally block? Then we can avoid L260, 
right?
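
The suggestion in this comment, a single release call in a `finally` block that covers both the success and the failure path, can be sketched as below. This is a minimal illustration and not the actual `BaseHoodieWriteClient` code: `CommitSketch`, `doCommit`, and the `events` list are hypothetical stand-ins, and only the `releaseResources(instantTime)` name is taken from the diff.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the write client; not the real Hudi class.
class CommitSketch {
  // Records what happened, so the order of calls can be inspected.
  final List<String> events = new ArrayList<>();

  // Stand-in for commit(...) and postCommit(...); fails on request.
  void doCommit(String instantTime, boolean fail) throws IOException {
    if (fail) {
      throw new IOException("simulated commit failure for " + instantTime);
    }
    events.add("committed " + instantTime);
  }

  // Stand-in for releaseResources(instantTime) from the diff.
  void releaseResources(String instantTime) {
    events.add("released " + instantTime);
  }

  boolean commitStats(String instantTime, boolean fail) {
    try {
      doCommit(instantTime, fail);
      return true;
    } catch (IOException e) {
      throw new RuntimeException("Failed to complete commit " + instantTime, e);
    } finally {
      // Runs whether doCommit succeeded or threw, so one call covers both paths.
      releaseResources(instantTime);
    }
  }

  public static void main(String[] args) {
    CommitSketch client = new CommitSketch();
    client.commitStats("001", false);
    try {
      client.commitStats("002", true);
    } catch (RuntimeException expected) {
      // The failure path still released resources before propagating.
    }
    System.out.println(client.events);
  }
}
```

A `finally` block runs as soon as the method exits, so with this placement the release happens when `commitStats` returns or throws. Whether that is late enough depends on where table services run relative to this method.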






Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #10992:
URL: https://github.com/apache/hudi/pull/10992#issuecomment-2051187608

   
   ## CI report:
   
   * 1f421909625781304a531ccadcbf6a37ca5185a4 UNKNOWN
   * d8dda49ff97feca5172346047aacb007746568ae Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23214)
   
   



Re: [PR] [MINOR] Streamer test setup performance [hudi]

2024-04-12 Thread via GitHub


hudi-bot commented on PR #10806:
URL: https://github.com/apache/hudi/pull/10806#issuecomment-2051187027

   
   ## CI report:
   
   * e0414708ebbd734156c0383cb4e5dbfe5ff4151a UNKNOWN
   * 11c19fa8fd39ed058a4e3487c99c793610b61564 UNKNOWN
   * d9f583043f1a5ffd532d613b2ce95aa7a8fddc47 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23213)
   
   


