Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]
geserdugarov commented on PR #8062: URL: https://github.com/apache/hudi/pull/8062#issuecomment-1837977625

> @geserdugarov Thanks very much for your review! I think the most important part of the design to confirm is whether we need to provide a feature-rich but complicated TTL policy in the first place, or just implement a simple but extensible policy.
>
> @geserdugarov @codope @nsivabalan @vinothchandar Hoping for your opinions on this ~

Yes, you're right about the main difference in our ideas.
Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]
geserdugarov commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1413459990

## rfc/rfc-65/rfc-65.md: @@ -0,0 +1,209 @@

## Proposers

- @stream2000
- @hujincalrin
- @huberylee
- @YuweiXiao

## Approvers

## Status

JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)

## Abstract

In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.

This proposal introduces a Partition TTL Management service to hudi. TTL management is a table service like Clean/Compaction/Clustering. Users can configure their TTL strategies through write configs (directly via table config or via call commands), and with proper configs set, Hudi will find expired partitions and delete them automatically.

## Background

A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it becomes increasingly important to implement a user-friendly TTL management mechanism for hudi.

## Implementation

Our main goals are as follows:

* Provide an extensible framework for partition TTL management.
* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.

### Strategy Definition

TTL strategies are similar to existing table service strategies. We can define a TTL strategy the same way we define a clustering/clean/compaction strategy:

```properties
hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
hoodie.partition.ttl.days.retain=10
```

The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that determines the expired partition paths to delete, and `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that haven't been modified for this many days will expire. We will cover `KeepByTimePartitionTTLManagementStrategy` in detail in the next section.

The core definition of `PartitionTTLManagementStrategy` looks like this:

```java
/**
 * Strategy for partition-level TTL management.
 */
public abstract class PartitionTTLManagementStrategy {
  /**
   * Get expired partition paths for a specific partition TTL management strategy.
   *
   * @return Expired partition paths.
   */
  public abstract List<String> getExpiredPartitionPaths();
}
```

Users can provide their own implementation of `PartitionTTLManagementStrategy` and hudi will delete the expired partitions, as in the sketch below.
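A minimal sketch of a custom strategy against the abstract class above; the partition-listing helper and the `tmp/` naming convention are hypothetical, since the RFC does not define how a strategy obtains the partition list:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Hypothetical custom strategy: expire every partition whose path starts
 * with "tmp/", regardless of age.
 */
public class ExpireTmpPartitionsStrategy extends PartitionTTLManagementStrategy {
  @Override
  public List<String> getExpiredPartitionPaths() {
    return getAllPartitionPaths().stream()
        .filter(path -> path.startsWith("tmp/"))
        .collect(Collectors.toList());
  }

  // Stand-in for however the framework exposes the table's partition paths,
  // e.g. something like FSUtils.getAllPartitionPaths in Hudi itself.
  private List<String> getAllPartitionPaths() {
    return Arrays.asList("tmp/2023-12-01", "events/2023-12-01");
  }
}
```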
### KeepByTimeTTLManagementStrategy

We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.

`KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and a partition's `lastModifiedTime` is larger than the configured `hoodie.partition.ttl.days.retain`, the strategy will mark that partition as expired (a sketch of this check follows the quoted text). We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas on this.

We will use the largest commit time of the committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.

For file groups generated by a replace commit, the commit time may not reveal the real insert/update time of the file group. However, we can assume that, when using this strategy, we won't cluster a partition that has had no new writes for a long time. In the future, we may introduce a more accurate mechanism to get a partition's `lastModifiedTime`, for example using the metadata table.

### Apply different strategies for different partitions

Some users may want to apply different strategies to different partitions. For example, they may have multiple partition fields (productId, day). For partitions und
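A sketch of the KEEP_BY_TIME expiration check described above, assuming each partition's `lastModifiedTime` has already been resolved to epoch milliseconds (how that map is built from committed file groups is not shown here):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Marks a partition expired when it was last modified more than daysRetain days ago. */
public final class KeepByTimeSketch {
  public static List<String> expiredPartitionPaths(Map<String, Long> lastModifiedTimes, long daysRetain) {
    Instant now = Instant.now();
    return lastModifiedTimes.entrySet().stream()
        // Expired when (now - lastModifiedTime) exceeds the retention window.
        .filter(e -> Duration.between(Instant.ofEpochMilli(e.getValue()), now)
            .compareTo(Duration.ofDays(daysRetain)) > 0)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}
```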
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1837964188

## CI report:

* ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
* 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 commented on code in PR #10226: URL: https://github.com/apache/hudi/pull/10226#discussion_r1413451686

## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala: @@ -0,0 +1,336 @@

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.hudi.command.procedures

import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hudi.avro.model.HoodieMetadataColumnStats
import org.apache.hudi.client.common.HoodieSparkEngineContext
import org.apache.hudi.common.config.HoodieMetadataConfig
import org.apache.hudi.common.data.HoodieData
import org.apache.hudi.common.fs.FSUtils
import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, HoodieInstant}
import org.apache.hudi.common.table.view.HoodieTableFileSystemView
import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
import org.apache.hudi.common.util.{Option => HOption}
import org.apache.hudi.metadata.{HoodieTableMetadata, HoodieTableMetadataUtil}
import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
import org.apache.spark.internal.Logging
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}

import java.util
import java.util.function.{Function, Supplier}
import scala.collection.JavaConversions.asScalaBuffer
import scala.collection.{JavaConversions, mutable}
import scala.jdk.CollectionConverters.{asScalaBufferConverter, asScalaIteratorConverter, seqAsJavaListConverter}

/**
 * Calculate the degree of overlap between column stats.
 *
 * The overlap represents the extent to which the min-max ranges cover each other.
```

Review Comment: The suggested doc format for a new paragraph is:

```java
The overlap represents t
```
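The suggestion block is truncated in the archive; it presumably points at the Javadoc/Scaladoc paragraph convention. A hedged reconstruction (the `<p>` tag placement is an assumption for illustration, not the verbatim suggestion):

```java
/**
 * Calculate the degree of overlap between column stats.
 *
 * <p>The overlap represents the extent to which the min-max ranges cover each other.
 */
```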
Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]
geserdugarov commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1413452143 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) + +## Abstract + +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period +of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the +dataset from growing infinitely. +This proposal introduces Partition TTL Management strategies to hudi, people can config the strategies by table config +directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. + + +This proposal introduces Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering. +The user can config their ttl strategies through write configs and Hudi will help users find expired partitions and delete them automatically. + +## Background + +TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to +delete outdated partitions. However, users still need to detect which partitions are outdated and +call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategies, find expired partitions and call call `delete_partition` by themselves. As the scale of installations grew, it is becoming increasingly important to implement a user-friendly TTL management mechanism for hudi. + +## Implementation + +Our main goals are as follows: + +* Providing an extensible framework for partition TTL management. +* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or asynchronous table services. + +### Strategy Definition + +The TTL strategies is similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy: + +```properties +hoodie.partition.ttl.management.strategy=KEEP_BY_TIME +hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy +hoodie.partition.ttl.days.retain=10 +``` + +The config `hoodie.partition.ttl.management.strategy.class` is to provide a strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired partition paths to delete. And `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means that we will expire partitions that haven't been modified for this strategy value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in detail in the next section. + +The core definition of `PartitionTTLManagementStrategy` looks like this: + +```java +/** + * Strategy for partition-level TTL management. + */ +public abstract class PartitionTTLManagementStrategy { + /** + * Get expired partition paths for a specific partition ttl management strategy. + * + * @return Expired partition paths. + */ + public abstract List getExpiredPartitionPaths(); +} +``` + +Users can provide their own implementation of `PartitionTTLManagementStrategy` and hudi will help delete the expired partitions. 
+ +### KeepByTimeTTLManagementStrategy + +We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in the first version of partition TTL management implementation. + +The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partitions. If duration between now and 'lastModifiedTime' for the partition is larger than what `hoodie.partition.ttl.days.retain` configured, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as an expired partition. We use day as the unit of expired time since it is very common-used for datalakes. Open to ideas for this. + +we will to use the largest commit time of committed file groups in the partition as the partition's +`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with larger instant time will change the partition's `lastModifiedTime`. + +For file groups generated by replace commit, it may not reveal the real insert/update time for the file group. However, we can assume that we won't do clustering for a partition without new writes for a long time when using the strategy. And in the future, we may introduce a more accurate mechanism to get `lastModifiedTime` of a partition, for example using metadata table. + +### Apply different strategies for different partitions + +For some specific users, they may want to apply different strategies for different partitions. For example, they may have multi partition fileds(productId, day). For partitions und
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 commented on code in PR #10226: URL: https://github.com/apache/hudi/pull/10226#discussion_r1413446855

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: @@ -278,6 +293,19 @@ public static HoodieColumnRangeMetadata convertColumnStatsRecordToCo

```java
        columnStats.getTotalUncompressedSize());
  }

  public static Option<String> getColumnStatsValueAsString(Object statsValue) {
    if (statsValue == null) {
      System.out.println("Invalid value: " + statsValue);
```

Review Comment: Log instead of print in production code.
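A minimal sketch of the suggested fix, assuming the slf4j facade that Hudi modules generally use (the class name and method are illustrative, not the actual patch):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ColumnStatsLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ColumnStatsLoggingSketch.class);

  // Replaces System.out.println with a leveled, parameterized log call.
  static void reportInvalidValue(Object statsValue) {
    LOG.warn("Invalid column stats value: {}", statsValue);
  }
}
```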
Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage . why is HoodieGlobalSimpleIndex ? [hudi]
danny0405 commented on issue #10235: URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837959640

The config key is `hoodie.metadata.enable`, not `hoodie.metadata.table`.
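For context, a sketch of the write configs this implies (the record-index keys are assumptions based on Hudi 0.14-era configuration names, not taken from this thread):

```properties
# Enable the metadata table; the key is hoodie.metadata.enable, not hoodie.metadata.table
hoodie.metadata.enable=true
# The record index is stored in the metadata table and is enabled separately
hoodie.metadata.record.index.enable=true
hoodie.index.type=RECORD_INDEX
```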
Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]
geserdugarov commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1413444578

## rfc/rfc-65/rfc-65.md: @@ -0,0 +1,209 @@

> We will use the largest commit time of the committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.

Review Comment: Currently, `HoodiePartitionMetadata` (the `.hoodie_partition_metadata` files) provides only the `commitTime` (the commit time of partition creation) and `partitionDepth` properties. This file is written during partition creation and is not updated later. We could make it more useful by adding a new property, for instance `lastUpdateTime`, that saves the time of the last upsert/delete operation on the partition with each commit/deltacommit/replacecommit, and then use this property for the partition TTL check.
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1837955494

## CI report:

* ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
* 4cc49f2a068603f35b7b4391a0d3d40af3397d43 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage . why is HoodieGlobalSimpleIndex ? [hudi]
zyclove commented on issue #10235: URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837949851

@danny0405 Why does it fall back to GLOBAL_SIMPLE?

![image](https://github.com/apache/hudi/assets/15028279/20107e0d-46eb-4e28-9a5a-0fc8750cbc34)

23/12/04 14:39:29 WARN SparkMetadataTableRecordIndex: Record index not initialized so falling back to GLOBAL_SIMPLE for tagging records
Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]
geserdugarov commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1413436193

## rfc/rfc-65/rfc-65.md: @@ -0,0 +1,209 @@

Review Comment:

> The TTL policy will default to KEEP_BY_TIME and we can automatically detect expired partitions by their last modified time and delete them.

Unfortunately, I don't see any strategies other than KEEP_BY_TIME for partition-level TTL.
Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]
geserdugarov commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1413434722

## rfc/rfc-65/rfc-65.md: @@ -0,0 +1,209 @@

```properties
hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
```

Review Comment: I suppose it's better to implement only one kind of partition TTL processing, based on time alone. KEEP_BY_SIZE doesn't look suitable at the partition level; it is more appropriate for record-level processing. So we don't need the `hoodie.partition.ttl.strategy` setting.
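If the strategy selector were dropped as suggested, the RFC's example configuration would reduce to a single retention setting (a sketch using the RFC's own key):

```properties
hoodie.partition.ttl.days.retain=10
```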
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 commented on code in PR #10226: URL: https://github.com/apache/hudi/pull/10226#discussion_r1412716844 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala: ## @@ -0,0 +1,355 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hudi.command.procedures + +import org.apache.avro.generic.IndexedRecord +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hudi.avro.model._ +import org.apache.hudi.client.common.HoodieSparkEngineContext Review Comment: should fix the import sequence, you can reference this checkstyle: https://github.com/apache/hudi/blob/cd4f0de57522a681fbe5b62fd774c1943254ec2d/style/checkstyle.xml#L289 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Comparison between defaultParName and partValue [hudi]
danny0405 commented on code in PR #10234: URL: https://github.com/apache/hudi/pull/10234#discussion_r1413431893 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PartitionPruners.java: ## @@ -94,7 +94,7 @@ private boolean evaluate(String partition) { Map partStats = new LinkedHashMap<>(); for (int idx = 0; idx < partitionKeys.length; idx++) { String partKey = partitionKeys[idx]; -Object partVal = partKey.equals(defaultParName) +Object partVal = partStrArray[idx].equals(defaultParName) Review Comment: The partKey is never used; maybe just switch it to `String partValStr = partStrArray[idx];` and then use this variable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
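Spelled out, the suggested cleanup would look roughly like the sketch below. `partitionKeys`, `partStrArray`, and `defaultParName` come from the quoted diff; the value conversion is stubbed out here, since the real call is elided from the quote.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PartitionStatsSketch {
  // Hypothetical stand-in for Hudi's real partition-value conversion.
  private static Object parsePartitionValue(String raw) {
    return raw;
  }

  static Map<String, Object> buildPartStats(String[] partitionKeys,
                                            String[] partStrArray,
                                            String defaultParName) {
    Map<String, Object> partStats = new LinkedHashMap<>();
    for (int idx = 0; idx < partitionKeys.length; idx++) {
      String partKey = partitionKeys[idx];
      // Extract the value once, and compare the *value* (not the key)
      // against the default partition name.
      String partValStr = partStrArray[idx];
      Object partVal = partValStr.equals(defaultParName) ? null : parsePartitionValue(partValStr);
      partStats.put(partKey, partVal);
    }
    return partStats;
  }
}
```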
Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]
danny0405 commented on code in PR #10209: URL: https://github.com/apache/hudi/pull/10209#discussion_r1413428497 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java: ## @@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception { // case2: empty table without data files Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ"); Review Comment: Hmm, maybe we should just fix the table type to be in line with `hoodie.properties` when there is an inconsistency, instead of throwing. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
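A sketch of the reconciliation suggested here: on a mismatch, align the declared option with what `hoodie.properties` records instead of throwing. `FlinkOptions.TABLE_TYPE`, Flink's `Configuration`, and `HoodieTableMetaClient` are real classes, but the wrapper method is illustrative only.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.configuration.FlinkOptions;

class TableTypeReconciler {
  // Illustrative only: trust the table's own hoodie.properties over the
  // table type declared in the catalog / table options.
  static void alignTableType(Configuration conf, HoodieTableMetaClient metaClient) {
    String declared = conf.getString(FlinkOptions.TABLE_TYPE);
    String actual = metaClient.getTableConfig().getTableType().name(); // COPY_ON_WRITE or MERGE_ON_READ
    if (!actual.equals(declared)) {
      conf.setString(FlinkOptions.TABLE_TYPE, actual);
    }
  }
}
```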
Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]
danny0405 commented on code in PR #10209: URL: https://github.com/apache/hudi/pull/10209#discussion_r1413426788 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java: ## @@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception { // case2: empty table without data files Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ"); Review Comment: Then why does the reader expect a `MOR` table if the table type was not specified by the table declaration? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]
hehuiyuan commented on code in PR #10209: URL: https://github.com/apache/hudi/pull/10209#discussion_r1413426165 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java: ## @@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception { // case2: empty table without data files Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ"); Review Comment: It caused an error when executing `List rows2 = execSelectSql(streamTableEnv, "select * from t1", 10);`. For the t1 hudi table: the table options in the in-memory catalog say MOR, but `hoodie.properties` says COW. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
majian1998 commented on code in PR #10226: URL: https://github.com/apache/hudi/pull/10226#discussion_r1413424885 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala: ## @@ -0,0 +1,355 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + + * http://www.apache.org/licenses/LICENSE-2.0 + + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hudi.command.procedures + +import org.apache.avro.generic.IndexedRecord +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hudi.avro.model._ +import org.apache.hudi.client.common.HoodieSparkEngineContext +import org.apache.hudi.common.config.HoodieMetadataConfig +import org.apache.hudi.common.data.HoodieData +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.model.{FileSlice, HoodieRecord} +import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, HoodieInstant} +import org.apache.hudi.common.table.view.HoodieTableFileSystemView +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.util.{Option => HOption} +import org.apache.hudi.exception.HoodieException +import org.apache.hudi.metadata.HoodieTableMetadata +import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport} +import org.apache.spark.internal.Logging +import org.apache.spark.sql.Row +import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType} + +import java.util +import java.util.function.{Function, Supplier} +import scala.collection.JavaConversions.asScalaBuffer +import scala.collection.{JavaConversions, mutable} +import scala.jdk.CollectionConverters.{asScalaBufferConverter, asScalaIteratorConverter, seqAsJavaListConverter} + +/** + * Calculate the degree of overlap between column stats. + * The overlap represents the extent to which the min-max ranges cover each other. + * By referring to the overlap, we can visually demonstrate the degree of data skipping + * for different columns under the current table's data layout. + * The calculation is performed at the partition level (assuming that data skipping is based on partition pruning). + * + * For example, consider three files: a.parquet, b.parquet, and c.parquet. + * Taking an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5, + * for 'b' is 3–7, and for 'c' is 7–8. This results in their values overlapping on the coordinate axis as follows: + * (range diagram: a.parquet covers 1–5, b.parquet covers 3–7, c.parquet covers 7–8) + * Thus, there will be overlap within the ranges 3–5 and 7. + * If the filter conditions for 'id' during data skipping include these values, + * multiple files will be filtered out. For a simpler case, if it's an equality query, + * 2 files will be filtered within these ranges, and no more than one file will be filtered in other cases (possibly outside of the range). + * + * Additionally, calculating the degree of overlap based solely on the maximum values + * may not provide sufficient information. Therefore, we sample and calculate the overlap degree + * for all values involved in the min-max range. We also compute the degree of overlap + * at different percentiles and tally the count of these values. An example of a result is as follows: + * |Partition path |Field name |Average overlap |Maximum file overlap |Total file number |50% overlap |75% overlap |95% overlap |99% overlap |Total value number | + * |---|---|---|---|---|---|---|---|---|---| + * |path |c8 |1.33 |2 |2 |1 |1 |1 |1 |3 | + + */ +class ShowColumnStatsOverlapProcedure extends BaseProcedure with ProcedureBuilder with Logging { + private val PARAMETERS = Array[ProcedureParameter]( +ProcedureParameter.required(0, "table", DataTypes.StringType), +ProcedureParameter.optional(1, "partition", DataTypes.StringType), +ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType) + ) + + private val OUTPUT_TYPE = new StructType(Array[StructField]( +StructField(
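As a sanity check on the overlap definition above, the standalone sketch below (not the procedure's actual code) replays the a/b/c example: for every value spanned by any file's min-max range it counts how many ranges contain it, then derives the average and maximum overlap.

```java
import java.util.List;

public class OverlapExample {
  record Range(long min, long max) {
    boolean contains(long v) {
      return v >= min && v <= max;
    }
  }

  public static void main(String[] args) {
    // a.parquet: 1-5, b.parquet: 3-7, c.parquet: 7-8
    List<Range> ranges = List.of(new Range(1, 5), new Range(3, 7), new Range(7, 8));
    long lo = ranges.stream().mapToLong(Range::min).min().orElseThrow();
    long hi = ranges.stream().mapToLong(Range::max).max().orElseThrow();

    long values = 0, overlapSum = 0, maxOverlap = 0;
    for (long v = lo; v <= hi; v++) {
      final long value = v;
      long cover = ranges.stream().filter(r -> r.contains(value)).count();
      values++;
      overlapSum += cover;
      maxOverlap = Math.max(maxOverlap, cover);
    }
    // Values 3, 4, 5 and 7 are each covered by two files, so an equality
    // filter on any of them can skip at most one of the three files.
    System.out.printf("average overlap = %.2f, max overlap = %d over %d values%n",
        (double) overlapSum / values, maxOverlap, values);
  }
}
```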
Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]
hehuiyuan commented on code in PR #10209: URL: https://github.com/apache/hudi/pull/10209#discussion_r1413405718 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java: ## @@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception { // case2: empty table without data files Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ"); Review Comment: The test error log: ``` org.apache.flink.table.api.ValidationException: Unable to create a source for reading table 'default_catalog.default_database.t1'. Table options are: 'connector'='hudi' 'path'='/var/folders/k1/65gcjk_92ws2bjh3ftpz33fcgp/T/junit1749019659644883700' 'read.streaming.enabled'='true' 'table.type'='MERGE_ON_READ' at org.apache.flink.table.factories.FactoryUtil.createDynamicTableSource(FactoryUtil.java:219) at org.apache.flink.table.factories.FactoryUtil.createDynamicTableSource(FactoryUtil.java:244) at org.apache.flink.table.planner.plan.schema.CatalogSourceTable.createDynamicTableSource(CatalogSourceTable.java:175) at org.apache.flink.table.planner.plan.schema.CatalogSourceTable.toRel(CatalogSourceTable.java:115) at org.apache.calcite.sql2rel.SqlToRelConverter.toRel(SqlToRelConverter.java:3997) at org.apache.calcite.sql2rel.SqlToRelConverter.convertIdentifier(SqlToRelConverter.java:2867) at org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2427) at org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2341) at org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2286) at org.apache.calcite.sql2rel.SqlToRelConverter.convertSelectImpl(SqlToRelConverter.java:723) at org.apache.calcite.sql2rel.SqlToRelConverter.convertSelect(SqlToRelConverter.java:709) at org.apache.calcite.sql2rel.SqlToRelConverter.convertQueryRecursive(SqlToRelConverter.java:3843) at org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(SqlToRelConverter.java:617) at org.apache.flink.table.planner.calcite.FlinkPlannerImpl.org$apache$flink$table$planner$calcite$FlinkPlannerImpl$$rel(FlinkPlannerImpl.scala:229) at org.apache.flink.table.planner.calcite.FlinkPlannerImpl.rel(FlinkPlannerImpl.scala:205) at org.apache.flink.table.planner.operations.SqlNodeConvertContext.toRelRoot(SqlNodeConvertContext.java:69) at org.apache.flink.table.planner.operations.converters.SqlQueryConverter.convertSqlNode(SqlQueryConverter.java:48) at org.apache.flink.table.planner.operations.converters.SqlNodeConverters.convertSqlNode(SqlNodeConverters.java:73) at org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convertValidatedSqlNode(SqlNodeToOperationConversion.java:272) at org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convertValidatedSqlNodeOrFail(SqlNodeToOperationConversion.java:390) at org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convertSqlInsert(SqlNodeToOperationConversion.java:745) at org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convertValidatedSqlNode(SqlNodeToOperationConversion.java:353) at org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convert(SqlNodeToOperationConversion.java:262) at org.apache.flink.table.planner.delegation.ParserImpl.parse(ParserImpl.java:106) at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:728) at 
org.apache.hudi.table.ITTestHoodieDataSource.submitSelectSql(ITTestHoodieDataSource.java:2192) at org.apache.hudi.table.ITTestHoodieDataSource.execSelectSql(ITTestHoodieDataSource.java:2185) at org.apache.hudi.table.ITTestHoodieDataSource.execSelectSql(ITTestHoodieDataSource.java:2180) at org.apache.hudi.table.ITTestHoodieDataSource.execSelectSql(ITTestHoodieDataSource.java:2166) at org.apache.hudi.table.ITTestHoodieDataSource.testStreamReadEmptyTablePath(ITTestHoodieDataSource.java:1026) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688) at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60) at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131) at
Re: [PR] Comparison between defaultParName and partValue [hudi]
hudi-bot commented on PR #10234: URL: https://github.com/apache/hudi/pull/10234#issuecomment-1837908541 ## CI report: * a21ab0df2ee9e6a24219a6973624784a37017d00 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21286) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage . why is HoodieGlobalSimpleIndex ? [hudi]
zyclove opened a new issue, #10235: URL: https://github.com/apache/hudi/issues/10235 **Describe the problem you faced** The Spark job is too slow in the following stage. Adjusting CPU, memory, and concurrency has no effect. Which stage can be optimized or skipped? ![image](https://github.com/apache/hudi/assets/15028279/e4122bc3-e02b-4f01-9010-737300b85bed) Is this normal? Why is HoodieGlobalSimpleIndex still used? ![image](https://github.com/apache/hudi/assets/15028279/89cb305f-bc23-40a7-ac00-0adab5933b53) **To Reproduce** Steps to reproduce the behavior: 1. table config ``` CREATE TABLE if NOT EXISTS bi_dw_real.smart_datapoint_report_rw_clear_rt( id STRING COMMENT 'id', uuid STRING COMMENT 'log uuid', data_id STRING COMMENT '', dev_id STRING COMMENT '', gw_id STRING COMMENT '', product_id STRING COMMENT '', uid STRING COMMENT '', dp_code STRING COMMENT '', dp_id STRING COMMENT '', dp_mode STRING COMMENT '', dp_name STRING COMMENT '', dp_time STRING COMMENT '', dp_type STRING COMMENT '', dp_value STRING COMMENT '', gmt_modified BIGINT COMMENT 'ct time', dt STRING COMMENT 'time partition field' ) using hudi PARTITIONED BY (dt,dp_mode) COMMENT '' location '${bi_db_dir}/bi_ods_real/ods_smart_datapoint_report_rw_clear_rt' tblproperties ( type = 'mor', primaryKey = 'id', preCombineField = 'gmt_modified', hoodie.combine.before.upsert='false', hoodie.metadata.record.index.enable='true', hoodie.datasource.write.operation='upsert', hoodie.metadata.table='true', hoodie.datasource.write.hive_style_partitioning='true', hoodie.metadata.record.index.min.filegroup.count ='512', hoodie.index.type='RECORD_INDEX', hoodie.compact.inline='false', hoodie.common.spillable.diskmap.type='ROCKS_DB', hoodie.datasource.write.partitionpath.field='dt,dp_mode', hoodie.compaction.payload.class='org.apache.hudi.common.model.PartialUpdateAvroPayload' ) ; set hoodie.write.lock.zookeeper.lock_key=bi_ods_real.smart_datapoint_report_rw_clear_rt; set hoodie.storage.layout.type=DEFAULT; set hoodie.metadata.record.index.enable=true; set hoodie.metadata.table=true; set hoodie.populate.meta.fields=false; set hoodie.parquet.compression.codec=snappy; set hoodie.memory.merge.max.size=200485760; set hoodie.write.buffer.limit.bytes=419430400; set hoodie.index.type=RECORD_INDEX; ``` 3. insert into bi_dw_real.smart_datapoint_report_rw_clear_rt **Expected behavior** A clear and concise description of what you expected to happen. **Environment Description** * Hudi version : 0.14.0 * Spark version : 3.2.1 * Hive version : 3.1.3 * Hadoop version : 3.2.2 * Storage (HDFS/S3/GCS..) : s3 * Running on Docker? (yes/no) : no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Comparison between defaultParName and partValue [hudi]
hudi-bot commented on PR #10234: URL: https://github.com/apache/hudi/pull/10234#issuecomment-1837901855 ## CI report: * a21ab0df2ee9e6a24219a6973624784a37017d00 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] Comparison between defaultParName and partValue [hudi]
hehuiyuan opened a new pull request, #10234: URL: https://github.com/apache/hudi/pull/10234 ### Change Logs Comparison between defaultParName and partValue ### Impact ### Risk level (write none, low medium or high below) ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] With autogenerated keys HoodieStreamer failing with error - ts(Part -ts) field not found in record [hudi]
Sarfaraz-214 commented on issue #10233: URL: https://github.com/apache/hudi/issues/10233#issuecomment-1837838654 Hi @Amar1404 - I do not want to have a pre-combine key. I want to dump all the data I get as is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] With autogenerated keys HoodieStreamer failing with error - ts(Part -ts) field not found in record [hudi]
Amar1404 commented on issue #10233: URL: https://github.com/apache/hudi/issues/10233#issuecomment-1837836791 Hi @Sarfaraz-214 - HoodieStreamer automatically takes `ts` as the default value of hoodie.datasource.write.precombine.field. You need to pass a value in the properties file for the property **hoodie.datasource.write.precombine.field**; that field takes precedence when selecting between two records having the same primary key. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
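For reference, the property would normally go into the file passed via `--props`; a minimal sketch of the equivalent programmatic form follows (the field name `event_time` is just an example, not taken from this thread).

```java
import org.apache.hudi.common.config.TypedProperties;

public class PrecombineConfigExample {
  public static void main(String[] args) {
    // Equivalent to this line in the HoodieStreamer --props file:
    //   hoodie.datasource.write.precombine.field=event_time
    TypedProperties props = new TypedProperties();
    props.setProperty("hoodie.datasource.write.precombine.field", "event_time");
    System.out.println(props.getString("hoodie.datasource.write.precombine.field"));
  }
}
```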
[I] With autogenerated keys HoodieStreamer failing with error - ts(Part -ts) field not found in record [hudi]
Sarfaraz-214 opened a new issue, #10233: URL: https://github.com/apache/hudi/issues/10233 I am using HoodieStreamer with **Hudi 0.14** and trying to leverage [autogenerated keys](https://hudi.apache.org/releases/release-0.14.0/#support-for-hudi-tables-with-autogenerated-keys). Hence I am not passing **hoodie.datasource.write.recordkey.field** & **hoodie.datasource.write.precombine.field**. Additionally, I am passing **hoodie.spark.sql.insert.into.operation = insert** (instead of --op insert), which claims that no pre-combine key is needed with bulk_insert and insert mode. With the above, the **.hoodie** directory gets created but the data write to GCS fails with the error - ``` org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[c1, c2, c3, c4, c5] at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:601) ``` I also see in the **hoodie.properties** file that the pre-combine key is getting set to **ts** (hoodie.table.precombine.field=ts). Seems like this is getting set due to the default value of **--source-ordering-field**. How can we skip the pre-combine field in this case? This is happening for both CoW & MoR tables. Actually this is running fine via Spark-SQL, but while using HoodieStreamer I am facing the issue. Sharing the configurations used: **hudi-table.properties** ``` hoodie.datasource.write.partitionpath.field=job_id hoodie.spark.sql.insert.into.operation=insert bootstrap.servers=*** security.protocol=SASL_SSL sasl.mechanism=PLAIN sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username='***' password='***'; auto.offset.reset=earliest hoodie.deltastreamer.source.kafka.topic= hoodie.deltastreamer.schemaprovider.source.schema.file=gs:///.avsc hoodie.write.concurrency.mode=optimistic_concurrency_control hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider ``` **spark-submit command** ``` spark-submit \ --class org.apache.hudi.utilities.streamer.HoodieStreamer \ --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.0, \ --properties-file /home/sarfaraz_h/spark-config.properties \ --master yarn \ --deploy-mode cluster \ --driver-memory 12G \ --driver-cores 3 \ --executor-memory 12G \ --executor-cores 3 \ --num-executors 3 \ --conf spark.yarn.maxAppAttempts=1 \ --conf spark.sql.shuffle.partitions=18 \ gs:///jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \ --continuous \ --source-limit 100 \ --min-sync-interval-seconds 600 \ --table-type COPY_ON_WRITE \ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \ --target-base-path gs:/// \ --target-table \ --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \ --props gs:///configfolder/es_user_profile_config.properties ``` Spark version used is - 3.3.2 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
hudi-bot commented on PR #10230: URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837811537 ## CI report: * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21284) * 683b087f5a248c47b3edb95ee92cef71d8c0f506 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21285) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] SparkSQL hangs indefinitely during Hudi table read operation [hudi]
jonathantransb opened a new issue, #10232: URL: https://github.com/apache/hudi/issues/10232 **Describe the problem you faced** I'm attempting to read a Hudi table on Glue Catalog using SparkSQL with metadata enabled. However, my job appears to hang indefinitely at a certain step. Despite enabling DEBUG logs, I'm unable to find any indications of what may be causing this issue. Notably, this problem only occurs with Hudi tables where `clean` is the latest action in the timeline. **To Reproduce** Steps to reproduce the behavior: 1. Create a Hudi table where `clean` is the latest action in the timeline https://github.com/apache/hudi/assets/60864800/813c8190-8a47-48ed-8d8c-31dc2a29ff4b 2. Open spark-shell ```bash spark-shell \ --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \ --conf "spark.sql.parquet.filterPushdown=true" \ --conf "spark.sql.parquet.mergeSchema=false" \ --conf "spark.speculation=false" \ --conf "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" \ --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \ --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \ --conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \ --conf "spark.sql.catalogImplementation=hive" \ --conf "spark.sql.catalog.spark_catalog.type=hive" \ --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \ --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \ --conf "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar" ``` 3. Run spark.sql(): ```bash scala> spark.sql("SET hoodie.metadata.enable=true") scala> spark.sql("SELECT * FROM . LIMIT 50").show() ``` **Expected behavior** Spark job can read the table without hanging **Environment Description** * Hudi version : 0.14.0 * Spark version : 3.4.1 * Hive version : 2.3.9 * Hadoop version : 3.3.6 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : yes **Additional context** I encountered no issues while using Hudi version 0.13.1. However, upon trying the new Hudi 0.14.0 version, I experienced this problem. For tables where `commit` is the latest action in the timeline, Hudi 0.14.0 can read the table without any hanging issues. https://github.com/apache/hudi/assets/60864800/e8375fe0-7858-4779-b397-dc23f006c7dc The driver pod consistently uses up to 1 CPU core, although I'm uncertain about the processes that are running: https://github.com/apache/hudi/assets/60864800/d11979f7-c4ac-4703-80dd-6e122152b8fb **Stacktrace** ``` 23/12/03 12:21:23 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site.xml 23/12/03 12:21:23 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is file:/opt/spark/work-dir/spark-warehouse 23/12/03 12:21:23 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1 23/12/03 12:21:24 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1 23/12/03 12:21:26 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties 23/12/03 12:21:26 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 23/12/03 12:21:26 INFO MetricsSystemImpl: s3a-file-system metrics system started 23/12/03 12:21:27 WARN SDKV2Upgrade: Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain.
AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2 23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf 23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file 23/12/03 12:21:28 INFO DataSourceUtils: Getting table path.. 23/12/03 12:21:28 INFO TablePathUtils: Getting table path from path : s3: 23/12/03 12:21:28 INFO DefaultSource: Obtained hudi table path: s3: 23/12/03 12:21:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3: 23/12/03 12:21:28 INFO HoodieTableConfig: Loading table properties from s3:/.hoodie/hoodie.properties 23/12/03 12:21:28 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3: 23/12/03 12:21:28 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot 23/12/03 12:21:28 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]} 23/12/03 12:21:28 INFO TableSchemaResol
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
hudi-bot commented on PR #10230: URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837806459 ## CI report: * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21284) * 683b087f5a248c47b3edb95ee92cef71d8c0f506 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
ksmou commented on code in PR #10230: URL: https://github.com/apache/hudi/pull/10230#discussion_r1413331856 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java: ## @@ -185,6 +188,41 @@ public void testRecommitWithPartialUncommittedEvents() { assertThat("Recommits the instant with partial uncommitted events", lastCompleted, is(instant)); } + @Test + public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws Exception { +// reset +reset(); +// override the default configuration +Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +OperatorCoordinator.Context context = new MockOperatorCoordinatorContext(new OperatorID(), 1); +coordinator = new StreamWriteOperatorCoordinator(conf, context); +coordinator.start(); +coordinator.setExecutor(new MockCoordinatorExecutor(context)); + +coordinator.getWriteClient().getConfig() +.setValue(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY, HoodieFailedWritesCleaningPolicy.LAZY.name()); + assertTrue(coordinator.getWriteClient().getConfig().getFailedWritesCleanPolicy().isLazy()); + +final WriteMetadataEvent event0 = WriteMetadataEvent.emptyBootstrap(0); + +// start one instant and not commit it +coordinator.handleEventFromOperator(0, event0); +String instant = coordinator.getInstant(); +HoodieHeartbeatClient heartbeatClient = coordinator.getWriteClient().getHeartbeatClient(); +assertNotEquals(null, heartbeatClient.getHeartbeat(instant), "Heartbeat should not null"); + +String basePath = tempFile.getAbsolutePath(); +HoodieWrapperFileSystem fs = coordinator.getWriteClient().getHoodieTable().getMetaClient().getFs(); + +assertTrue(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), "Heartbeat is existed"); + +// send bootstrap event to stop the heartbeat for this instant +WriteMetadataEvent event1 = WriteMetadataEvent.emptyBootstrap(0); +coordinator.handleEventFromOperator(0, event1); + +assertFalse(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), "Heartbeat is stopped and not existed"); + } Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6822] Fix deletes handling in hbase index when partition path is updated [hudi]
flashJd commented on code in PR #9630: URL: https://github.com/apache/hudi/pull/9630#discussion_r1413320714 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -1402,38 +1397,13 @@ private void fetchOutofSyncFilesRecordsFromMetadataTable(Map getRecordIndexUpdates(HoodieData writeStatuses) { -HoodiePairData recordKeyDelegatePairs = null; -// if update partition path is true, chances that we might get two records (1 delete in older partition and 1 insert to new partition) -// and hence we might have to do reduce By key before ingesting to RLI partition. -if (dataWriteConfig.getRecordIndexUpdatePartitionPath()) { - recordKeyDelegatePairs = writeStatuses.map(writeStatus -> writeStatus.getWrittenRecordDelegates().stream() - .map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), recordDelegate))) - .flatMapToPair(Stream::iterator) - .reduceByKey((recordDelegate1, recordDelegate2) -> { -if (recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) { - if (!recordDelegate1.getNewLocation().isPresent() && !recordDelegate2.getNewLocation().isPresent()) { -throw new HoodieIOException("Both version of records do not have location set. Record V1 " + recordDelegate1.toString() + ", Record V2 " + recordDelegate2.toString()); - } - if (recordDelegate1.getNewLocation().isPresent()) { -return recordDelegate1; - } else { -// if record delegate 1 does not have location set, record delegate 2 should have location set. -return recordDelegate2; - } -} else { - return recordDelegate1; -} - }, Math.max(1, writeStatuses.getNumPartitions())); -} else { - // if update partition path = false, we should get only one entry per record key. - recordKeyDelegatePairs = writeStatuses.flatMapToPair( - (SerializableFunction>>) writeStatus - -> writeStatus.getWrittenRecordDelegates().stream().map(rec -> Pair.of(rec.getRecordKey(), rec)).iterator()); -} -return recordKeyDelegatePairs -.map(writeStatusRecordDelegate -> { - HoodieRecordDelegate recordDelegate = writeStatusRecordDelegate.getValue(); +return writeStatuses.flatMap(writeStatus -> { + List recordList = new LinkedList<>(); + for (HoodieRecordDelegate recordDelegate : writeStatus.getWrittenRecordDelegates()) { +if (!writeStatus.isErrored(recordDelegate.getHoodieKey())) { + if (recordDelegate.getIgnoreFlag()) { Review Comment: yeah, it's right, `we are setting the ignore flag only in indexing code and specifically when indexing could return two versions of record delegate.` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
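A toy restatement of the filtering the new code performs, with stand-in types (the real `HoodieRecordDelegate` and `WriteStatus` live in hudi-client-common): errored keys are dropped, and so are delegates whose ignore flag is set, i.e. the extra "delete from the old partition" entry emitted when a global index update moves a key to a new partition path.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RecordIndexUpdateSketch {
  // Minimal stand-in for a written record delegate.
  record Delegate(String recordKey, boolean ignoreFlag) {}

  static List<Delegate> keptDelegates(List<Delegate> written, Set<String> erroredKeys) {
    return written.stream()
        .filter(d -> !erroredKeys.contains(d.recordKey())) // skip failed writes
        .filter(d -> !d.ignoreFlag())                      // skip old-partition deletes
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Delegate> written = List.of(
        new Delegate("k1", false),
        new Delegate("k1", true), // flagged: delete in the old partition
        new Delegate("k2", false));
    // Only the un-flagged k1 and k2 survive.
    System.out.println(keptDelegates(written, Set.of()));
  }
}
```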
Re: [PR] In the Flink catalog scenario, solve the problem of using SimpleKeyGe… [hudi]
danny0405 commented on code in PR #10227: URL: https://github.com/apache/hudi/pull/10227#discussion_r1413316108 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java: ## @@ -248,6 +250,36 @@ public void testCreateTable() throws Exception { // test create exist table assertThrows(TableAlreadyExistException.class, () -> catalog.createTable(tablePath, EXPECTED_CATALOG_TABLE, false)); + +// validate key generator for partitioned table +HoodieTableMetaClient metaClient = HoodieTableMetaClient Review Comment: You can use the `StreamerUtil.createMetaClient` for the initialization. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] In the Flink catalog scenario, solve the problem of using SimpleKeyGe… [hudi]
danny0405 commented on code in PR #10227: URL: https://github.com/apache/hudi/pull/10227#discussion_r1413315922 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java: ## @@ -248,6 +250,36 @@ public void testCreateTable() throws Exception { // test create exist table assertThrows(TableAlreadyExistException.class, () -> catalog.createTable(tablePath, EXPECTED_CATALOG_TABLE, false)); + +// validate key generator for partitioned table +HoodieTableMetaClient metaClient = HoodieTableMetaClient +.builder() +.setConf(new org.apache.hadoop.conf.Configuration()) +.setBasePath(catalog.inferTablePath(catalogPathStr, tablePath)) +.build(); +String keyGeneratorClassName = metaClient.getTableConfig().getKeyGeneratorClassName(); +assertEquals(keyGeneratorClassName, NonpartitionedAvroKeyGenerator.class.getName()); Review Comment: Why does the partitioned table use `NonpartitionedAvroKeyGenerator`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]
danny0405 commented on code in PR #10157: URL: https://github.com/apache/hudi/pull/10157#discussion_r1413315255 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java: ## @@ -96,12 +96,17 @@ import org.slf4j.LoggerFactory; import java.io.IOException; +import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; import java.util.HashMap; import java.util.List; import java.util.Map; +import static org.apache.flink.table.factories.FactoryUtil.CONNECTOR; +import static org.apache.flink.util.Preconditions.checkArgument; +import static org.apache.flink.util.Preconditions.checkNotNull; Review Comment: Maybe you should just import this checkstyle file: https://github.com/apache/hudi/blob/cd4f0de57522a681fbe5b62fd774c1943254ec2d/style/checkstyle.xml#L289 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] support log index [hudi]
danny0405 commented on PR #10143: URL: https://github.com/apache/hudi/pull/10143#issuecomment-1837752363 > I would like to ask why data with the same primary key is written to different log files (with the same FileId and different timestamps) in upsert mode? A key's lifecycle is maintained within one FileGroup; different log files simply reflect multiple changes to that key, scattered across multiple commits (see the illustrative listing below). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
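To make that concrete, a single file group might look like this hypothetical listing (instants simplified; names follow Hudi's usual base-file pattern `{fileId}_{writeToken}_{instantTime}.parquet` and log-file pattern `.{fileId}_{baseInstant}.log.{version}_{writeToken}`):

```
partition=2023-12-01/
  fileId1_0-1-0_20231201090000.parquet     <- base file from commit 20231201090000
  .fileId1_20231201090000.log.1_0-1-0      <- change to key k1 written by commit 20231201091000
  .fileId1_20231201090000.log.2_0-1-0      <- another change to k1 written by commit 20231201092000
```

The FileId stays constant for the file group; each later commit that touches keys in that group appends a new log file against the same base instant, until compaction merges the deltas back into a new base file.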
Re: [PR] [HUDI-7167] Resolve type conversion exception when Spark writes to Hudi and synch… [hudi]
danny0405 commented on code in PR #10229: URL: https://github.com/apache/hudi/pull/10229#discussion_r1413311630 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala: ## @@ -164,6 +161,20 @@ object CreateHoodieTableCommand { } } + private def convertFieldTypes(fieldTypes: StructType): StructType = { +val fields = new Array[StructField](fieldTypes.fields.length) +var index = 0 Review Comment: Should we just fix the hive sync schema instead? Modification in here affects beyond the Hive scope I guess. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
danny0405 commented on code in PR #10230: URL: https://github.com/apache/hudi/pull/10230#discussion_r1413310745 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java: ## @@ -185,6 +188,41 @@ public void testRecommitWithPartialUncommittedEvents() { assertThat("Recommits the instant with partial uncommitted events", lastCompleted, is(instant)); } + @Test + public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws Exception { +// reset +reset(); +// override the default configuration +Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +OperatorCoordinator.Context context = new MockOperatorCoordinatorContext(new OperatorID(), 1); +coordinator = new StreamWriteOperatorCoordinator(conf, context); +coordinator.start(); +coordinator.setExecutor(new MockCoordinatorExecutor(context)); + +coordinator.getWriteClient().getConfig() +.setValue(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY, HoodieFailedWritesCleaningPolicy.LAZY.name()); + assertTrue(coordinator.getWriteClient().getConfig().getFailedWritesCleanPolicy().isLazy()); + +final WriteMetadataEvent event0 = WriteMetadataEvent.emptyBootstrap(0); + +// start one instant and not commit it +coordinator.handleEventFromOperator(0, event0); +String instant = coordinator.getInstant(); +HoodieHeartbeatClient heartbeatClient = coordinator.getWriteClient().getHeartbeatClient(); +assertNotEquals(null, heartbeatClient.getHeartbeat(instant), "Heartbeat should not null"); + +String basePath = tempFile.getAbsolutePath(); +HoodieWrapperFileSystem fs = coordinator.getWriteClient().getHoodieTable().getMetaClient().getFs(); + +assertTrue(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), "Heartbeat is existed"); + +// send bootstrap event to stop the heartbeat for this instant +WriteMetadataEvent event1 = WriteMetadataEvent.emptyBootstrap(0); +coordinator.handleEventFromOperator(0, event1); + +assertFalse(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), "Heartbeat is stopped and not existed"); + } Review Comment: `Heartbeat is stopped and not existed` -> `Heartbeat is stopped and cleared` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
danny0405 commented on code in PR #10230: URL: https://github.com/apache/hudi/pull/10230#discussion_r1413310224 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java: ## @@ -185,6 +188,41 @@ public void testRecommitWithPartialUncommittedEvents() { assertThat("Recommits the instant with partial uncommitted events", lastCompleted, is(instant)); } + @Test + public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws Exception { +// reset +reset(); +// override the default configuration +Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +OperatorCoordinator.Context context = new MockOperatorCoordinatorContext(new OperatorID(), 1); +coordinator = new StreamWriteOperatorCoordinator(conf, context); +coordinator.start(); +coordinator.setExecutor(new MockCoordinatorExecutor(context)); + +coordinator.getWriteClient().getConfig() +.setValue(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY, HoodieFailedWritesCleaningPolicy.LAZY.name()); + assertTrue(coordinator.getWriteClient().getConfig().getFailedWritesCleanPolicy().isLazy()); + +final WriteMetadataEvent event0 = WriteMetadataEvent.emptyBootstrap(0); + +// start one instant and not commit it +coordinator.handleEventFromOperator(0, event0); +String instant = coordinator.getInstant(); +HoodieHeartbeatClient heartbeatClient = coordinator.getWriteClient().getHeartbeatClient(); +assertNotEquals(null, heartbeatClient.getHeartbeat(instant), "Heartbeat should not null"); + Review Comment: Use `assertNotNull`? And with better msg prompt: `Heartbeat is missing` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
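The suggested assertion would read (using JUnit 5's `assertNotNull(Object, String)`):

```java
assertNotNull(heartbeatClient.getHeartbeat(instant), "Heartbeat is missing");
```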
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
danny0405 commented on code in PR #10230: URL: https://github.com/apache/hudi/pull/10230#discussion_r1413309609 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java: ## @@ -185,6 +188,41 @@ public void testRecommitWithPartialUncommittedEvents() { assertThat("Recommits the instant with partial uncommitted events", lastCompleted, is(instant)); } + @Test + public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws Exception { +// reset +reset(); +// override the default configuration +Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +OperatorCoordinator.Context context = new MockOperatorCoordinatorContext(new OperatorID(), 1); +coordinator = new StreamWriteOperatorCoordinator(conf, context); +coordinator.start(); +coordinator.setExecutor(new MockCoordinatorExecutor(context)); + +coordinator.getWriteClient().getConfig() Review Comment: Why not just set up as lazy cleaning in the `conf`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
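For example (a sketch; it assumes the clean-policy value set on the Flink configuration is forwarded into the write client's config when the coordinator is built):

```java
// Sketch: declare lazy cleaning of failed writes up front in the Flink
// configuration, before constructing StreamWriteOperatorCoordinator,
// instead of mutating the write config after coordinator.start().
conf.setString(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY.key(),
    HoodieFailedWritesCleaningPolicy.LAZY.name());
```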
Re: [PR] [HUDI-7165] Flink multi writer not close the failed instant heartbeat [hudi]
ksmou commented on code in PR #10221: URL: https://github.com/apache/hudi/pull/10221#discussion_r1413290959 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java: ## @@ -420,6 +420,10 @@ private void initInstant(String instant) { } commitInstant(instant); } +// stop the heartbeat for old instant Review Comment: ut added. https://github.com/apache/hudi/pull/10230 @nsivabalan @danny0405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]
soumilshah1995 commented on issue #10231: URL: https://github.com/apache/hudi/issues/10231#issuecomment-1837665661 When using this
```
hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.partitionpath.field=order_date
hoodie.streamer.source.dfs.root=file:Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/sampledata/orders
hoodie.datasource.write.precombine.field=ts
hoodie.deltastreamer.csv.header=true
hoodie.deltastreamer.csv.sep=\t
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=orders
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.hive_sync.partition_fields=order_date
hoodie.datasource.write.hive_style_partitioning=true
```
Now I am seeing this error
```
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader jdk.internal.loader.ClassLoaders$AppClassLoader@67424e82, see the next exception for details.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
    ... 107 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/metastore_db.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
    at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
    at org.apache.derby.impl.store.raw.RawStore$6.run(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at org.apache.derby.impl.store.raw.RawStore.bootServiceModule(Unknown Source)
    at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
    at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
    at org.apache.derby.impl.store.access.RAMAccessManager$5.run(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at org.apache.derby.impl.store.access.RAMAccessManager.bootServiceModule(Unknown Source)
    at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
    at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
    at org.apache.derby.impl.db.BasicDatabase$5.run(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at org.apache.derby.impl.db.BasicDatabase.bootServiceModule(Unknown Source)
    at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)
    at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
    at org.apa
```
[I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]
soumilshah1995 opened a new issue, #10231: URL: https://github.com/apache/hudi/issues/10231 Hello everyone, I'm encountering a small issue that seems to be related to settings, and I would appreciate any guidance in identifying the problem. This pertains to my upcoming videos where I'm covering the Hudi Hive Sync tool in detail. I've started the Spark Thrift Server using the following command: ``` spark-submit \ --master 'local[*]' \ --conf spark.executor.extraJavaOptions=-Duser.timezone=Etc/UTC \ --conf spark.eventLog.enabled=false \ --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \ --name "Thrift JDBC/ODBC Server" \ --executor-memory 512m \ --packages org.apache.spark:spark-hive_2.12:3.4.0 ``` Additionally, I have Beeline installed and connected to the default database: ``` beeline -u jdbc:hive2://localhost:1/default ``` While my delta stream works fine, it appears that I'm facing issues using it with the Hive MetaStore. Here's my Spark submit command for the Hudi Delta Streamer: ``` spark-submit \ --class org.apache.hudi.utilities.streamer.HoodieStreamer \ --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.2' \ --repositories 'https://repo.maven.apache.org/maven2' \ --properties-file /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/spark-config.properties \ --master 'local[*]' \ --executor-memory 1g \ /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \ --table-type COPY_ON_WRITE \ --op UPSERT \ --enable-hive-sync \ --source-ordering-field ts \ --source-class org.apache.hudi.utilities.sources.CsvDFSSource \ --target-base-path file:///Users/soumilshah/Downloads/hudidb/ \ --target-table orders \ --props hudi_tbl.props ``` Hudi CONF ``` hoodie.datasource.write.recordkey.field=order_id hoodie.datasource.write.partitionpath.field=order_date hoodie.streamer.source.dfs.root=file:Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/sampledata/orders hoodie.datasource.write.precombine.field=ts hoodie.deltastreamer.csv.header=true hoodie.deltastreamer.csv.sep=\t hoodie.datasource.hive_sync.enable=true hoodie.datasource.hive_sync.mode=jdbc hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1 hoodie.datasource.hive_sync.database=default hoodie.datasource.hive_sync.table=orders hoodie.datasource.hive_sync.partition_fields=order_date ``` Spark Conf: ``` spark.serializer=org.apache.spark.serializer.KryoSerializer spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog spark.sql.hive.convertMetastoreParquet=false ``` The error I'm encountering is: ``` Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables" org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables" at org.datanucleus.store.rdbms.table.AbstractTable.exists(AbstractTable.java:606) at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3385) ``` Any assistance in identifying what might be missing or misconfigured would be highly appreciated. Thank you! 
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
hudi-bot commented on PR #10230: URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837597435 ## CI report: * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21284) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
hudi-bot commented on PR #10230: URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837556153 ## CI report: * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21284) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
hudi-bot commented on PR #10230: URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837554127 ## CI report: * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [HUDI-7168][FOLLOW-UP] Add test case for HUDI-7165 [hudi]
ksmou opened a new pull request, #10230: URL: https://github.com/apache/hudi/pull/10230

### Change Logs

Add a unit test.

### Impact

None.

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-7168) Add test cases for HUDI-7165
kwang created HUDI-7168:
---------------------------

             Summary: Add test cases for HUDI-7165
                 Key: HUDI-7168
                 URL: https://issues.apache.org/jira/browse/HUDI-7168
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: kwang

-- This message was sent by Atlassian Jira (v8.20.10#820010)