Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on PR #8062:
URL: https://github.com/apache/hudi/pull/8062#issuecomment-1837977625

   > @geserdugarov Thanks very much for your review! I think the most important 
part of the design to be confirmed is whether we need to provide a feature-rich 
but complicated TTL policy in the first place or just implement a simple but 
extensible policy.
   > 
   > @geserdugarov @codope @nsivabalan @vinothchandar Hope to hear your opinions on this ~
   
   Yes, you're right about the main difference in our ideas.





Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413459990


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy just like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more than this many days are considered expired. We will cover `KeepByTimeTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
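As an illustration only (not part of the RFC), a minimal sketch of a user-provided strategy against the abstract class above; the class name, constructor, and the `tmp/` prefix rule are assumptions made for the example:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical custom strategy: expires every partition under a "tmp/" prefix,
// regardless of age. Only getExpiredPartitionPaths() is required by the contract above.
public class ExpireTempPartitionsStrategy extends PartitionTTLManagementStrategy {

  private final List<String> allPartitionPaths;

  public ExpireTempPartitionsStrategy(List<String> allPartitionPaths) {
    this.allPartitionPaths = allPartitionPaths;
  }

  @Override
  public List<String> getExpiredPartitionPaths() {
    return allPartitionPaths.stream()
        .filter(path -> path.startsWith("tmp/"))
        .collect(Collectors.toList());
  }
}
```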
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than what `hoodie.partition.ttl.days.retain` configures, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as expired. We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this.
+
+We will use the largest commit time of committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by a replace commit, the commit time may not reveal the real insert/update time of the file group. However, we can assume that, when using this strategy, we won't cluster a partition that has had no new writes for a long time. In the future, we may introduce a more accurate mechanism to get a partition's `lastModifiedTime`, for example using the metadata table.
+
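As a rough editorial sketch of the rule described above (kept independent of Hudi's actual file-system-view API; the type and method names are illustrative, and the 17-digit `yyyyMMddHHmmssSSS` instant format is an assumption to be checked against the Hudi version in use):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Collection;

public final class KeepByTimeCheck {

  private static final DateTimeFormatter INSTANT_FORMAT =
      DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

  /** Largest committed instant time in the partition, i.e. its lastModifiedTime. */
  static String lastModifiedTime(Collection<String> committedInstantTimes) {
    // Instant times are fixed-width digit strings, so the lexicographic max is the latest one.
    return committedInstantTimes.stream().max(String::compareTo).orElseThrow();
  }

  /** True if the partition has not been modified for more than daysRetain days. */
  static boolean isExpired(String lastModifiedTime, LocalDateTime now, int daysRetain) {
    LocalDateTime modified = LocalDateTime.parse(lastModifiedTime, INSTANT_FORMAT);
    return Duration.between(modified, now).toDays() > daysRetain;
  }
}
```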
+### Apply different strategies for different partitions
+
+Some users may want to apply different strategies for different partitions. For example, they may have multiple partition fields (productId, day). For partitions 

Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413457079


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy just like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more than this many days are considered expired. We will cover `KeepByTimeTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than what `hoodie.partition.ttl.days.retain` configures, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as expired. We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this.
+
+We will use the largest commit time of committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by a replace commit, the commit time may not reveal the real insert/update time of the file group. However, we can assume that, when using this strategy, we won't cluster a partition that has had no new writes for a long time. In the future, we may introduce a more accurate mechanism to get a partition's `lastModifiedTime`, for example using the metadata table.
+
+### Apply different strategies for different partitions
+
+Some users may want to apply different strategies for different partitions. For example, they may have multiple partition fields (productId, day). For partitions 

Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413454783


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy just like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more than this many days are considered expired. We will cover `KeepByTimeTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than what `hoodie.partition.ttl.days.retain` configures, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as expired. We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this.
+
+We will use the largest commit time of committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by a replace commit, the commit time may not reveal the real insert/update time of the file group. However, we can assume that, when using this strategy, we won't cluster a partition that has had no new writes for a long time. In the future, we may introduce a more accurate mechanism to get a partition's `lastModifiedTime`, for example using the metadata table.
+
+### Apply different strategies for different partitions
+
+Some users may want to apply different strategies for different partitions. For example, they may have multiple partition fields (productId, day). For partitions 

Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413453307


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy just like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more than this many days are considered expired. We will cover `KeepByTimeTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than what `hoodie.partition.ttl.days.retain` configures, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as expired. We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this.
+
+We will use the largest commit time of committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by a replace commit, the commit time may not reveal the real insert/update time of the file group. However, we can assume that, when using this strategy, we won't cluster a partition that has had no new writes for a long time. In the future, we may introduce a more accurate mechanism to get a partition's `lastModifiedTime`, for example using the metadata table.
+
+### Apply different strategies for different partitions
+
+Some users may want to apply different strategies for different partitions. For example, they may have multiple partition fields (productId, day). For partitions 

Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1837964188

   
   ## CI report:
   
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   * 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1413451686


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala:
##
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model.HoodieMetadataColumnStats
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, 
HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.metadata.{HoodieTableMetadata, HoodieTableMetadataUtil}
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, 
asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ * 
+ * The overlap represents the extent to which the min-max ranges cover each 
other.

Review Comment:
   The suggested doc format for a new paragraph is:
   
   ```java
   
The overlap represents t 
   ```






Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413452143


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy just like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more than this many days are considered expired. We will cover `KeepByTimeTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than what `hoodie.partition.ttl.days.retain` configures, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as expired. We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this.
+
+We will use the largest commit time of committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by a replace commit, the commit time may not reveal the real insert/update time of the file group. However, we can assume that, when using this strategy, we won't cluster a partition that has had no new writes for a long time. In the future, we may introduce a more accurate mechanism to get a partition's `lastModifiedTime`, for example using the metadata table.
+
+### Apply different strategies for different partitions
+
+Some users may want to apply different strategies for different partitions. For example, they may have multiple partition fields (productId, day). For partitions 

Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1413446855


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -278,6 +293,19 @@ public static HoodieColumnRangeMetadata 
convertColumnStatsRecordToCo
 columnStats.getTotalUncompressedSize());
   }
 
+  public static Option getColumnStatsValueAsString(Object statsValue) {
+if (statsValue == null) {
+  System.out.println("Invalid value: " + statsValue);

Review Comment:
   Log instead of print in production code.
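A minimal, self-contained sketch of that suggestion, assuming SLF4J is on the classpath; it uses `java.util.Optional` instead of Hudi's own `Option` type, and the class and message are illustrative rather than the actual patch:

```java
import java.util.Optional;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class ColumnStatsValueFormatter {

  private static final Logger LOG = LoggerFactory.getLogger(ColumnStatsValueFormatter.class);

  public static Optional<String> getColumnStatsValueAsString(Object statsValue) {
    if (statsValue == null) {
      // Logging keeps the message under normal log configuration instead of stdout.
      LOG.warn("Invalid column stats value: null");
      return Optional.empty();
    }
    return Optional.of(statsValue.toString());
  }
}
```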






Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage . why is HoodieGlobalSimpleIndex ? [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on issue #10235:
URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837959640

   hoodie.metadata.table -> hoodie.metadata.enable
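For context, a hedged sketch of the corrected write configs: `hoodie.metadata.enable` is the key danny0405 points to, while the record index flag shown below reflects recent Hudi releases and should be verified against the version in use:

```properties
# Enable the metadata table (the key is hoodie.metadata.enable, not hoodie.metadata.table).
hoodie.metadata.enable=true
# The record index additionally needs its own flag; name as of Hudi 0.14.x, worth double-checking.
hoodie.metadata.record.index.enable=true
```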





Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413444578


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy just like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more than this many days are considered expired. We will cover `KeepByTimeTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than what `hoodie.partition.ttl.days.retain` configures, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as expired. We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this.
+
+We will use the largest commit time of committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.

Review Comment:
   Currently, `HoodiePartitionMetadata` (the `.hoodie_partition_metadata` files) provides only the `commitTime` and `partitionDepth` properties. This file is written during partition creation and is not updated later. We can make this file more useful by adding a new property, for instance `lastUpdateTime`, that saves the time of the last upsert/delete operation on the partition with each commit/deltacommit/replacecommit, and using this property for the partition TTL check.




Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413444578


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy just like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more than this many days are considered expired. We will cover `KeepByTimeTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than what `hoodie.partition.ttl.days.retain` configures, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as expired. We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this.
+
+We will use the largest commit time of committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.

Review Comment:
   Currently, HoodiePartitionMetadata, the `.hoodie_partition_metadata` files, provides only the commitTime and partitionDepth properties. This file is written during partition creation and is not updated later. We can make this file more useful by adding a new property, for instance `lastUpdateTime`, that saves the time of the last upsert/delete operation on the partition with each commit/deltacommit/replacecommit, and using this property for the partition TTL check.




Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413444578


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy just like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more than this many days are considered expired. We will cover `KeepByTimeTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than what `hoodie.partition.ttl.days.retain` configures, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as expired. We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this.
+
+We will use the largest commit time of committed file groups in the partition as the partition's `lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.

Review Comment:
   Currently, HoodiePartitionMetadata, the `.hoodie_partition_metadata` files, provides only the commitTime (the commit time of partition creation) and partitionDepth properties. This file is written during partition creation and is not updated later.
   
   We can add a new property, for instance `lastUpdateTime`, to save the time of the last upsert/delete operation on the partition with each commit/deltacommit/replacecommit, and use this property for the partition TTL check.
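To make the suggestion concrete, a hypothetical `.hoodie_partition_metadata` with the proposed key added; only `commitTime` and `partitionDepth` exist today, and `lastUpdateTime` plus all values shown are illustrative:

```properties
#partition metadata
commitTime=20231204093000000
partitionDepth=2
# Proposed addition, refreshed by each commit/deltacommit/replacecommit touching the partition:
lastUpdateTime=20231215101500000
```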




Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1837955494

   
   ## CI report:
   
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   * 4cc49f2a068603f35b7b4391a0d3d40af3397d43 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage . why is HoodieGlobalSimpleIndex ? [hudi]

2023-12-03 Thread via GitHub


zyclove commented on issue #10235:
URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837949851

   @danny0405 why does it fall back to GLOBAL_SIMPLE?
   
![image](https://github.com/apache/hudi/assets/15028279/20107e0d-46eb-4e28-9a5a-0fc8750cbc34)
   
   23/12/04 14:39:29 WARN SparkMetadataTableRecordIndex: Record index not 
initialized so falling back to GLOBAL_SIMPLE for tagging records
   





Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage . why is HoodieGlobalSimpleIndex ? [hudi]

2023-12-03 Thread via GitHub


zyclove commented on issue #10235:
URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837946117

   @danny0405 
   why does it fall back to GLOBAL_SIMPLE?
   ![image](https://github.com/apache/hudi/assets/15028279/9cddf011-e25c-4c0f-9b40-c2d7fdd17cf9)
   
   23/12/04 14:39:29 WARN SparkMetadataTableRecordIndex: Record index not 
initialized so falling back to GLOBAL_SIMPLE for tagging records
   





Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413436193


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy just like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more than this many days are considered expired. We will cover `KeepByTimeTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.

Review Comment:
   > The TTL policy will default to KEEP_BY_TIME and we can automatically detect expired partitions by their last modified time and delete them.
   
   Unfortunately, I don't see any other strategies except KEEP_BY_TIME for partition-level TTL.



##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions. However, users 

Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413434722


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi: users can configure the strategies directly through table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions.
+However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means they have to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME

Review Comment:
   I suppose it's better to implement only one kind of processing for partition TTL, based on time only. KEEP_BY_SIZE does not look suitable at the partition level; it is more appropriate for record-level processing.
   So, we don't need this setting: `hoodie.partition.ttl.strategy`.
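
   (As a rough sketch of what time-only processing means here: plain Java over an assumed map of partition path to last-modified time; these are not Hudi APIs.)

   ```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class KeepByTimeSketch {
  // Partitions whose last modification time is older than the retention
  // window are considered expired.
  static List<String> expiredPartitions(Map<String, Instant> lastModifiedPerPartition, int retainDays) {
    Instant cutoff = Instant.now().minus(Duration.ofDays(retainDays));
    return lastModifiedPerPartition.entrySet().stream()
        .filter(e -> e.getValue().isBefore(cutoff))
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }
}
   ```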



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1412716844


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala:
##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext

Review Comment:
   Should fix the import sequence; you can reference this checkstyle rule: 
https://github.com/apache/hudi/blob/cd4f0de57522a681fbe5b62fd774c1943254ec2d/style/checkstyle.xml#L289



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Comparison between defaultParName and partValue [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10234:
URL: https://github.com/apache/hudi/pull/10234#discussion_r1413431893


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PartitionPruners.java:
##
@@ -94,7 +94,7 @@ private boolean evaluate(String partition) {
   Map partStats = new LinkedHashMap<>();
   for (int idx = 0; idx < partitionKeys.length; idx++) {
 String partKey = partitionKeys[idx];
-Object partVal = partKey.equals(defaultParName)
+Object partVal = partStrArray[idx].equals(defaultParName)

Review Comment:
   The `partKey` is never used; maybe just switch it to `String partValStr = partStrArray[idx];` and use that variable instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10209:
URL: https://github.com/apache/hudi/pull/10209#discussion_r1413428497


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception {
 
 // case2: empty table without data files
 Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");

Review Comment:
   Hmm, maybe we should just fix the table type to be in line with `hoodie.properties` when there is an inconsistency, instead of throwing. WDYT?
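
   (A minimal sketch of that reconciliation idea, using a plain options map rather than the real Flink/Hudi classes; the key name `table.type` mirrors FlinkOptions.TABLE_TYPE, everything else is a simplified stand-in.)

   ```java
import java.util.Map;

class TableTypeReconciliationSketch {
  // If the declared option disagrees with hoodie.properties, align it with
  // the on-disk value instead of throwing a validation error.
  static void alignTableType(Map<String, String> tableOptions, String typeInHoodieProperties) {
    String declared = tableOptions.getOrDefault("table.type", typeInHoodieProperties);
    if (!declared.equalsIgnoreCase(typeInHoodieProperties)) {
      tableOptions.put("table.type", typeInHoodieProperties);
    }
  }
}
   ```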



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10209:
URL: https://github.com/apache/hudi/pull/10209#discussion_r1413426788


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception {
 
 // case2: empty table without data files
 Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");

Review Comment:
   Then why does the reader expect a `MOR` table if the table type was not specified in the table declaration?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]

2023-12-03 Thread via GitHub


hehuiyuan commented on code in PR #10209:
URL: https://github.com/apache/hudi/pull/10209#discussion_r1413426165


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception {
 
 // case2: empty table without data files
 Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");

Review Comment:
   It caused an error when executing `List<Row> rows2 = execSelectSql(streamTableEnv, "select * from t1", 10);`.
   
   For the t1 hudi table: the table options in the in-memory catalog say MOR, but `hoodie.properties` says COW.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


majian1998 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1413424885


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala:
##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, 
HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, 
asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ * The overlap represents the extent to which the min-max ranges cover each 
other.
+ * By referring to the overlap, we can visually demonstrate the degree of data 
skipping
+ * for different columns under the current table's data layout.
+ * The calculation is performed at the partition level (assuming that data 
skipping is based on partition pruning).
+ *
+ * For example, consider three files: a.parquet, b.parquet, and c.parquet.
+ * Taking an integer-type column 'id' as an example, the range (min-max) for 
'a' is 1–5,
+ * for 'b' is 3–7, and for 'c' is 7–8. This results in their values 
overlapping on the coordinate axis as follows:
+ * Value Range: 1 2 3 4 5 6 7 8
+ * a.parquet:   [-------]
+ * b.parquet:       [-------]
+ * c.parquet:               [-]
+ * Thus, there will be overlap within the ranges 3–5 and 7.
+ * If the filter conditions for 'id' during data skipping include these values,
+ * multiple files will be filtered out. For a simpler case, if it's an 
equality query,
+ * 2 files will be filtered within these ranges, and no more than one file 
will be filtered in other cases (possibly outside of the range).
+ *
+ * Additionally, calculating the degree of overlap based solely on the maximum 
values
+ * may not provide sufficient information. Therefore, we sample and calculate 
the overlap degree
+ * for all values involved in the min-max range. We also compute the degree of 
overlap
+ * at different percentiles and tally the count of these values. An example of a result is as follows:
+ * |Partition path |Field name |Average overlap |Maximum file overlap |Total file number |50% overlap |75% overlap |95% overlap |99% overlap |Total value number |
+ * |---------------|-----------|----------------|---------------------|------------------|------------|------------|------------|------------|-------------------|
+ * |path           |c8         |1.33            |2                    |2                 |1           |1           |1           |1           |3                  |
+
+ */
+class ShowColumnStatsOverlapProcedure extends BaseProcedure with 
ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+ProcedureParameter.required(0, "table", DataTypes.StringType),
+ProcedureParameter.optional(1, "partition", DataTypes.StringType),
+ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+
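
(Editorial aside: a toy computation of the overlap metric described in the scaladoc above, using the a/b/c ranges from that example. This is not the procedure's code, only an illustration of the per-value counting.)

```java
import java.util.IntSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

public class OverlapToy {
  public static void main(String[] args) {
    // min-max ranges of column 'id' for a.parquet, b.parquet, c.parquet
    int[][] ranges = {{1, 5}, {3, 7}, {7, 8}};
    Map<Integer, Integer> overlapPerValue = new TreeMap<>();
    for (int[] r : ranges) {
      for (int v = r[0]; v <= r[1]; v++) {
        overlapPerValue.merge(v, 1, Integer::sum);
      }
    }
    IntSummaryStatistics stats = overlapPerValue.values().stream()
        .mapToInt(Integer::intValue).summaryStatistics();
    System.out.println("per-value file overlap: " + overlapPerValue);
    System.out.println("average overlap = " + stats.getAverage()
        + ", maximum file overlap = " + stats.getMax()
        + ", total value number = " + overlapPerValue.size());
  }
}
```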

Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]

2023-12-03 Thread via GitHub


hehuiyuan commented on code in PR #10209:
URL: https://github.com/apache/hudi/pull/10209#discussion_r1413405718


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception {
 
 // case2: empty table without data files
 Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");

Review Comment:
   The test error log:
   ```
   org.apache.flink.table.api.ValidationException: Unable to create a source 
for reading table 'default_catalog.default_database.t1'.
   
   Table options are:
   
   'connector'='hudi'
   
'path'='/var/folders/k1/65gcjk_92ws2bjh3ftpz33fcgp/T/junit1749019659644883700'
   'read.streaming.enabled'='true'
   'table.type'='MERGE_ON_READ'
   
at 
org.apache.flink.table.factories.FactoryUtil.createDynamicTableSource(FactoryUtil.java:219)
at 
org.apache.flink.table.factories.FactoryUtil.createDynamicTableSource(FactoryUtil.java:244)
at 
org.apache.flink.table.planner.plan.schema.CatalogSourceTable.createDynamicTableSource(CatalogSourceTable.java:175)
at 
org.apache.flink.table.planner.plan.schema.CatalogSourceTable.toRel(CatalogSourceTable.java:115)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.toRel(SqlToRelConverter.java:3997)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.convertIdentifier(SqlToRelConverter.java:2867)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2427)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2341)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2286)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.convertSelectImpl(SqlToRelConverter.java:723)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.convertSelect(SqlToRelConverter.java:709)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.convertQueryRecursive(SqlToRelConverter.java:3843)
at 
org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(SqlToRelConverter.java:617)
at 
org.apache.flink.table.planner.calcite.FlinkPlannerImpl.org$apache$flink$table$planner$calcite$FlinkPlannerImpl$$rel(FlinkPlannerImpl.scala:229)
at 
org.apache.flink.table.planner.calcite.FlinkPlannerImpl.rel(FlinkPlannerImpl.scala:205)
at 
org.apache.flink.table.planner.operations.SqlNodeConvertContext.toRelRoot(SqlNodeConvertContext.java:69)
at 
org.apache.flink.table.planner.operations.converters.SqlQueryConverter.convertSqlNode(SqlQueryConverter.java:48)
at 
org.apache.flink.table.planner.operations.converters.SqlNodeConverters.convertSqlNode(SqlNodeConverters.java:73)
at 
org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convertValidatedSqlNode(SqlNodeToOperationConversion.java:272)
at 
org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convertValidatedSqlNodeOrFail(SqlNodeToOperationConversion.java:390)
at 
org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convertSqlInsert(SqlNodeToOperationConversion.java:745)
at 
org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convertValidatedSqlNode(SqlNodeToOperationConversion.java:353)
at 
org.apache.flink.table.planner.operations.SqlNodeToOperationConversion.convert(SqlNodeToOperationConversion.java:262)
at 
org.apache.flink.table.planner.delegation.ParserImpl.parse(ParserImpl.java:106)
at 
org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:728)
at 
org.apache.hudi.table.ITTestHoodieDataSource.submitSelectSql(ITTestHoodieDataSource.java:2192)
at 
org.apache.hudi.table.ITTestHoodieDataSource.execSelectSql(ITTestHoodieDataSource.java:2185)
at 
org.apache.hudi.table.ITTestHoodieDataSource.execSelectSql(ITTestHoodieDataSource.java:2180)
at 
org.apache.hudi.table.ITTestHoodieDataSource.execSelectSql(ITTestHoodieDataSource.java:2166)
at 
org.apache.hudi.table.ITTestHoodieDataSource.testStreamReadEmptyTablePath(ITTestHoodieDataSource.java:1026)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
at 

Re: [PR] Comparison between defaultParName and partValue [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10234:
URL: https://github.com/apache/hudi/pull/10234#issuecomment-1837908541

   
   ## CI report:
   
   * a21ab0df2ee9e6a24219a6973624784a37017d00 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21286)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage . why is HoodieGlobalSimpleIndex ? [hudi]

2023-12-03 Thread via GitHub


zyclove opened a new issue, #10235:
URL: https://github.com/apache/hudi/issues/10235

   
   **Describe the problem you faced**
   
   The Spark job is too slow in the following stage. Adjusting CPU, memory, and concurrency has no effect.
   Which stage can be optimized or skipped?
   
   
![image](https://github.com/apache/hudi/assets/15028279/e4122bc3-e02b-4f01-9010-737300b85bed)
   
   Is this normal? Why still use HoodieGlobalSimpleIndex?
   
![image](https://github.com/apache/hudi/assets/15028279/89cb305f-bc23-40a7-ac00-0adab5933b53)
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. table config
   ```
   CREATE  TABLE if NOT EXISTS bi_dw_real.smart_datapoint_report_rw_clear_rt(
 id STRING COMMENT 'id',
 uuid STRING COMMENT 'log uuid',
 data_id STRING COMMENT '',
 dev_id STRING COMMENT '',
 gw_id STRING COMMENT '',
 product_id STRING COMMENT '',
 uid STRING COMMENT '',
 dp_code STRING COMMENT '',
 dp_id STRING COMMENT '',
  dp_mode STRING COMMENT '',
 dp_name STRING COMMENT '',
 dp_time STRING COMMENT '',
 dp_type STRING COMMENT '',
 dp_value STRING COMMENT '',
  gmt_modified BIGINT COMMENT 'ct time',
  dt STRING COMMENT 'time partition field'
   )
   using hudi 
   PARTITIONED BY (dt,dp_mode)
   COMMENT ''
   location '${bi_db_dir}/bi_ods_real/ods_smart_datapoint_report_rw_clear_rt'
   tblproperties (
 type = 'mor',
 primaryKey = 'id',
 preCombineField = 'gmt_modified',
 hoodie.combine.before.upsert='false',
 hoodie.metadata.record.index.enable='true',
 hoodie.datasource.write.operation='upsert',
 hoodie.metadata.table='true',
 hoodie.datasource.write.hive_style_partitioning='true',
 hoodie.metadata.record.index.min.filegroup.count ='512',
 hoodie.index.type='RECORD_INDEX',
 hoodie.compact.inline='false',
 hoodie.common.spillable.diskmap.type='ROCKS_DB',
 hoodie.datasource.write.partitionpath.field='dt,dp_mode',
 
hoodie.compaction.payload.class='org.apache.hudi.common.model.PartialUpdateAvroPayload'
)
   ;
   
   set 
hoodie.write.lock.zookeeper.lock_key=bi_ods_real.smart_datapoint_report_rw_clear_rt;
   set hoodie.storage.layout.type=DEFAULT;
   set hoodie.metadata.record.index.enable=true;
   set hoodie.metadata.table=true;
   set hoodie.populate.meta.fields=false;
   set hoodie.parquet.compression.codec=snappy;
   set hoodie.memory.merge.max.size=200485760;
   set hoodie.write.buffer.limit.bytes=419430400;
   set hoodie.index.type=RECORD_INDEX;
   ``` 
   3. insert into bi_dw_real.smart_datapoint_report_rw_clear_rt
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :0.14.0
   
   * Spark version :3.2.1
   
   * Hive version :3.1.3
   
   * Hadoop version :3.2.2
   
   * Storage (HDFS/S3/GCS..) :s3
   
   * Running on Docker? (yes/no) :no
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Comparison between defaultParName and partValue [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10234:
URL: https://github.com/apache/hudi/pull/10234#issuecomment-1837901855

   
   ## CI report:
   
   * a21ab0df2ee9e6a24219a6973624784a37017d00 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Comparison between defaultParName and partValue [hudi]

2023-12-03 Thread via GitHub


hehuiyuan opened a new pull request, #10234:
URL: https://github.com/apache/hudi/pull/10234

   ### Change Logs
   
   Comparison between defaultParName and partValue
   
   ### Impact
   
   
   
   ### Risk level (write none, low medium or high below)
   
   
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] With autogenerated keys HoodieStreamer failing with error - ts(Part -ts) field not found in record [hudi]

2023-12-03 Thread via GitHub


Sarfaraz-214 commented on issue #10233:
URL: https://github.com/apache/hudi/issues/10233#issuecomment-1837838654

   Hi @Amar1404 - I do not want to have a pre-combine key. I want to dump all 
the data I get as is. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] With autogenerated keys HoodieStreamer failing with error - ts(Part -ts) field not found in record [hudi]

2023-12-03 Thread via GitHub


Amar1404 commented on issue #10233:
URL: https://github.com/apache/hudi/issues/10233#issuecomment-1837836791

   Hi, @Sarfaraz-214 - HoodieStreamer automatically takes `ts` as the default value of hoodie.datasource.write.precombine.field. You need to pass a value for **hoodie.datasource.write.precombine.field** in the properties file; that field takes precedence when choosing between two records that have the same primary key.
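
   (For illustration only, the effect of the precombine field can be sketched as below: conceptual Java, not Hudi's payload/merger code; `orderingValue` stands for the configured precombine field such as `ts`.)

   ```java
// Conceptual sketch: when two incoming records share a primary key, the one
// with the larger precombine/ordering value wins.
record IncomingRecord(String recordKey, long orderingValue, String payload) {}

class PrecombineSketch {
  static IncomingRecord precombine(IncomingRecord a, IncomingRecord b) {
    return a.orderingValue() >= b.orderingValue() ? a : b;
  }
}
   ```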


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] With autogenerated keys HoodieStreamer failing with error - ts(Part -ts) field not found in record [hudi]

2023-12-03 Thread via GitHub


Sarfaraz-214 opened a new issue, #10233:
URL: https://github.com/apache/hudi/issues/10233

   I am using HoodieStreamer with **Hudi 0.14** and trying to leverage 
[autogenerated 
keys](https://hudi.apache.org/releases/release-0.14.0/#support-for-hudi-tables-with-autogenerated-keys).
 Hence I am not passing **hoodie.datasource.write.recordkey.field** & 
**hoodie.datasource.write.precombine.field** .
   
   Additionally, I am passing **hoodie.spark.sql.insert.into.operation = insert** (instead of --op insert), which is documented as not requiring a pre-combine key for the bulk_insert and insert modes.
   
   With the above, the **.hoodie** directory gets created, but the data write to GCS fails with this error -
   ```
   org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in 
record. Acceptable fields were :[c1, c2, c3, c4, c5]
   at 
org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:601)
   ```
   
   I also see that in the **hoodie.properties** file the pre-combine key is getting set to **ts** (hoodie.table.precombine.field=ts). It seems this is set due to the default value of **--source-ordering-field**. How can we skip the pre-combine field in this case?
   
   This is happening for both CoW & MoR tables.
   
   Actually this is running fine via Spark-SQL, but while using HoodieStreamer 
I am facing the issue. 
   
   Sharing the configurations used:
   
   **hudi-table.properties**
   ```
   hoodie.datasource.write.partitionpath.field=job_id
   hoodie.spark.sql.insert.into.operation=insert
   bootstrap.servers=***
   security.protocol=SASL_SSL
   sasl.mechanism=PLAIN
   sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule 
required username='***' password='***';
   auto.offset.reset=earliest
   hoodie.deltastreamer.source.kafka.topic=
   
hoodie.deltastreamer.schemaprovider.source.schema.file=gs:///.avsc
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
   
   ```
   
   **spark-submit command**
   ```
   spark-submit \
   --class org.apache.hudi.utilities.streamer.HoodieStreamer \
   --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.0, \
   --properties-file /home/sarfaraz_h/spark-config.properties \
   --master yarn \
   --deploy-mode cluster \
   --driver-memory 12G \
   --driver-cores 3 \
   --executor-memory 12G \
   --executor-cores 3 \
   --num-executors 3 \
   --conf spark.yarn.maxAppAttempts=1 \
   --conf spark.sql.shuffle.partitions=18 \
   gs:///jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
   --continuous \
   --source-limit 100 \
   --min-sync-interval-seconds 600 \
   --table-type COPY_ON_WRITE \
   --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
   --target-base-path gs:/// \
   --target-table  \
   --schemaprovider-class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
   --props gs:///configfolder/es_user_profile_config.properties
   ```
   
   Spark version used is - 3.3.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10230:
URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837811537

   
   ## CI report:
   
   * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21284)
 
   * 683b087f5a248c47b3edb95ee92cef71d8c0f506 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21285)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] SparkSQL hangs indefinitely during Hudi table read operation [hudi]

2023-12-03 Thread via GitHub


jonathantransb opened a new issue, #10232:
URL: https://github.com/apache/hudi/issues/10232

   **Describe the problem you faced**
   
   I'm attempting to read a Hudi table on Glue Catalog using SparkSQL with 
metadata enabled. However, my job appears to hang indefinitely at a certain 
step. Despite enabling DEBUG logs, I'm unable to find any indications of what 
may be causing this issue. Notably, this problem only occurs with Hudi tables 
where `clean` is the latest action in the timeline.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a Hudi table where `clean` is the latest action in the timeline 
   (screenshot: https://github.com/apache/hudi/assets/60864800/813c8190-8a47-48ed-8d8c-31dc2a29ff4b)
   
   2. Open spark-shell
   ```bash
   spark-shell \
   --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
   --conf "spark.sql.parquet.filterPushdown=true" \
   --conf "spark.sql.parquet.mergeSchema=false" \
   --conf "spark.speculation=false" \
   --conf "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" \
   --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
   --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
   --conf 
"spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
 \
   --conf "spark.sql.catalogImplementation=hive" \
   --conf "spark.sql.catalog.spark_catalog.type=hive" \
   --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog"
 \
   --conf 
"spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
   --conf "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar"
   ```
   
   3. Run spark.sql():
   ```bash
   scala> spark.sql("SET hoodie.metadata.enable=true")
   scala> spark.sql("SELECT * FROM . LIMIT 50").show()
   ```
   
   **Expected behavior**
   
   Spark job can read the table without hanging
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.4.1
   
   * Hive version : 2.3.9
   
   * Hadoop version : 3.3.6
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   
   I encountered no issues while using Hudi version 0.13.1. However, upon 
trying the new Hudi 0.14.0 version, I experienced this problem.
   
   For tables where `commit` is the latest action in the timeline, Hudi 0.14.0 
can read the table without any hanging issues.
   
   (screenshot: https://github.com/apache/hudi/assets/60864800/e8375fe0-7858-4779-b397-dc23f006c7dc)
   
   The driver pod consistently uses up to 1 CPU core, although I'm uncertain 
about the processes that are running:
   
   (screenshot: https://github.com/apache/hudi/assets/60864800/d11979f7-c4ac-4703-80dd-6e122152b8fb)
   
   
   **Stacktrace**
   
   ```
   23/12/03 12:21:23 INFO HiveConf: Found configuration file 
file:/opt/spark/conf/hive-site.xml
   23/12/03 12:21:23 INFO HiveClientImpl: Warehouse location for Hive client 
(version 2.3.9) is file:/opt/spark/work-dir/spark-warehouse
   23/12/03 12:21:23 INFO AWSGlueClientFactory: Using region from ec2 metadata 
: ap-southeast-1
   23/12/03 12:21:24 INFO AWSGlueClientFactory: Using region from ec2 metadata 
: ap-southeast-1
   23/12/03 12:21:26 WARN MetricsConfig: Cannot locate configuration: tried 
hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
   23/12/03 12:21:26 INFO MetricsSystemImpl: Scheduled Metric snapshot period 
at 10 second(s).
   23/12/03 12:21:26 INFO MetricsSystemImpl: s3a-file-system metrics system 
started
   23/12/03 12:21:27 WARN SDKV2Upgrade: Directly referencing AWS SDK V1 
credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS 
SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
   23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Cannot find 
HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
   23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Properties file 
file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
   23/12/03 12:21:28 INFO DataSourceUtils: Getting table path..
   23/12/03 12:21:28 INFO TablePathUtils: Getting table path from path : 
s3:
   23/12/03 12:21:28 INFO DefaultSource: Obtained hudi table path: 
s3:
   23/12/03 12:21:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3:
   23/12/03 12:21:28 INFO HoodieTableConfig: Loading table properties from 
s3:/.hoodie/hoodie.properties
   23/12/03 12:21:28 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3:
   23/12/03 12:21:28 INFO DefaultSource: Is bootstrapped table => false, 
tableType is: COPY_ON_WRITE, queryType is: snapshot
   23/12/03 12:21:28 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
   23/12/03 12:21:28 INFO 

Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10230:
URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837806459

   
   ## CI report:
   
   * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21284)
 
   * 683b087f5a248c47b3edb95ee92cef71d8c0f506 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-03 Thread via GitHub


ksmou commented on code in PR #10230:
URL: https://github.com/apache/hudi/pull/10230#discussion_r1413331856


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java:
##
@@ -185,6 +188,41 @@ public void testRecommitWithPartialUncommittedEvents() {
 assertThat("Recommits the instant with partial uncommitted events", 
lastCompleted, is(instant));
   }
 
+  @Test
+  public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws 
Exception {
+// reset
+reset();
+// override the default configuration
+Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+OperatorCoordinator.Context context = new 
MockOperatorCoordinatorContext(new OperatorID(), 1);
+coordinator = new StreamWriteOperatorCoordinator(conf, context);
+coordinator.start();
+coordinator.setExecutor(new MockCoordinatorExecutor(context));
+
+coordinator.getWriteClient().getConfig()
+.setValue(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY, 
HoodieFailedWritesCleaningPolicy.LAZY.name());
+
assertTrue(coordinator.getWriteClient().getConfig().getFailedWritesCleanPolicy().isLazy());
+
+final WriteMetadataEvent event0 = WriteMetadataEvent.emptyBootstrap(0);
+
+// start one instant and not commit it
+coordinator.handleEventFromOperator(0, event0);
+String instant = coordinator.getInstant();
+HoodieHeartbeatClient heartbeatClient = 
coordinator.getWriteClient().getHeartbeatClient();
+assertNotEquals(null, heartbeatClient.getHeartbeat(instant), "Heartbeat 
should not null");
+
+String basePath = tempFile.getAbsolutePath();
+HoodieWrapperFileSystem fs = 
coordinator.getWriteClient().getHoodieTable().getMetaClient().getFs();
+
+assertTrue(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), 
"Heartbeat is existed");
+
+// send bootstrap event to stop the heartbeat for this instant
+WriteMetadataEvent event1 = WriteMetadataEvent.emptyBootstrap(0);
+coordinator.handleEventFromOperator(0, event1);
+
+assertFalse(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), 
"Heartbeat is stopped and not existed");
+  }

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6822] Fix deletes handling in hbase index when partition path is updated [hudi]

2023-12-03 Thread via GitHub


flashJd commented on code in PR #9630:
URL: https://github.com/apache/hudi/pull/9630#discussion_r1413320714


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1402,38 +1397,13 @@ private void fetchOutofSyncFilesRecordsFromMetadataTable(Map getRecordIndexUpdates(HoodieData writeStatuses) {
-    HoodiePairData recordKeyDelegatePairs = null;
-    // if update partition path is true, chances that we might get two records (1 delete in older partition and 1 insert to new partition)
-    // and hence we might have to do reduce By key before ingesting to RLI partition.
-    if (dataWriteConfig.getRecordIndexUpdatePartitionPath()) {
-      recordKeyDelegatePairs = writeStatuses.map(writeStatus -> writeStatus.getWrittenRecordDelegates().stream()
-          .map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), recordDelegate)))
-          .flatMapToPair(Stream::iterator)
-          .reduceByKey((recordDelegate1, recordDelegate2) -> {
-            if (recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) {
-              if (!recordDelegate1.getNewLocation().isPresent() && !recordDelegate2.getNewLocation().isPresent()) {
-                throw new HoodieIOException("Both version of records do not have location set. Record V1 " + recordDelegate1.toString()
-                    + ", Record V2 " + recordDelegate2.toString());
-              }
-              if (recordDelegate1.getNewLocation().isPresent()) {
-                return recordDelegate1;
-              } else {
-                // if record delegate 1 does not have location set, record delegate 2 should have location set.
-                return recordDelegate2;
-              }
-            } else {
-              return recordDelegate1;
-            }
-          }, Math.max(1, writeStatuses.getNumPartitions()));
-    } else {
-      // if update partition path = false, we should get only one entry per record key.
-      recordKeyDelegatePairs = writeStatuses.flatMapToPair(
-          (SerializableFunction>>) writeStatus
-              -> writeStatus.getWrittenRecordDelegates().stream().map(rec -> Pair.of(rec.getRecordKey(), rec)).iterator());
-    }
-    return recordKeyDelegatePairs
-        .map(writeStatusRecordDelegate -> {
-          HoodieRecordDelegate recordDelegate = writeStatusRecordDelegate.getValue();
+    return writeStatuses.flatMap(writeStatus -> {
+      List recordList = new LinkedList<>();
+      for (HoodieRecordDelegate recordDelegate : writeStatus.getWrittenRecordDelegates()) {
+        if (!writeStatus.isErrored(recordDelegate.getHoodieKey())) {
+          if (recordDelegate.getIgnoreFlag()) {

Review Comment:
   Yeah, that's right: `we are setting the ignore flag only in indexing code, and specifically when indexing could return two versions of the record delegate.`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] In the Flink catalog scenario, solve the problem of using SimpleKeyGe… [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10227:
URL: https://github.com/apache/hudi/pull/10227#discussion_r1413316108


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java:
##
@@ -248,6 +250,36 @@ public void testCreateTable() throws Exception {
 // test create exist table
 assertThrows(TableAlreadyExistException.class,
 () -> catalog.createTable(tablePath, EXPECTED_CATALOG_TABLE, false));
+
+// validate key generator for partitioned table
+HoodieTableMetaClient metaClient = HoodieTableMetaClient

Review Comment:
   You can use the `StreamerUtil.createMetaClient` for the initialization.
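   
   Roughly something like the sketch below (assuming a `StreamerUtil.createMetaClient(basePath, hadoopConf)` style helper; the exact overload should be checked against the current `StreamerUtil`):
   
   ```java
   // Hedged sketch: replaces the manual HoodieTableMetaClient.builder() chain in the test.
   // The createMetaClient overload used here is an assumption, not a confirmed signature.
   String hudiTablePath = catalog.inferTablePath(catalogPathStr, tablePath);
   HoodieTableMetaClient metaClient =
       StreamerUtil.createMetaClient(hudiTablePath, new org.apache.hadoop.conf.Configuration());
   String keyGeneratorClassName = metaClient.getTableConfig().getKeyGeneratorClassName();
   ```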



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] In the Flink catalog scenario, solve the problem of using SimpleKeyGe… [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10227:
URL: https://github.com/apache/hudi/pull/10227#discussion_r1413315922


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java:
##
@@ -248,6 +250,36 @@ public void testCreateTable() throws Exception {
 // test create exist table
 assertThrows(TableAlreadyExistException.class,
 () -> catalog.createTable(tablePath, EXPECTED_CATALOG_TABLE, false));
+
+    // validate key generator for partitioned table
+    HoodieTableMetaClient metaClient = HoodieTableMetaClient
+        .builder()
+        .setConf(new org.apache.hadoop.conf.Configuration())
+        .setBasePath(catalog.inferTablePath(catalogPathStr, tablePath))
+        .build();
+    String keyGeneratorClassName = metaClient.getTableConfig().getKeyGeneratorClassName();
+    assertEquals(keyGeneratorClassName, NonpartitionedAvroKeyGenerator.class.getName());

Review Comment:
   Why does a partitioned table use `NonpartitionedAvroKeyGenerator`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7135] Spark reads hudi table error when flink creates the table without pre… [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10157:
URL: https://github.com/apache/hudi/pull/10157#discussion_r1413315255


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java:
##
@@ -96,12 +96,17 @@
 import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
+import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 
+import static org.apache.flink.table.factories.FactoryUtil.CONNECTOR;
+import static org.apache.flink.util.Preconditions.checkArgument;
+import static org.apache.flink.util.Preconditions.checkNotNull;

Review Comment:
   Maybe you should just import this checkstyle file: https://github.com/apache/hudi/blob/cd4f0de57522a681fbe5b62fd774c1943254ec2d/style/checkstyle.xml#L289



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] support log index [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on PR #10143:
URL: https://github.com/apache/hudi/pull/10143#issuecomment-1837752363

   > I would like to ask why data with the same primary key is written to different log files (with the same FileId and different timestamps) in upsert mode?
   
   The lifecycle of a primary key is maintained within one FileGroup; different log files may indicate multiple changes to that key, scattered across multiple commits.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7167] Resolve type conversion exception when Spark writes to Hudi and synch… [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10229:
URL: https://github.com/apache/hudi/pull/10229#discussion_r1413311630


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala:
##
@@ -164,6 +161,20 @@ object CreateHoodieTableCommand {
 }
   }
 
+  private def convertFieldTypes(fieldTypes: StructType): StructType = {
+val fields = new Array[StructField](fieldTypes.fields.length)
+var index = 0

Review Comment:
   Should we just fix the Hive sync schema instead? A modification here affects more than just the Hive scope, I guess.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10230:
URL: https://github.com/apache/hudi/pull/10230#discussion_r1413310745


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java:
##
@@ -185,6 +188,41 @@ public void testRecommitWithPartialUncommittedEvents() {
 assertThat("Recommits the instant with partial uncommitted events", 
lastCompleted, is(instant));
   }
 
+  @Test
+  public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws 
Exception {
+// reset
+reset();
+// override the default configuration
+Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+OperatorCoordinator.Context context = new 
MockOperatorCoordinatorContext(new OperatorID(), 1);
+coordinator = new StreamWriteOperatorCoordinator(conf, context);
+coordinator.start();
+coordinator.setExecutor(new MockCoordinatorExecutor(context));
+
+coordinator.getWriteClient().getConfig()
+.setValue(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY, 
HoodieFailedWritesCleaningPolicy.LAZY.name());
+
assertTrue(coordinator.getWriteClient().getConfig().getFailedWritesCleanPolicy().isLazy());
+
+final WriteMetadataEvent event0 = WriteMetadataEvent.emptyBootstrap(0);
+
+// start one instant and not commit it
+coordinator.handleEventFromOperator(0, event0);
+String instant = coordinator.getInstant();
+HoodieHeartbeatClient heartbeatClient = 
coordinator.getWriteClient().getHeartbeatClient();
+assertNotEquals(null, heartbeatClient.getHeartbeat(instant), "Heartbeat 
should not null");
+
+String basePath = tempFile.getAbsolutePath();
+HoodieWrapperFileSystem fs = 
coordinator.getWriteClient().getHoodieTable().getMetaClient().getFs();
+
+assertTrue(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), 
"Heartbeat is existed");
+
+// send bootstrap event to stop the heartbeat for this instant
+WriteMetadataEvent event1 = WriteMetadataEvent.emptyBootstrap(0);
+coordinator.handleEventFromOperator(0, event1);
+
+assertFalse(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), 
"Heartbeat is stopped and not existed");
+  }

Review Comment:
   `Heartbeat is stopped and not existed` -> `Heartbeat is stopped and cleared`
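   
   Applied to the assertion in the hunk above, only the message text changes:
   
   ```java
   // Same check as in the diff, with the clearer message suggested here.
   assertFalse(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant),
       "Heartbeat is stopped and cleared");
   ```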



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10230:
URL: https://github.com/apache/hudi/pull/10230#discussion_r1413310224


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java:
##
@@ -185,6 +188,41 @@ public void testRecommitWithPartialUncommittedEvents() {
 assertThat("Recommits the instant with partial uncommitted events", 
lastCompleted, is(instant));
   }
 
+  @Test
+  public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws 
Exception {
+// reset
+reset();
+// override the default configuration
+Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+OperatorCoordinator.Context context = new 
MockOperatorCoordinatorContext(new OperatorID(), 1);
+coordinator = new StreamWriteOperatorCoordinator(conf, context);
+coordinator.start();
+coordinator.setExecutor(new MockCoordinatorExecutor(context));
+
+coordinator.getWriteClient().getConfig()
+.setValue(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY, 
HoodieFailedWritesCleaningPolicy.LAZY.name());
+
assertTrue(coordinator.getWriteClient().getConfig().getFailedWritesCleanPolicy().isLazy());
+
+final WriteMetadataEvent event0 = WriteMetadataEvent.emptyBootstrap(0);
+
+// start one instant and not commit it
+coordinator.handleEventFromOperator(0, event0);
+String instant = coordinator.getInstant();
+HoodieHeartbeatClient heartbeatClient = 
coordinator.getWriteClient().getHeartbeatClient();
+assertNotEquals(null, heartbeatClient.getHeartbeat(instant), "Heartbeat 
should not null");
+

Review Comment:
   Use `assertNotNull`? And with a better message, e.g. `Heartbeat is missing`
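   
   For example (JUnit 5 style, matching the other assertions in this test):
   
   ```java
   // assertNotNull with the suggested message instead of assertNotEquals(null, ...).
   assertNotNull(heartbeatClient.getHeartbeat(instant), "Heartbeat is missing");
   ```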



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10230:
URL: https://github.com/apache/hudi/pull/10230#discussion_r1413309609


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java:
##
@@ -185,6 +188,41 @@ public void testRecommitWithPartialUncommittedEvents() {
 assertThat("Recommits the instant with partial uncommitted events", 
lastCompleted, is(instant));
   }
 
+  @Test
+  public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws 
Exception {
+// reset
+reset();
+// override the default configuration
+Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+OperatorCoordinator.Context context = new 
MockOperatorCoordinatorContext(new OperatorID(), 1);
+coordinator = new StreamWriteOperatorCoordinator(conf, context);
+coordinator.start();
+coordinator.setExecutor(new MockCoordinatorExecutor(context));
+
+coordinator.getWriteClient().getConfig()

Review Comment:
   Why not just set up lazy cleaning in the `conf` directly?
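   
   A rough sketch of that alternative, assuming `hoodie.*` keys set on the Flink `Configuration` are forwarded into the write config (worth double-checking against `StreamerUtil`):
   
   ```java
   // Hedged sketch: configure lazy failed-writes cleaning up front instead of
   // mutating the write config after the coordinator is started.
   Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
   conf.setString(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY.key(),
       HoodieFailedWritesCleaningPolicy.LAZY.name());
   coordinator = new StreamWriteOperatorCoordinator(conf, context);
   ```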



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7165] Flink multi writer not close the failed instant heartbeat [hudi]

2023-12-03 Thread via GitHub


ksmou commented on code in PR #10221:
URL: https://github.com/apache/hudi/pull/10221#discussion_r1413290959


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java:
##
@@ -420,6 +420,10 @@ private void initInstant(String instant) {
   }
   commitInstant(instant);
 }
+// stop the heartbeat for old instant

Review Comment:
   UT added: https://github.com/apache/hudi/pull/10230 @nsivabalan @danny0405 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]

2023-12-03 Thread via GitHub


soumilshah1995 commented on issue #10231:
URL: https://github.com/apache/hudi/issues/10231#issuecomment-1837665661

   When using this 
   ```
   hoodie.datasource.write.recordkey.field=order_id
   hoodie.datasource.write.partitionpath.field=order_date
   
hoodie.streamer.source.dfs.root=file:Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/sampledata/orders
   hoodie.datasource.write.precombine.field=ts
   
   hoodie.deltastreamer.csv.header=true
   hoodie.deltastreamer.csv.sep=\t
   
   hoodie.datasource.hive_sync.enable=true
   hoodie.datasource.hive_sync.use_jdbc=false
   hoodie.datasource.hive_sync.mode=hms
   hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1
   hoodie.datasource.hive_sync.database=default
   hoodie.datasource.hive_sync.table=orders
   
   
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
   hoodie.datasource.hive_sync.partition_fields=order_date
   hoodie.datasource.write.hive_style_partitioning=true
   
   ```
   
   
   Now I am seeing this error 
   ```
   
   Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader jdk.internal.loader.ClassLoaders$AppClassLoader@67424e82, see the next exception for details.
       at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
       at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
       ... 107 more
   Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/metastore_db.
       at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
       at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
       at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
       at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
       at java.base/java.security.AccessController.doPrivileged(Native Method)
       at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
       at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
       at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
       at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
       at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
       at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
       at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
       at org.apache.derby.impl.store.raw.RawStore$6.run(Unknown Source)
       at java.base/java.security.AccessController.doPrivileged(Native Method)
       at org.apache.derby.impl.store.raw.RawStore.bootServiceModule(Unknown Source)
       at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
       at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
       at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
       at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
       at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
       at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
       at org.apache.derby.impl.store.access.RAMAccessManager$5.run(Unknown Source)
       at java.base/java.security.AccessController.doPrivileged(Native Method)
       at org.apache.derby.impl.store.access.RAMAccessManager.bootServiceModule(Unknown Source)
       at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source)
       at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
       at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
       at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
       at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
       at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
       at org.apache.derby.impl.db.BasicDatabase$5.run(Unknown Source)
       at java.base/java.security.AccessController.doPrivileged(Native Method)
       at org.apache.derby.impl.db.BasicDatabase.bootServiceModule(Unknown Source)
       at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)
       at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
       at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
       at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
       at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
       at 

[I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]

2023-12-03 Thread via GitHub


soumilshah1995 opened a new issue, #10231:
URL: https://github.com/apache/hudi/issues/10231

   Hello everyone,
   I'm encountering a small issue that seems to be related to settings, and I 
would appreciate any guidance in identifying the problem. This pertains to my 
upcoming videos where I'm covering the Hudi Hive Sync tool in detail.
   I've started the Spark Thrift Server using the following command:
   
   ```
   
   spark-submit \
 --master 'local[*]' \
 --conf spark.executor.extraJavaOptions=-Duser.timezone=Etc/UTC \
 --conf spark.eventLog.enabled=false \
 --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
 --name "Thrift JDBC/ODBC Server" \
 --executor-memory 512m \
 --packages org.apache.spark:spark-hive_2.12:3.4.0
   ```
   
   Additionally, I have Beeline installed and connected to the default database:
   
   ```
   beeline -u jdbc:hive2://localhost:1/default
   ```
   
   While my Delta Streamer job works fine, it appears that I'm facing issues using it with the Hive MetaStore.
   Here's my Spark submit command for the Hudi Delta Streamer:
   ```
   spark-submit \
   --class org.apache.hudi.utilities.streamer.HoodieStreamer \
   --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.2' \
   --repositories 'https://repo.maven.apache.org/maven2' \
   --properties-file /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/spark-config.properties \
   --master 'local[*]' \
   --executor-memory 1g \
   /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
   --table-type COPY_ON_WRITE \
   --op UPSERT \
   --enable-hive-sync \
   --source-ordering-field ts \
   --source-class org.apache.hudi.utilities.sources.CsvDFSSource \
   --target-base-path file:///Users/soumilshah/Downloads/hudidb/ \
   --target-table orders \
   --props hudi_tbl.props
   ```
   
   Hudi CONF
   
   ```
   hoodie.datasource.write.recordkey.field=order_id
   hoodie.datasource.write.partitionpath.field=order_date
   
hoodie.streamer.source.dfs.root=file:Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/sampledata/orders
   hoodie.datasource.write.precombine.field=ts
   
   hoodie.deltastreamer.csv.header=true
   hoodie.deltastreamer.csv.sep=\t
   
   hoodie.datasource.hive_sync.enable=true
   hoodie.datasource.hive_sync.mode=jdbc
   hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1
   hoodie.datasource.hive_sync.database=default
   hoodie.datasource.hive_sync.table=orders
   hoodie.datasource.hive_sync.partition_fields=order_date
   
   ```
   
   Spark Conf:
   ```
   spark.serializer=org.apache.spark.serializer.KryoSerializer 
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog 
spark.sql.hive.convertMetastoreParquet=false
   ```
   
   The error I'm encountering is:
   ```
   Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
       at org.datanucleus.store.rdbms.table.AbstractTable.exists(AbstractTable.java:606)
       at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3385)
   ```
   
   Any assistance in identifying what might be missing or misconfigured would 
be highly appreciated.
   Thank you!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10230:
URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837597435

   
   ## CI report:
   
   * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21284)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10230:
URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837556153

   
   ## CI report:
   
   * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21284)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10230:
URL: https://github.com/apache/hudi/pull/10230#issuecomment-1837554127

   
   ## CI report:
   
   * 7f1bb03b66c2e0db151a3c8af511da953917cdd0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7168][FOLLOW-UP] Add test case for HUDI-7165 [hudi]

2023-12-03 Thread via GitHub


ksmou opened a new pull request, #10230:
URL: https://github.com/apache/hudi/pull/10230

   ### Change Logs
   
   add ut
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7168) Add test cases for HUDI-7165

2023-12-03 Thread kwang (Jira)
kwang created HUDI-7168:
---

 Summary: Add test cases for HUDI-7165
 Key: HUDI-7168
 URL: https://issues.apache.org/jira/browse/HUDI-7168
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: kwang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)