Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

via GitHub Mon, 04 Dec 2023 00:02:06 -0800


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413454783



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people 
can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL strategies, find expired partitions and call call 
`delete_partition` by themselves. As the scale of installations grew, it is 
becoming increasingly important to implement a user-friendly TTL management 
mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` is to provide a 
strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired 
partition paths to delete. And `hoodie.partition.ttl.days.retain` is the 
strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means 
that we will expire partitions that haven't been modified for this strategy 
value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in 
detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in 
the first version of partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the 
`lastModifiedTime` for each input partitions. If duration between now and 
'lastModifiedTime' for the partition is larger than what 
`hoodie.partition.ttl.days.retain` configured, 
`KeepByTimePartitionTTLManagementStrategy` will mark this partition as an 
expired partition. We use day as the unit of expired time since it is very 
common-used for datalakes. Open to ideas for this. 
+
+we will to use the largest commit time of committed file groups in the 
partition as the partition's
+`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with 
larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by replace commit, it may not reveal the real 
insert/update time for the file group. However, we can assume that we won't do 
clustering for a partition without new writes for a long time when using the 
strategy. And in the future, we may introduce a more accurate mechanism to get 
`lastModifiedTime` of a partition, for example using metadata table. 
+
+### Apply different strategies for different partitions
+
+For some specific users, they may want to apply different strategies for 
different partitions. For example, they may have multi partition 
fileds(productId, day). For partitions under `product=1` they want to keep for 
30 days while for partitions under `product=2` they want to keep for 7 days 
only. 

Review Comment:
   > We need to care about policies defined in different level, for example we 
have a,b,c,d as partition fields, can we define ttl policies in any level for 
the partition?
   
   If partition suitable for the specified policy `spec` then check 
`lastUpdateTime`, no matter on which level this partition is. I suppose it is 
more clear behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

Reply via email to