geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413444578


##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people 
can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL strategies, find expired partitions and call call 
`delete_partition` by themselves. As the scale of installations grew, it is 
becoming increasingly important to implement a user-friendly TTL management 
mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` is to provide a 
strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired 
partition paths to delete. And `hoodie.partition.ttl.days.retain` is the 
strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means 
that we will expire partitions that haven't been modified for this strategy 
value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in 
detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in 
the first version of partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the 
`lastModifiedTime` for each input partitions. If duration between now and 
'lastModifiedTime' for the partition is larger than what 
`hoodie.partition.ttl.days.retain` configured, 
`KeepByTimePartitionTTLManagementStrategy` will mark this partition as an 
expired partition. We use day as the unit of expired time since it is very 
common-used for datalakes. Open to ideas for this. 
+
+we will to use the largest commit time of committed file groups in the 
partition as the partition's
+`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with 
larger instant time will change the partition's `lastModifiedTime`.

Review Comment:
   Currently, HoodiePartitionMetadata, `.hoodie_partition_metadata` files, 
provides only `commitTime` and `partitionDepth` properties. This file is 
written during partition creation and is not updated later. We can make this 
file more usable by adding a new property, for instance, `lastUpdateTime` for 
saving time of the last upsert/delete operation on the partition with each 
commit/deltacommit/replacecommit. And use this property to handle for Partition 
TTL check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to