geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1334248487


##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people 
can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL strategies, find expired partitions and call call 
`delete_partition` by themselves. As the scale of installations grew, it is 
becoming increasingly important to implement a user-friendly TTL management 
mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` is to provide a 
strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired 
partition paths to delete. And `hoodie.partition.ttl.days.retain` is the 
strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means 
that we will expire partitions that haven't been modified for this strategy 
value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in 
detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.

Review Comment:
   Sorry to be difficult, it's just providing TTL functionality by custom 
implementation of `PartitionTTLManagementStrategy` is not user friendly.
   We want to automate detection of outdated partitions and calling 
`delete_partition`. Could we just allow user to set partition path 
specification with TTL value, and implement everything internally?
   
   From my point of view, there are two main entities in TTL:
   - object
   In our case, it's partition, we define it by using `spec`.
   - definition of outdating
   It should be time or something time-dependent. In our case, we use current 
time, `_hoodie_commit_time`, and user-defined delta value.
   
   This is a main scope for TTL, and we shouldn't allow to have more 
flexibility.
   Customized implementation of `PartitionTTLManagementStrategy` will allow to 
do anything with partitions. It could be `PartitionManagementStrategy`, but we 
shouldn't named it with `TTL` part.



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people 
can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL strategies, find expired partitions and call call 
`delete_partition` by themselves. As the scale of installations grew, it is 
becoming increasingly important to implement a user-friendly TTL management 
mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` is to provide a 
strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired 
partition paths to delete. And `hoodie.partition.ttl.days.retain` is the 
strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means 
that we will expire partitions that haven't been modified for this strategy 
value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in 
detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in 
the first version of partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the 
`lastModifiedTime` for each input partitions. If duration between now and 
'lastModifiedTime' for the partition is larger than what 
`hoodie.partition.ttl.days.retain` configured, 
`KeepByTimePartitionTTLManagementStrategy` will mark this partition as an 
expired partition. We use day as the unit of expired time since it is very 
common-used for datalakes. Open to ideas for this. 
+
+we will to use the largest commit time of committed file groups in the 
partition as the partition's
+`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with 
larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by replace commit, it may not reveal the real 
insert/update time for the file group. However, we can assume that we won't do 
clustering for a partition without new writes for a long time when using the 
strategy. And in the future, we may introduce a more accurate mechanism to get 
`lastModifiedTime` of a partition, for example using metadata table. 
+
+### Apply different strategies for different partitions
+
+For some specific users, they may want to apply different strategies for 
different partitions. For example, they may have multi partition 
fileds(productId, day). For partitions under `product=1` they want to keep for 
30 days while for partitions under `product=2` they want to keep for 7 days 
only. 
+
+For the first version of TTL management, we do not plan to implement a 
complicated strategy (For example, use an array to store strategies, introduce 
partition regex etc.). Instead, we add a new abstract method 
`getPartitionPathsForTTLManagement` in `PartitionTTLManagementStrategy` and 
provides a new config `hoodie.partition.ttl.management.partition.selected`. 
+
+If `hoodie.partition.ttl.management.partition.selected` is set, 
`getPartitionPathsForTTLManagement` will return partitions provided by this 
config. If not, `getPartitionPathsForTTLManagement` will return all partitions 
in the hudi table. 

Review Comment:
   Is it Hudi table new property? If we want to apply some default TTL for all 
partitions then it would contain list of all partitions?



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people 
can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL strategies, find expired partitions and call call 
`delete_partition` by themselves. As the scale of installations grew, it is 
becoming increasingly important to implement a user-friendly TTL management 
mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME

Review Comment:
   Do we really need to consider other policies? It's really confusing, that 
TTL (Time-To-Live) could consider anything else then time or time-dependent. In 
your first commit I found that there were KEEP_BY_SIZE and KEEP_BY_COUNT. But 
KEEP_BY_SIZE looks like we wanna delete some huge partition with a lot of data 
just because it is big, and KEEP_BY_COUNT is not suitable due to Hudi's 
partitioning model.



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people 
can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL strategies, find expired partitions and call call 
`delete_partition` by themselves. As the scale of installations grew, it is 
becoming increasingly important to implement a user-friendly TTL management 
mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` is to provide a 
strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired 
partition paths to delete. And `hoodie.partition.ttl.days.retain` is the 
strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means 
that we will expire partitions that haven't been modified for this strategy 
value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in 
detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in 
the first version of partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the 
`lastModifiedTime` for each input partitions. If duration between now and 
'lastModifiedTime' for the partition is larger than what 
`hoodie.partition.ttl.days.retain` configured, 
`KeepByTimePartitionTTLManagementStrategy` will mark this partition as an 
expired partition. We use day as the unit of expired time since it is very 
common-used for datalakes. Open to ideas for this. 
+
+we will to use the largest commit time of committed file groups in the 
partition as the partition's
+`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with 
larger instant time will change the partition's `lastModifiedTime`.

Review Comment:
   In current realization `HoodiePartitionMetadata` provides only `commitTime` 
(partition created commit time) and `partitionDepth` properties. We can add new 
`lastModifiedTime` property in `.hoodie_partition_metadata`, which is updated 
on every commit/deltacommit to corresponding partition.



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people 
can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL strategies, find expired partitions and call call 
`delete_partition` by themselves. As the scale of installations grew, it is 
becoming increasingly important to implement a user-friendly TTL management 
mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` is to provide a 
strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired 
partition paths to delete. And `hoodie.partition.ttl.days.retain` is the 
strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means 
that we will expire partitions that haven't been modified for this strategy 
value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in 
detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in 
the first version of partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the 
`lastModifiedTime` for each input partitions. If duration between now and 
'lastModifiedTime' for the partition is larger than what 
`hoodie.partition.ttl.days.retain` configured, 
`KeepByTimePartitionTTLManagementStrategy` will mark this partition as an 
expired partition. We use day as the unit of expired time since it is very 
common-used for datalakes. Open to ideas for this. 
+
+we will to use the largest commit time of committed file groups in the 
partition as the partition's
+`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with 
larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by replace commit, it may not reveal the real 
insert/update time for the file group. However, we can assume that we won't do 
clustering for a partition without new writes for a long time when using the 
strategy. And in the future, we may introduce a more accurate mechanism to get 
`lastModifiedTime` of a partition, for example using metadata table. 
+
+### Apply different strategies for different partitions
+
+For some specific users, they may want to apply different strategies for 
different partitions. For example, they may have multi partition 
fileds(productId, day). For partitions under `product=1` they want to keep for 
30 days while for partitions under `product=2` they want to keep for 7 days 
only. 
+
+For the first version of TTL management, we do not plan to implement a 
complicated strategy (For example, use an array to store strategies, introduce 
partition regex etc.). Instead, we add a new abstract method 
`getPartitionPathsForTTLManagement` in `PartitionTTLManagementStrategy` and 
provides a new config `hoodie.partition.ttl.management.partition.selected`. 
+
+If `hoodie.partition.ttl.management.partition.selected` is set, 
`getPartitionPathsForTTLManagement` will return partitions provided by this 
config. If not, `getPartitionPathsForTTLManagement` will return all partitions 
in the hudi table. 
+
+TTL management strategies will only be applied for partitions return by 
`getPartitionPathsForTTLManagement`. 
+
+Thus, if users want to apply different strategies for different partitions, 
they can do the TTL management multiple times, selecting different partitions 
and apply corresponding strategies. And we may provide a batch interface in the 
future to simplify this. 

Review Comment:
   So, if I want to implement 6 partition policies, then I should prepare 6 
`PartitionTTLManagementStrategy`s and call them 6 times?
   
   I offer to allow user set only `spec` and `TTL value` with support of `add`, 
`show`, and `delete` operations. We can also implement `dry run` for checking 
partition list.
   Then during TTL checking:
   - Iterate through a list of all partitions and compare them to `spec`.
   Later listing of all partitions could be optimized by storing it in separate 
data structure. But the most hard task would be consistency providing in the 
case of concurrent writing with new partition creating and removing some 
partitions.
   - Compare partition to the list of policies.
   This list would be ordered by TTL value in ascending or descending order. It 
would be defined by a resolution strategy for multiple policies conflict, use 
MIN_TTL or MAX_TTL.
   - Then compare current time with `lastModifiedTime`.
   Later it also could be optimized by partition related data structure for 
decreasing reading from HDFS.



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people 
can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL strategies, find expired partitions and call call 
`delete_partition` by themselves. As the scale of installations grew, it is 
becoming increasingly important to implement a user-friendly TTL management 
mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` is to provide a 
strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired 
partition paths to delete. And `hoodie.partition.ttl.days.retain` is the 
strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means 
that we will expire partitions that haven't been modified for this strategy 
value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in 
detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in 
the first version of partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the 
`lastModifiedTime` for each input partitions. If duration between now and 
'lastModifiedTime' for the partition is larger than what 
`hoodie.partition.ttl.days.retain` configured, 
`KeepByTimePartitionTTLManagementStrategy` will mark this partition as an 
expired partition. We use day as the unit of expired time since it is very 
common-used for datalakes. Open to ideas for this. 
+
+we will to use the largest commit time of committed file groups in the 
partition as the partition's
+`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with 
larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by replace commit, it may not reveal the real 
insert/update time for the file group. However, we can assume that we won't do 
clustering for a partition without new writes for a long time when using the 
strategy. And in the future, we may introduce a more accurate mechanism to get 
`lastModifiedTime` of a partition, for example using metadata table. 
+
+### Apply different strategies for different partitions
+
+For some specific users, they may want to apply different strategies for 
different partitions. For example, they may have multi partition 
fileds(productId, day). For partitions under `product=1` they want to keep for 
30 days while for partitions under `product=2` they want to keep for 7 days 
only. 

Review Comment:
   We can allow user to write specification,
   `spec = "*/product=1/*", TTL = 30 DAYS`
   `spec = "*/product=2/*, TTL = 7 DAYS`
   It also allows to set default TTL by:
   `spec = "*", TTL = 2 MONTHS`
   
   In this case we will have multiple policies for some partitons. To resolve 
conflict automatically, we could add  
`hoodie.ttl.policies.conflict.resolution.rule` property with available options:
   `MIN_TTL` - policy with min TTL value would be choosen, used in the case of 
"prefer cleaning" to minimize disk space costs,
   `MAX_TTL` - policy with max TTL value would be choosen, used in the case of 
"prefer to save data" to prevent deletion of any data in case of ambiguity.
   
   Specification is unique and can be used to drop some policies.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to