geserdugarov commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1413434722
##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which partitions are outdated and
+call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategies, find expired partitions and call call `delete_partition` by themselves. As the scale of installations grew, it is becoming increasingly important to implement a user-friendly TTL management mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME

Review Comment:
I think it would be better to implement a single kind of Partition TTL processing that takes only time into account. KEEP_BY_SIZE does not look suitable at the partition level; it is more appropriate for record-level processing. So we don't need this setting: `hoodie.partition.ttl.strategy`.
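
To make the time-only approach concrete, here is a minimal sketch of how expired partitions could be selected when time is the only criterion. It is an illustration under stated assumptions, not Hudi's API: `PartitionInfo`, `find_expired_partitions`, the `last_modified` field, and the 30-day TTL are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PartitionInfo:
    # Hypothetical record for illustration; Hudi's real partition metadata differs.
    path: str
    last_modified: datetime


def find_expired_partitions(partitions, ttl, now):
    """Return the paths of partitions last modified earlier than `now - ttl`."""
    cutoff = now - ttl
    return [p.path for p in partitions if p.last_modified < cutoff]


# Example input: one old partition and one recent partition.
now = datetime(2023, 12, 4)
partitions = [
    PartitionInfo("dt=2023-01-01", datetime(2023, 1, 1)),
    PartitionInfo("dt=2023-11-30", datetime(2023, 11, 30)),
]

# With an assumed 30-day TTL, only the January partition is selected.
expired = find_expired_partitions(partitions, timedelta(days=30), now)
print(expired)  # ['dt=2023-01-01']
```

With time as the only input, the selection reduces to a single cutoff comparison per partition, so no strategy switch is required to express it.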