Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

via GitHub Sun, 03 Dec 2023 22:54:52 -0800


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1413434722



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi, people 
can config the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can config their ttl strategies through write configs and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL strategies, find expired partitions and call call 
`delete_partition` by themselves. As the scale of installations grew, it is 
becoming increasingly important to implement a user-friendly TTL management 
mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies is similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME

Review Comment:
   I suppose it's better to implement only one processing for Partition TTL 
with accounting time only. For KEEP_BY_SIZE partition level looks not suitable. 
It's appropriate for record level processing.
   So, we don't need this setting: `hoodie.partition.ttl.strategy`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

Reply via email to