stream2000 commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1399985395
########## rfc/rfc-65/rfc-65.md: ##########
@@ -0,0 +1,248 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly to keep, so we need a lifecycle management mechanism to prevent the dataset
+from growing infinitely.
+
+This proposal introduces a partition lifecycle management service to Hudi, similar to other table services such as
+Clean/Compaction/Clustering. Users can configure partition lifecycle management strategies through write configs,
+and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A lifecycle management mechanism is an important feature for databases. Hudi already provides a `delete_partition`
+interface to delete outdated partitions. However, users still need to detect which partitions are outdated and
+call `delete_partition` manually, which means users have to define and implement some kind of partition lifecycle
+management strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations
+grows, it becomes increasingly important to provide a user-friendly lifecycle management mechanism in Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition lifecycle management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job, or as a
+  synchronous or asynchronous table service.
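The KEEP_BY_TIME idea can be sketched as a minimal, self-contained expiry check. This is a hypothetical simplification for illustration only, not the actual Hudi implementation: the class and method names are invented, and it assumes only that Hudi instant times begin with a `yyyyMMdd` date prefix.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of the KEEP_BY_TIME expiry rule: a partition is expired
// when its last modification happened more than `daysRetain` days before `now`.
public class KeepByTimeSketch {

  static boolean isExpired(String lastModifiedInstant, LocalDate now, long daysRetain) {
    // Hudi instant times start with a yyyyMMdd date, e.g. "20230301101530".
    LocalDate lastModified =
        LocalDate.parse(lastModifiedInstant.substring(0, 8), DateTimeFormatter.BASIC_ISO_DATE);
    return ChronoUnit.DAYS.between(lastModified, now) > daysRetain;
  }

  public static void main(String[] args) {
    LocalDate now = LocalDate.of(2023, 3, 20);
    System.out.println(isExpired("20230301101530", now, 10)); // 19 days old -> true
    System.out.println(isExpired("20230315101530", now, 10)); // 5 days old -> false
  }
}
```

A real strategy would derive `lastModifiedTime` from the table's commit timeline rather than a single string, as described below.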
+
+### Strategy Definition
+
+Lifecycle strategies are similar to existing table service strategies. We can define a lifecycle strategy the same way
+we define a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.lifecycle.management.strategy=KEEP_BY_TIME
+hoodie.partition.lifecycle.management.strategy.class=org.apache.hudi.table.action.lifecycle.strategy.KeepByTimePartitionLifecycleManagementStrategy
+hoodie.partition.lifecycle.days.retain=10
+```
+
+The config `hoodie.partition.lifecycle.management.strategy.class` provides a strategy class (a subclass
+of `PartitionLifecycleManagementStrategy`) that determines the expired partition paths to delete,
+and `hoodie.partition.lifecycle.days.retain` is the retention value used
+by `KeepByTimePartitionLifecycleManagementStrategy`: partitions that have not been modified for more than this number
+of days are considered expired. We will cover the `KeepByTimePartitionLifecycleManagementStrategy` strategy in detail
+in the next section.
+
+The core definition of `PartitionLifecycleManagementStrategy` looks like this:
+
+```java
+/**
+ * Strategy for partition-level lifecycle management.
+ */
+public abstract class PartitionLifecycleManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition lifecycle management strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionLifecycleManagementStrategy`, and Hudi will delete the expired
+partitions it returns.
+
+### KeepByTimePartitionLifecycleManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionLifecycleManagementStrategy` in the first version of the
+partition lifecycle management implementation.
+
+The `KeepByTimePartitionLifecycleManagementStrategy` will calculate the `lastModifiedTime` for each input partition.
+If the duration between now and a partition's `lastModifiedTime` is larger than the configured
+`hoodie.partition.lifecycle.days.retain`, `KeepByTimePartitionLifecycleManagementStrategy` will mark the partition as
+expired. We use days as the unit of expiration time since it is commonly used for data lakes; we are open to ideas
+here.
+
+We will use the largest commit time among committed file groups in the partition as the partition's
+`lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the
+partition's `lastModifiedTime`.
+
+For file groups generated by a replace commit, the commit time may not reveal the real insert/update time of the file
+group. However, when using this strategy we can assume that we won't cluster a partition that has had no new writes
+for a long time. In the future, we may introduce a more accurate mechanism to get the `lastModifiedTime` of a
+partition, for example using the metadata table.
+
+### Apply different strategies for different partitions
+
+Some users may want to apply different strategies to different partitions. For example, they may have multiple
+partition fields (productId, day): for partitions under `productId=1` they want to keep data for 30 days, while for
+partitions under `productId=2` they want to keep data for 7 days only.
+
+For the first version of partition lifecycle management, we do not plan to implement a complicated strategy (for
+example, using an array to store strategies, introducing partition regexes, etc.). Instead, we add a new abstract
+method `getPartitionPathsForLifecycleManagement` to `PartitionLifecycleManagementStrategy` and provide a new
+config `hoodie.partition.lifecycle.management.partition.selected`.
+
+If `hoodie.partition.lifecycle.management.partition.selected` is set, `getPartitionPathsForLifecycleManagement` will
+return the partitions provided by this config.
+If not, `getPartitionPathsForLifecycleManagement` will return all partitions in the Hudi table.
+
+Lifecycle management strategies will only be applied to partitions returned by
+`getPartitionPathsForLifecycleManagement`.
+
+Thus, if users want to apply different strategies to different partitions, they can run partition lifecycle management
+multiple times, selecting different partitions and applying the corresponding strategy each time. We may provide a
+batch interface in the future to simplify this.
+
+The `getPartitionPathsForLifecycleManagement` method will look like this:
+
+```java
+/**
+ * Strategy for partition-level lifecycle management.
+ */
+public abstract class PartitionLifecycleManagementStrategy {
+  /**
+   * Scan and list all partitions for partition lifecycle management.
+   *
+   * @return Partitions to apply the lifecycle management strategy to.
+   */
+  protected List<String> getPartitionPathsForLifecycleManagement() {
+    if (StringUtils.isNullOrEmpty(config.getLifecycleManagementPartitionSelected())) {
+      // No partitions selected explicitly: return all partition paths in the table
+      return FSUtils.getAllPartitionPaths(hoodieTable.getContext(), config.getMetadataConfig(), config.getBasePath());
+    } else {
+      // Return only the partitions selected via the `partition.selected` config
+      return getMatchedPartitions();
+    }
+  }
+}
+```
+
+### Executing partition lifecycle management
+
+Once we have a proper `PartitionLifecycleManagementStrategy` implementation, it is easy to execute partition lifecycle
+management.
+
+```java
+public class SparkPartitionLifecycleManagementActionExecutor<T> extends BaseSparkCommitActionExecutor<T> {
+  @Override
+  public HoodieWriteMetadata<HoodieData<WriteStatus>> execute() {
+    // Construct the configured PartitionLifecycleManagementStrategy
+    PartitionLifecycleManagementStrategy strategy = (PartitionLifecycleManagementStrategy) ReflectionUtils.loadClass(
+        PartitionLifecycleManagementStrategy.checkAndGetPartitionLifecycleManagementStrategy(config),
+        new Class<?>[] {HoodieTable.class, HoodieEngineContext.class, HoodieWriteConfig.class}, table, context, config);
+
+    // Get expired partition paths
+    List<String> expiredPartitions = strategy.getExpiredPartitionPaths();
+
+    // Delete them by reusing SparkDeletePartitionCommitActionExecutor
+    return new SparkDeletePartitionCommitActionExecutor<>(context, config, table, instantTime,
+        expiredPartitions).execute();
+  }
+}
+```
+
+We will add a new method `managePartitionLifecycle` to `HoodieTable`, and `HoodieSparkCopyOnWriteTable` can implement
+it like this:
+
+```java
+@Override
+public HoodieWriteMetadata<HoodieData<WriteStatus>> managePartitionLifecycle(HoodieEngineContext context, String instantTime) {
+  return new SparkPartitionLifecycleManagementActionExecutor<>(context, config, this, instantTime).execute();
+}
+```
+
+We can call `hoodieTable.managePartitionLifecycle` in an independent Flink/Spark job, or in sync/async inline table
+services such as clustering/compaction/clean.
+
+### User interface for partition lifecycle management
+
+We can run partition lifecycle management inline with a streaming ingestion job, or run it as an independent batch
+job, on both the Spark and Flink engines.
+
+#### Run inline with Streaming Ingestion
+
+We can run clustering inline with a streaming ingestion job through the following config:
+
+```properties
+hoodie.clustering.async.enabled=true
+hoodie.clustering.async.max.commits=5
+```
+
+We can do a similar thing for partition lifecycle management.
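As a sketch, an inline/async lifecycle management setup could combine the keys already proposed in this RFC (no new keys are assumed here; only options defined elsewhere in this document are used):

```properties
# Run lifecycle management asynchronously as writes happen on the table
hoodie.partition.lifecycle.management.async.enabled=true
# Strategy configuration, as defined in the Strategy Definition section
hoodie.partition.lifecycle.management.strategy=KEEP_BY_TIME
hoodie.partition.lifecycle.days.retain=10
```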
+The configs for async partition lifecycle management are:
+
+| Config key                                          | Remarks                                                                                       | Default |
+|-----------------------------------------------------|-----------------------------------------------------------------------------------------------|---------|
+| hoodie.partition.lifecycle.management.async.enabled | Enable running the lifecycle management service asynchronously as writes happen on the table. | False   |

Review Comment:
   Yes, we will use the `delete_partition` command to delete the expired partitions; this is mentioned in the `Background` section. Conflict resolution for this operation is the same as for the `delete_partition` command.

########## rfc/rfc-65/rfc-65.md: ##########
@@ -0,0 +1,248 @@
+hoodie.partition.lifecycle.management.strategy=KEEP_BY_TIME
+hoodie.partition.lifecycle.management.strategy.class=org.apache.hudi.table.action.lifecycle.strategy.KeepByTimePartitionLifecycleManagementStrategy

Review Comment:
   Using the word `ttl` means that we can't apply a time-unrelated partition deletion strategy in the future. However, it is more suitable for the current implementation since we only provide a time-based strategy. What do you think: should we use `ttl` directly, or retain some extensibility?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org