stream2000 commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1399985395
########## rfc/rfc-65/rfc-65.md: ##########
@@ -0,0 +1,248 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition data by time and are only interested in data from a recent period.
+The outdated data is useless and costly to keep, so we need a lifecycle management mechanism to prevent the dataset
+from growing infinitely.
+
+This proposal introduces a partition lifecycle management service to Hudi, similar to other table services such as
+Clean/Compaction/Clustering. Users can configure partition lifecycle management strategies through write configs,
+and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A lifecycle management mechanism is an important feature for databases. Hudi already provides a `delete_partition`
+interface to delete outdated partitions. However, users still need to detect which partitions are outdated and
+call `delete_partition` manually, which means users have to define and implement some kind of partition lifecycle
+management strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations
+grows, it becomes increasingly important to provide a user-friendly lifecycle management mechanism in Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition lifecycle management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job, or as a
+  synchronous or asynchronous table service.
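The KEEP_BY_TIME idea can be sketched as a minimal, self-contained expiry check. This is a hypothetical simplification for illustration only, not the actual Hudi implementation: the class and method names are invented, and it assumes only that Hudi instant times begin with a `yyyyMMdd` date prefix.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of the KEEP_BY_TIME expiry rule: a partition is expired
// when its last modification happened more than `daysRetain` days before `now`.
public class KeepByTimeSketch {

  static boolean isExpired(String lastModifiedInstant, LocalDate now, long daysRetain) {
    // Hudi instant times start with a yyyyMMdd date, e.g. "20230301101530".
    LocalDate lastModified =
        LocalDate.parse(lastModifiedInstant.substring(0, 8), DateTimeFormatter.BASIC_ISO_DATE);
    return ChronoUnit.DAYS.between(lastModified, now) > daysRetain;
  }

  public static void main(String[] args) {
    LocalDate now = LocalDate.of(2023, 3, 20);
    System.out.println(isExpired("20230301101530", now, 10)); // 19 days old -> true
    System.out.println(isExpired("20230315101530", now, 10)); // 5 days old -> false
  }
}
```

A real strategy would derive `lastModifiedTime` from the table's commit timeline rather than a single string, as described below.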
+
+### Strategy Definition
+
+Lifecycle strategies are similar to existing table service strategies. We can define a lifecycle strategy the same way
+we define a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.lifecycle.management.strategy=KEEP_BY_TIME
+hoodie.partition.lifecycle.management.strategy.class=org.apache.hudi.table.action.lifecycle.strategy.KeepByTimePartitionLifecycleManagementStrategy
+hoodie.partition.lifecycle.days.retain=10
+```
+
+The config `hoodie.partition.lifecycle.management.strategy.class` provides a strategy class (a subclass
+of `PartitionLifecycleManagementStrategy`) that determines the expired partition paths to delete,
+and `hoodie.partition.lifecycle.days.retain` is the retention value used
+by `KeepByTimePartitionLifecycleManagementStrategy`: partitions that have not been modified for more than this number
+of days are considered expired. We will cover the `KeepByTimePartitionLifecycleManagementStrategy` strategy in detail
+in the next section.
+
+The core definition of `PartitionLifecycleManagementStrategy` looks like this:
+
+```java
+/**
+ * Strategy for partition-level lifecycle management.
+ */
+public abstract class PartitionLifecycleManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition lifecycle management strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionLifecycleManagementStrategy`, and Hudi will delete the expired
+partitions it returns.
+
+### KeepByTimePartitionLifecycleManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionLifecycleManagementStrategy` in the first version of the
+partition lifecycle management implementation.
+
+The `KeepByTimePartitionLifecycleManagementStrategy` will calculate the `lastModifiedTime` for each input partition.
+If the duration between now and a partition's `lastModifiedTime` is larger than the configured
+`hoodie.partition.lifecycle.days.retain`, `KeepByTimePartitionLifecycleManagementStrategy` will mark the partition as
+expired. We use days as the unit of expiration time since it is commonly used for data lakes; we are open to ideas
+here.
+
+We will use the largest commit time among committed file groups in the partition as the partition's
+`lastModifiedTime`, so any write (including normal DMLs, clustering, etc.) with a larger instant time will change the
+partition's `lastModifiedTime`.
+
+For file groups generated by a replace commit, the commit time may not reveal the real insert/update time of the file
+group. However, when using this strategy we can assume that we won't cluster a partition that has had no new writes
+for a long time. In the future, we may introduce a more accurate mechanism to get the `lastModifiedTime` of a
+partition, for example using the metadata table.
+
+### Apply different strategies for different partitions
+
+Some users may want to apply different strategies to different partitions. For example, they may have multiple
+partition fields (productId, day): for partitions under `productId=1` they want to keep data for 30 days, while for
+partitions under `productId=2` they want to keep data for 7 days only.
+
+For the first version of partition lifecycle management, we do not plan to implement a complicated strategy (for
+example, using an array to store strategies, introducing partition regexes, etc.). Instead, we add a new abstract
+method `getPartitionPathsForLifecycleManagement` to `PartitionLifecycleManagementStrategy` and provide a new
+config `hoodie.partition.lifecycle.management.partition.selected`.
+
+If `hoodie.partition.lifecycle.management.partition.selected` is set, `getPartitionPathsForLifecycleManagement` will
+return the partitions provided by this config.
+If not, `getPartitionPathsForLifecycleManagement` will return all partitions in the Hudi table.
+
+Lifecycle management strategies will only be applied to partitions returned by
+`getPartitionPathsForLifecycleManagement`.
+
+Thus, if users want to apply different strategies to different partitions, they can run partition lifecycle management
+multiple times, selecting different partitions and applying the corresponding strategy each time. We may provide a
+batch interface in the future to simplify this.
+
+The `getPartitionPathsForLifecycleManagement` method will look like this:
+
+```java
+/**
+ * Strategy for partition-level lifecycle management.
+ */
+public abstract class PartitionLifecycleManagementStrategy {
+  /**
+   * Scan and list all partitions for partition lifecycle management.
+   *
+   * @return Partitions to apply the lifecycle management strategy to.
+   */
+  protected List<String> getPartitionPathsForLifecycleManagement() {
+    if (StringUtils.isNullOrEmpty(config.getLifecycleManagementPartitionSelected())) {
+      // No partitions selected explicitly: return all partition paths in the table
+      return FSUtils.getAllPartitionPaths(hoodieTable.getContext(), config.getMetadataConfig(), config.getBasePath());
+    } else {
+      // Return only the partitions selected via the `partition.selected` config
+      return getMatchedPartitions();
+    }
+  }
+}
+```
+
+### Executing partition lifecycle management
+
+Once we have a proper `PartitionLifecycleManagementStrategy` implementation, it is easy to execute partition lifecycle
+management.
+
+```java
+public class SparkPartitionLifecycleManagementActionExecutor<T> extends BaseSparkCommitActionExecutor<T> {
+  @Override
+  public HoodieWriteMetadata<HoodieData<WriteStatus>> execute() {
+    // Construct the configured PartitionLifecycleManagementStrategy
+    PartitionLifecycleManagementStrategy strategy = (PartitionLifecycleManagementStrategy) ReflectionUtils.loadClass(
+        PartitionLifecycleManagementStrategy.checkAndGetPartitionLifecycleManagementStrategy(config),
+        new Class<?>[] {HoodieTable.class, HoodieEngineContext.class, HoodieWriteConfig.class}, table, context, config);
+
+    // Get expired partition paths
+    List<String> expiredPartitions = strategy.getExpiredPartitionPaths();
+
+    // Delete them by reusing SparkDeletePartitionCommitActionExecutor
+    return new SparkDeletePartitionCommitActionExecutor<>(context, config, table, instantTime,
+        expiredPartitions).execute();
+  }
+}
+```
+
+We will add a new method `managePartitionLifecycle` to `HoodieTable`, and `HoodieSparkCopyOnWriteTable` can implement
+it like this:
+
+```java
+@Override
+public HoodieWriteMetadata<HoodieData<WriteStatus>> managePartitionLifecycle(HoodieEngineContext context, String instantTime) {
+  return new SparkPartitionLifecycleManagementActionExecutor<>(context, config, this, instantTime).execute();
+}
+```
+
+We can call `hoodieTable.managePartitionLifecycle` in an independent Flink/Spark job, or in sync/async inline table
+services such as clustering/compaction/clean.
+
+### User interface for partition lifecycle management
+
+We can run partition lifecycle management inline with a streaming ingestion job, or run it as an independent batch
+job, on both the Spark and Flink engines.
+
+#### Run inline with Streaming Ingestion
+
+We can run clustering inline with a streaming ingestion job through the following config:
+
+```properties
+hoodie.clustering.async.enabled=true
+hoodie.clustering.async.max.commits=5
+```
+
+We can do a similar thing for partition lifecycle management.
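As a sketch, an inline/async lifecycle management setup could combine the keys already proposed in this RFC (no new keys are assumed here; only options defined elsewhere in this document are used):

```properties
# Run lifecycle management asynchronously as writes happen on the table
hoodie.partition.lifecycle.management.async.enabled=true
# Strategy configuration, as defined in the Strategy Definition section
hoodie.partition.lifecycle.management.strategy=KEEP_BY_TIME
hoodie.partition.lifecycle.days.retain=10
```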
+The configs for async partition lifecycle management are:
+
+| Config key                                          | Remarks                                                                                       | Default |
+|-----------------------------------------------------|-----------------------------------------------------------------------------------------------|---------|
+| hoodie.partition.lifecycle.management.async.enabled | Enable running the lifecycle management service asynchronously as writes happen on the table. | False   |

Review Comment:
   Yes, we will use the `delete_partition` command to delete the expired partitions; this is mentioned in the `Background` section. Conflict resolution for this operation is the same as for the `delete_partition` command.

########## rfc/rfc-65/rfc-65.md: ##########
@@ -0,0 +1,248 @@
+hoodie.partition.lifecycle.management.strategy=KEEP_BY_TIME
+hoodie.partition.lifecycle.management.strategy.class=org.apache.hudi.table.action.lifecycle.strategy.KeepByTimePartitionLifecycleManagementStrategy

Review Comment:
   Using the word `ttl` means that we can't apply a time-unrelated partition deletion strategy in the future. However, it is more suitable for the current implementation since we only provide a time-based strategy. What do you think: should we use `ttl` directly, or retain some extensibility?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org