Yue Zhang created HUDI-3038:
-------------------------------

             Summary: Comprehensive mechanism around cleaning the archived timeline
                 Key: HUDI-3038
                 URL: https://issues.apache.org/jira/browse/HUDI-3038
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Yue Zhang

At present, Hudi's archive files grow indefinitely, and the problem is more serious on a DFS that does not support append. Since PR https://github.com/apache/hudi/pull/4078, users have a way to trim the archive files so that they do not keep growing without bound. But as the documentation says:

*WARNING: do not use this config unless you know what you're doing. If enabled, details of older archived instants are deleted, resulting in information loss in the archived timeline, which may affect tools like CLI and repair. Only enable this if you hit severe performance issues for retrieving archived timeline.*

So we need a more comprehensive mechanism for cleaning the archived timeline:

(1) Rewrite the archived timeline content into a smaller number of files.
(2) When deleting archived files, make sure the table no longer has any base or log files that correspond to the instants they contain, so that there is essentially no loss of information about the table's state.

Since these two operations are pretty heavy, maybe we could build a new standalone tool to perform them instead of an inline service.
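
To illustrate the safety check in (2), here is a minimal sketch in Python. It is not Hudi code: the function name `deletable_archive_files`, the archive file names, and the instant times are all hypothetical, and it assumes that the set of instants still referenced by base or log files has already been collected by some other means. An archive file is treated as safe to delete only when none of the instants it contains is still referenced.

```python
from typing import Dict, List, Set


def deletable_archive_files(
    archived: Dict[str, Set[str]],
    live_instants: Set[str],
) -> List[str]:
    """Return the archive files whose instants are all unreferenced.

    archived:      hypothetical mapping of archive file name -> instant
                   times stored in that file.
    live_instants: instant times that some base or log file in the table
                   still belongs to (assumed to be computed elsewhere).
    """
    safe = []
    for name, instants in archived.items():
        # isdisjoint is True only when no instant in this archive file
        # is still referenced by a data file.
        if instants.isdisjoint(live_instants):
            safe.append(name)
    return sorted(safe)


# Hypothetical example data, not real Hudi file names.
archived = {
    "archive_1": {"001", "002"},
    "archive_2": {"003", "004"},
    "archive_3": {"005"},
}
live = {"004", "005"}

print(deletable_archive_files(archived, live))
```

In this example only `archive_1` is reported as deletable, because `archive_2` and `archive_3` each contain an instant that a data file still references. A real tool would additionally have to list the table's base and log files to build the set of live instants, which is the heavy part of the operation.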