bhavya-ganatra opened a new issue, #18686:
URL: https://github.com/apache/hudi/issues/18686

   ### Task Description
   
   **What needs to be done:**
   This issue is based on a discussion in the Hudi Slack channel regarding 
performance degradation with a large active timeline: 
https://apache-hudi.slack.com/archives/C4D716NPQ/p1774948526496729
   
   Propose and/or implement a solution to decouple savepoints from timeline 
archival (e.g., enable archival without losing restore capability).
   
   Additionally, update documentation to clearly state the impact of 
`hoodie.archive.beyond.savepoint` on savepoint restore behaviour.
   
   
   **Why this task is needed:**
   
   We are running a streaming pipeline writing to multiple Hudi MOR tables with:
   - Async compaction and cleaner
   - Commit frequency: every 5 minutes
   - Savepoints retained for 7 days (1 per 24 hours)
   
   Savepoints are required for our backup/recovery strategy and cannot be 
reduced. However, savepoints block archival of commits in the timeline, leading 
to continuous timeline growth and noticeable performance degradation in both 
reads and writes.
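
A rough sketch of the scale involved, using the numbers above. It assumes archival cannot move past the oldest retained savepoint, so every commit made since then stays in the active timeline. The function name is hypothetical, and the count ignores compaction and clean instants, so it is only a lower bound per table.

```python
from datetime import timedelta

# Values taken from the workload described above.
COMMIT_INTERVAL = timedelta(minutes=5)
SAVEPOINT_RETENTION = timedelta(days=7)


def commits_pinned_by_oldest_savepoint(commit_interval, savepoint_retention):
    """Lower bound on commits kept in the active timeline for one table.

    Assumption: no commit newer than the oldest savepoint can be archived.
    """
    # Dividing one timedelta by another with // yields an integer count.
    return savepoint_retention // commit_interval


pinned = commits_pinned_by_oldest_savepoint(COMMIT_INTERVAL, SAVEPOINT_RETENTION)
print(pinned)  # 2016
```

So each table carries at least about two thousand commit instants in its active timeline, before counting compaction and clean instants.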
   
   Currently, the config `hoodie.archive.beyond.savepoint` allows archival 
beyond savepoints, but at the cost of losing savepoint restore capability 
(i.e., savepoints become non-recoverable). See: 
https://github.com/apache/hudi/pull/6239
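
For reference, a minimal sketch of how this option might be passed to a Spark writer today. The table name is hypothetical and the other keys are only there for context; enabling the flag trades restore capability for a shorter active timeline, as described above.

```python
# Hypothetical writer options for one MOR table; "events_mor" is a made-up name.
hudi_options = {
    "hoodie.table.name": "events_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Lets archival proceed past savepointed commits; per the discussion above,
    # the affected savepoints can then no longer be restored.
    "hoodie.archive.beyond.savepoint": "true",
}

# Typical usage with a Spark DataFrame `df` (not executed here):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```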
   
   To resolve this, savepoints need to be decoupled from the timeline 
archival process, so that restore capability is retained without significant 
performance degradation.
   
   FYI: this task was already tracked in Jira: 
https://issues.apache.org/jira/browse/HUDI-4500. Since Hudi has moved to 
GitHub Issues, I am re-filing it here.
   
   
   ### Task Type
   
   Performance optimization
   
   ### Related Issues
   
   **Parent feature issue:**  https://issues.apache.org/jira/browse/HUDI-4500
   **Related issues:**  https://issues.apache.org/jira/browse/HUDI-4501


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.