[ 
https://issues.apache.org/jira/browse/HUDI-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1276:
-------------------------
    Description: 
We clean replaced file groups during archival as part of PR#2048. But we may 
want to do this during the clean stage to prevent storage overhead.

Outstanding questions:
1) With KEEP_LATEST_VERSIONS, when is a replaced file eligible to clean? Assume 
file group 'f1' has file slices f1_c1 and f1_c2, and 'f1' is then replaced by 
other file groups. If KEEP_LATEST_VERSIONS=2, when can we delete f1_c1 and 
f1_c2?

Options:
* We can introduce a new policy to delete replaced files. For example, we could 
fall back to KEEP_LATEST_COMMITS for replaced files.
* Build a 'slice' across file groups. If we know the new files that are 
replacing 'f1', then we can treat them as a single slice and delete the oldest 
versions. This can get really complicated because f1 can be replaced by 
multiple file groups, which can then be replaced by some other file groups.
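
A minimal sketch of the first option (falling back to KEEP_LATEST_COMMITS for 
replaced files). This is only an illustration, not the Hudi implementation: 
ReplacedFileGroup, eligible_replaced_groups and commits_retained are 
hypothetical names, instant times are assumed to compare lexicographically, 
and a replaced file group is assumed to be eligible only when its replace 
commit is strictly older than the earliest retained commit.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ReplacedFileGroup:
    """Hypothetical record for a file group that a replace commit superseded."""
    file_id: str
    slice_commits: Tuple[str, ...]  # commit times of its slices, e.g. c1, c2
    replaced_at: str                # instant time of the replace commit


def eligible_replaced_groups(
    completed_commits: Sequence[str],
    replaced: Sequence[ReplacedFileGroup],
    commits_retained: int,
) -> List[ReplacedFileGroup]:
    """Return replaced file groups whose slices can all be deleted.

    completed_commits must be sorted ascending. The rule used here is an
    assumption: a group is eligible once its replace commit is strictly
    older than the earliest commit we still retain.
    """
    if commits_retained < 1:
        raise ValueError("commits_retained must be at least 1")
    if len(completed_commits) <= commits_retained:
        return []  # nothing has fallen out of the retention window yet
    earliest_retained = completed_commits[-commits_retained]
    return [g for g in replaced if g.replaced_at < earliest_retained]


commits = ["001", "002", "003", "004", "005"]
f1 = ReplacedFileGroup("f1", ("001", "002"), replaced_at="003")
f2 = ReplacedFileGroup("f2", ("002",), replaced_at="005")
for group in eligible_replaced_groups(commits, [f1, f2], commits_retained=2):
    print(group.file_id, "can be cleaned, slices:", group.slice_commits)
```

With this rule the number of versions of 'f1' does not matter: f1_c1 and f1_c2 
are deleted together once the replace commit falls out of the retained window.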

2) If there is a savepoint on the fileId that is eligible to clean, can we 
delete it?
Options:
* Do not delete the file. Clean and archival cannot make progress. We need a 
mechanism to notify that clean and archival are blocked.
* Ignore savepoints and delete the file. This breaks the savepoint contract. 
(This is the current behavior when deleting files during archival.)
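
The first savepoint option could look roughly like the sketch below. It is an 
illustration only: split_by_savepoint and its arguments are hypothetical, and 
the set of savepointed file ids is assumed to be available to the cleaner. 
Files under a savepoint are kept and returned separately so that the caller 
can report that clean and archival are blocked on them.

```python
from typing import Iterable, List, Set, Tuple


def split_by_savepoint(
    eligible_file_ids: Iterable[str],
    savepointed_file_ids: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split clean candidates into (deletable, blocked_by_savepoint).

    Hypothetical helper: a file id referenced by any savepoint is never
    deleted; it is reported back instead.
    """
    deletable: List[str] = []
    blocked: List[str] = []
    for file_id in eligible_file_ids:
        if file_id in savepointed_file_ids:
            blocked.append(file_id)
        else:
            deletable.append(file_id)
    return deletable, blocked


deletable, blocked = split_by_savepoint(["f1", "f2", "f3"], {"f2"})
if blocked:
    print("clean/archival blocked by savepoint on:", blocked)
print("safe to delete:", deletable)
```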

3) If there is a pending/inflight compaction on the fileId that is eligible to 
clean, can we delete it? What happens to the scheduled compaction if we delete 
it?
* This is unlikely to happen because we do not replace files that have a 
pending compaction. Also, after a file is replaced, it is not visible to 
compaction, so no further compaction can be scheduled on it. However, if for 
any reason we see replaced files that have a pending compaction and are 
eligible to clean, it is probably better to block clean and archival.
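
A sketch of the "block clean and archival" guard for this case. The names 
CleanBlockedError and check_no_pending_compaction are hypothetical and not 
part of Hudi; the sketch assumes the cleaner can obtain the file ids that 
have a pending or inflight compaction.

```python
from typing import Iterable, List


class CleanBlockedError(Exception):
    """Hypothetical error raised when clean must not proceed."""


def check_no_pending_compaction(
    replaced_file_ids: Iterable[str],
    pending_compaction_file_ids: Iterable[str],
) -> List[str]:
    """Return the replaced file ids if none has a pending compaction.

    Raises CleanBlockedError otherwise, so clean and archival stop
    instead of deleting files that a scheduled compaction still expects.
    """
    replaced = list(replaced_file_ids)
    conflicts = sorted(set(replaced) & set(pending_compaction_file_ids))
    if conflicts:
        raise CleanBlockedError(
            "replaced files with pending compaction: " + ", ".join(conflicts)
        )
    return replaced


try:
    check_no_pending_compaction(["f1", "f2"], ["f2", "f9"])
except CleanBlockedError as err:
    print("clean blocked:", err)
```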



  was: We clean replaced file groups during archival as part of PR#2048. But we 
may want to do this during the clean stage to prevent storage overhead.


> delete replaced file groups during clean
> ----------------------------------------
>
>                 Key: HUDI-1276
>                 URL: https://issues.apache.org/jira/browse/HUDI-1276
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: satish
>            Assignee: satish
>            Priority: Major
>             Fix For: 0.7.0



