[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-07-05 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7779:
--
Status: Patch Available  (was: In Progress)

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Archiving commits from the active timeline could lead to data consistency issues 
> on rare occasions. We should come up with proper guards to ensure 
> we do not make such unintended archivals. 
>  
> The major gap we wanted to guard against is:
> if someone disables the cleaner, archival should account for data consistency 
> issues and ensure it bails out.
> We have a base guarding condition, where archival will stop at the earliest 
> commit to retain based on the latest clean commit metadata. But there are a few 
> other scenarios that need to be accounted for. 
>  
> a. Keeping replace commits aside, let's dive into specifics for regular 
> commits and delta commits.
> Say the user configured the cleaner to retain 4 commits and the archival 
> configs to 5 and 6. After t10, the cleaner is supposed to clean up all file 
> versions created at or before t6. Say the cleaner did not run (for whatever 
> reason) for the next 5 commits. 
>     Archival will certainly be guarded up to the earliest commit to retain 
> based on the latest clean commit metadata. 
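The retention arithmetic above can be sketched as follows (a toy illustration with hypothetical helper names, not Hudi APIs; it ignores archival's own min/max settings):

```python
# Toy illustration of the base guard (hypothetical helper names, not Hudi
# APIs). With the cleaner retaining 4 commits, after t10 the earliest
# commit to retain recorded in the latest clean metadata is t7, so every
# file version created at or before t6 is eligible for cleaning -- and
# archival must stop before t7.

def earliest_commit_to_retain(latest_commit: int, commits_retained: int) -> int:
    """ECTR keeps the last `commits_retained` commits: t7..t10 here."""
    return latest_commit - commits_retained + 1

def latest_archivable(ectr: int) -> int:
    """Archival is guarded at ECTR: it may only touch commits strictly before it."""
    return ectr - 1

ectr = earliest_commit_to_retain(10, 4)
print(ectr, latest_archivable(ectr))   # 7 6
```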
> Corner case to consider: 
> A savepoint was added at, say, t3 and later removed, and still the cleaner was 
> never re-enabled. Even though archival would have stopped at t3 (while the 
> savepoint was present), once the savepoint is removed, if archival is executed, 
> it could archive commit t3. This means the file versions tracked at t3 are 
> still not yet cleaned by the cleaner. 
> Reasoning: 
> We are good here w.r.t. data consistency. Until the cleaner runs next, these 
> older file versions might be exposed to the end user. But time-travel queries 
> are not intended for already-cleaned-up commits, and hence this is not an 
> issue. None of snapshot, time-travel, or incremental queries will run into 
> issues, as they are not supposed to poll for t3. 
> At any later point, if the cleaner is re-enabled, it will take care of cleaning 
> up the file versions tracked at commit t3. It is just that, for the interim 
> period, some older file versions might still be exposed to readers. 
>  
> b. The trickier part is when replace commits are involved. Since the replace 
> commit metadata in the active timeline is what ensures the replaced file 
> groups are ignored for reads, the cleaner is expected to clean them up fully 
> before the replace commit is archived. But are there chances that this could 
> go wrong? 
> Corner case to consider: let's add onto the above scenario, where t3 has a 
> savepoint, and t4 is a replace commit which replaced file groups tracked in 
> t3. 
> The cleaner will skip cleaning up files tracked by t3 (due to the presence of 
> the savepoint), but will clean up t4, t5 and t6. So the earliest commit to 
> retain will point to t6. Now say the savepoint for t3 is removed, but the 
> cleaner stays disabled. In this state of the timeline, if archival is executed 
> (since t3's savepoint is removed), it might archive t3 and t4.rc. This could 
> lead to data duplicates, as both the replaced file groups and the new file 
> groups from t4.rc would be exposed as valid file groups. 
>  
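To see why archiving t4.rc is dangerous, here is a minimal, hypothetical model (plain Python sets, not Hudi code) of how readers derive the valid file groups from replace commit metadata:

```python
# Readers ignore replaced file groups only while the replace commit's
# metadata is still present in the active timeline (toy model, not Hudi code).

active_timeline = {"t3", "t4.rc", "t5", "t6"}
file_groups = {"fg1_old", "fg1_new"}           # fg1_old was replaced by fg1_new at t4
replaced_by_commit = {"t4.rc": {"fg1_old"}}    # t4.rc's replacecommit metadata

def visible_groups(timeline):
    replaced = set()
    for rc, groups in replaced_by_commit.items():
        if rc in timeline:                     # metadata still readable
            replaced |= groups
    return file_groups - replaced

print(sorted(visible_groups(active_timeline)))  # ['fg1_new'] -- correct view

# Savepoint on t3 removed, cleaner still disabled, archival pulls t3 and
# t4.rc out of the active timeline before fg1_old was ever deleted:
print(sorted(visible_groups(active_timeline - {"t3", "t4.rc"})))
# ['fg1_new', 'fg1_old'] -- duplicates exposed
```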
> In other words, to summarize the different scenarios: 
> i. The replaced file group is never cleaned up. 
>     - ECTR (earliest commit to retain) is less than this rc, and we are good. 
> ii. The replaced file group is cleaned up. 
>     - ECTR is greater than this rc, and it is good to archive.
> iii. Tricky: ECTR moved ahead of this rc, but due to a savepoint, the full 
> clean-up did not happen. After the savepoint is removed, when archival is 
> executed, we should avoid archiving the rc of interest. This is the gap we 
> don't account for as of now.
>  
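The missing guard in case iii could look something like the following sketch (hypothetical function and parameter names; commits modeled as integers):

```python
# Before archiving a replace commit, verify that its replaced file groups
# were actually cleaned, instead of trusting ECTR alone (hypothetical sketch).

def safe_to_archive(commit, ectr, is_replace_commit, replaced_groups_cleaned):
    if commit >= ectr:
        return False                  # base guard: never archive past ECTR
    if is_replace_commit and not replaced_groups_cleaned(commit):
        return False                  # case iii: ECTR moved past this rc, but a
                                      # savepoint blocked the actual clean-up
    return True

# t4.rc with ECTR at t6: the base guard alone would let it through...
assert safe_to_archive(4, 6, True, lambda c: True) is True
# ...but if its replaced groups were never cleaned, archival must bail out.
assert safe_to_archive(4, 6, True, lambda c: False) is False
```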
> We have 3 options to solve this.
> Option A: 
> Let the savepoint deletion flow take care of cleaning up the files it is 
> tracking. 
> Cons:
> Removing data files is not the savepoint's responsibility, so from a 
> single-responsibility standpoint this may not be right. Also, this clean-up 
> might need to do what the clean planner actually does: i.e., build the file 
> system view, understand whether something is already supposed to be cleaned 
> up, and then clean up only the files that are supposed to be cleaned up. For 
> example, a file group that has only one file slice should not be cleaned up, 
> and scenarios like that. 
>  
> Option B:
> Since archival is the one which might cause data consistency issues, 

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7779:

Sprint: 2024/06/17-30, 2024/06/03-16  (was: 2024/06/03-16)

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7779:
-
Labels: pull-request-available  (was: )

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Option B:

Since archival is the one which might cause data consistency issues, why not 
have archival do the clean-up? 

We need to account for concurrent cleans, failure and retry scenarios, etc. 
Also, we might need to build the file system view and then decide whether 
something needs to be cleaned up before archiving it. 

Cons:

Again, the single-responsibility rule might be broken. It would be neat if the 
cleaner takes care of deleting data files and archival only takes care of 
deleting/archiving timeline files. 
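A rough sketch of what Option B implies (illustrative names only, not Hudi APIs; a real implementation would also have to handle concurrent cleans, failures, and retries):

```python
# Rough sketch of Option B (illustrative names, not Hudi APIs): archival
# itself deletes a replace commit's leftover replaced file versions before
# archiving the commit from the timeline.

class ToyFsView:
    """Toy file system view: uncleaned replaced file versions per replace commit."""
    def __init__(self, uncleaned):
        self.uncleaned = uncleaned                 # {rc: [file, ...]}

    def replaced_but_uncleaned(self, rc):
        return list(self.uncleaned.get(rc, []))

def archive_with_inline_clean(rc, view, deleted, archived):
    for f in view.replaced_but_uncleaned(rc):      # the clean planner's leftover work
        deleted.append(f)                          # archival deleting data files itself
    archived.append(rc)                            # only now is the rc safe to archive

deleted, archived = [], []
view = ToyFsView({"t4.rc": ["fg1_old/base_file_t1"]})  # left behind after savepoint removal
archive_with_inline_clean("t4.rc", view, deleted, archived)
print(deleted, archived)   # ['fg1_old/base_file_t1'] ['t4.rc']
```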

 

Option C:

Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner 
track another piece of metadata named "EarliestCommitToArchive". Strictly 
speaking, earliest commit to 


[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7779:

Fix Version/s: 0.16.0
   1.0.0

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7779:

Status: In Progress  (was: Open)

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> Archiving commits from active timeline could lead to data consistency issues 
> on some rarest of occasions. We should come up with proper guards to ensure 
> we do not make such unintended archival. 
>  
> Major gap which we wanted to guard is:
> if someone disabled cleaner, archival should account for data consistency 
> issues and ensure it bails out.
> We have a base guarding condition, where archival will stop at the earliest 
> commit to retain based on latest clean commit metadata. But there are few 
> other scenarios that needs to be accounted for. 
>  
> a. Keeping aside replace commits, lets dive into specifics for regular 
> commits and delta commits.
> Say user configured clean commits to 4 and archival configs to 5 and 6. after 
> t10, cleaner is supposed to clean up all file versions created at or before 
> t6. Say cleaner did not run(for whatever reason for next 5 commits). 
>     Archival will certainly be guarded until earliest commit to retain based 
> on latest clean commits. 
> Corner case to consider: 
> A savepoint was added to say t3 and later removed. and still the cleaner was 
> never re-enabled. Even though archival would have been stopped at t3 (until 
> savepoint is present),but once savepoint is removed, if archival is executed, 
> it could archive commit t3. Which means, file versions tracked at t3 is still 
> not yet cleaned by the cleaner. 
> Reasoning: 
> We are good here with respect to data consistency. Until the cleaner next 
> runs, these older file versions might be exposed to end users. But time 
> travel queries are not intended for already-cleaned-up commits, so this is 
> not an issue. None of snapshot, time travel, or incremental queries will run 
> into trouble, as they are not supposed to poll for t3. 
> At any later point, if the cleaner is re-enabled, it will take care of 
> cleaning up the file versions tracked at the t3 commit. It is just that, for 
> the interim period, some older file versions might still be exposed to 
> readers. 
>  
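The retention arithmetic in scenario (a) can be sketched as follows (an illustrative Python sketch, not Hudi code; the variable names are hypothetical stand-ins for the configs described above):

```python
# Timeline after t10: commits t1..t10.
commits = [f"t{i}" for i in range(1, 11)]

retained_commits = 4  # "clean commits to 4": cleaner keeps the latest 4

# Everything before the retained window (t7..t10) is eligible for cleaning,
# i.e. file versions created at or before t6.
cleanable = commits[:-retained_commits]
print(cleanable[-1])  # -> t6
```

If the cleaner never runs, the boundary derived from the latest clean commit metadata does not advance, which is exactly why archival stays guarded in this scenario.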
> b. The trickier part is when replace commits are involved. Since the replace 
> commit metadata in the active timeline is what ensures the replaced file 
> groups are ignored for reads, the cleaner is expected to clean them up fully 
> before that metadata is archived. But are there chances that this could go 
> wrong? 
> Corner case to consider: let's add to the above scenario, where t3 has a 
> savepoint, and t4 is a replace commit which replaced file groups tracked in 
> t3. 
> The cleaner will skip cleaning up files tracked by t3 (due to the presence 
> of the savepoint), but will clean up t4, t5 and t6. So the earliest commit 
> to retain will point to t6. Now say the savepoint for t3 is removed, but the 
> cleaner is disabled. In this state of the timeline, if archival is executed 
> (since t3's savepoint is removed), it might archive t3 and t4.rc. This could 
> lead to data duplicates, as both the replaced file groups and the new file 
> groups from t4.rc would be exposed as valid file groups. 
>  
> In other words, to summarize the different scenarios: 
> i. The replaced file group is never cleaned up. 
>     - ECTR (earliest commit to retain) is less than this.rc, so archival is 
> blocked and we are good. 
> ii. The replaced file group is cleaned up. 
>     - ECTR is greater than this.rc, and it is safe to archive.
> iii. Tricky: ECTR moved ahead of this.rc, but due to a savepoint, full 
> cleanup did not happen. After the savepoint is removed, when archival is 
> executed, we should avoid archiving the replace commit of interest. This is 
> the gap we do not account for as of now.
>  
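The three cases can be condensed into a conservative guard. This is an illustrative Python sketch, not Hudi's actual archival code; `can_archive` and its boolean inputs are hypothetical:

```python
def can_archive(instant, ectr, is_replace_commit=False,
                replaced_groups_cleaned=True):
    """Return True only if `instant` is safe to archive.

    Timestamps are single-digit instants ("t3", "t6"), so plain string
    comparison gives timeline order.
    """
    # Base guard: never archive at or beyond the earliest commit to retain.
    if instant >= ectr:
        return False
    # Guard for case iii: a replace commit whose replaced file groups were
    # never fully cleaned (e.g. a savepoint blocked the cleaner) must not
    # be archived even though ECTR has moved past it.
    if is_replace_commit and not replaced_groups_cleaned:
        return False
    return True

# i.  ECTR has not passed this.rc: the base guard blocks archival.
print(can_archive("t7", "t6", is_replace_commit=True,
                  replaced_groups_cleaned=False))  # -> False
# ii. Replaced file groups cleaned and ECTR moved past: safe to archive.
print(can_archive("t4", "t6", is_replace_commit=True))  # -> True
# iii. ECTR moved past t4.rc but cleanup was blocked: must not archive.
print(can_archive("t4", "t6", is_replace_commit=True,
                  replaced_groups_cleaned=False))  # -> False
```

Case iii is the gap described above: only the base ECTR check exists today, so the sketch's second condition is the kind of guard this ticket proposes to add.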
> We have three options to solve this.
> Option A: 
> Let the savepoint deletion flow take care of cleaning up the files it is 
> tracking. 
> Cons:
> Removing data files is not the savepoint's responsibility, so from a 
> single-responsibility standpoint this may not be right. Also, this cleanup 
> might need to do what a clean planner does: i.e., build the file system 
> view, understand whether the files were already supposed to be cleaned up, 
> and only then clean up those files. For example, if a file group has only 
> one file slice, it should not be cleaned up; there are more scenarios like 
> this. 
>  
> Option B:
> Since archival is what might cause the data consistency issues, why not 
> have archival do the cleanup? 
> We need to account for concurrent cleans, failure and retry scenarios, etc. 
> Also, we might need to build the file system view and then decide whether 
> something needs to be cleaned up before archiving it. 
> Cons:
> Again, the single-responsibility rule might be broken. It would be neat if 
> the cleaner took care of deleting data files and archival only took care of 
> deleting/archiving timeline files. 
>  
> Option C:
> Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner 
> track another metadata entry named "EarliestCommitToArchive". Strictly 
> speaking, earliest commit to 

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7779:

Sprint: 2024/06/03-16


[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-24 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-24 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Archiving commits from active timeline could lead to data consistency issues on 
some rarest of occasions. We should come up with proper guards to ensure we do 
not make such unintended archival. 

 

Major gap which we wanted to guard is:

if someone disabled cleaner, archival should account for data consistency 
issues and ensure it bails out.

We have a base guarding condition, where archival will stop at the earliest 
commit to retain based on latest clean commit metadata. But there are few other 
scenarios that needs to be accounted for. 

 

a. Keeping aside replace commits, lets dive into specifics for regular commits 
and delta commits.

Say user configured clean commits to 4 and archival configs to 5 and 6. after 
t10, cleaner is supposed to clean up all file versions created at or before t6. 
Say cleaner did not run(for whatever reason for next 5 commits). 

    Archival will certainly be guarded until earliest commit to retain based on 
latest clean commits. 

Corner case to consider: 

A savepoint was added to say t3 and later removed. and still the cleaner was 
never re-enabled. Even though archival would have been stopped at t3 (until 
savepoint is present),but once savepoint is removed, if archival is executed, 
it could archive commit t3. Which means, file versions tracked at t3 is still 
not yet cleaned by the cleaner. 

Reasoning: 

We are good here wrt data consistency. Up until cleaner runs next time, this 
older file versions might be exposed to the end-user. But time travel query is 
not intended for already cleaned up commits and hence this is not an issue. 
None of snapshot, time travel query or incremental query will run into issues 
as they are not supposed to poll for t3. 

At any later point, if cleaner is re-enabled, it will take care of cleaning up 
file versions tracked at t3 commit. Just that for interim period, some older 
file versions might still be exposed to readers. 

 

b. The more tricky part is when replace commits are involved. Since replace 
commit metadata in active timeline is what ensures the replaced file groups are 
ignored for reads, before archiving the same, cleaner is expected to clean them 
up fully. But are there chances that this could go wrong? 

Corner case to consider. Lets add onto above scenario, where t3 has a 
savepoint, and t4 is a replace commit which replaced file groups tracked in t3. 

Cleaner will skip cleaning up files tracked by t3(due to the presence of 
savepoint), but will clean up t4, t5 and t6. So, earliest commit to retain will 
be pointing to t6. And say savepoint for t3 is removed, but cleaner was 
disabled. In this state of the timeline, if archival is executed, (since 
t3.savepoint is removed), archival might archive t3 and t4.rc.  This could lead 
to data duplicates as both replaced file groups and new file groups from t4.rc 
would be exposed as valid file groups. 

 

In other words, if we were to summarize the different scenarios: 

i. replaced file group is never cleaned up. 
    - ECTR(Earliest commit to retain) is less than this.rc and we are good. 
ii. replaced file group is cleaned up. 
    - ECTR is > this.rc and is good to archive.
iii. tricky: ECTR moved ahead compared to this.rc, but due to savepoint, full 
clean up did not happen.  After savepoint is removed, and when archival is 
executed, we should avoid archiving the rc of interest. This is the gap we 
don't account for as of now.

 

We have three options to solve this.

Option A: 

Let the savepoint deletion flow take care of cleaning up the files it is tracking. 

Cons:

Removing data files is not a savepoint's responsibility, so from a 
single-responsibility standpoint this may not be right. Also, this clean-up might 
need to do what the clean planner already does: build the file system view, 
determine whether a file is supposed to be cleaned up, and only then delete it. 
For example, if a file group has only one file slice, it should not be cleaned 
up, and there are more scenarios like this. 
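The "only one file slice" caveat is the kind of clean-planner check Option A would force the savepoint deletion flow to duplicate. A minimal sketch, with hypothetical names (SavepointCleanupSketch, shouldCleanOldSlices are not Hudi APIs):

```java
import java.util.List;

// Hypothetical sketch of a check Option A would replicate from the clean planner.
public class SavepointCleanupSketch {

    // fileSlices: a file group's slices, newest first (illustrative representation).
    static boolean shouldCleanOldSlices(List<String> fileSlices) {
        // The latest (and possibly only) file slice must survive; only older
        // versions are candidates for cleaning.
        return fileSlices.size() > 1;
    }

    public static void main(String[] args) {
        System.out.println(shouldCleanOldSlices(List.of("slice-t9")));             // false
        System.out.println(shouldCleanOldSlices(List.of("slice-t9", "slice-t3"))); // true
    }
}
```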

 

Option B:

Since archival is what might cause the data consistency issues, why not have 
archival do the clean-up? 

We would need to account for concurrent cleans, failure-and-retry scenarios, etc. 
We might also need to build the file system view and decide whether something 
needs to be cleaned up before archiving it. 

Cons:

Again, this might break the single-responsibility rule. It would be neat if the 
cleaner takes care of deleting data files and archival only takes care of 
deleting/archiving timeline files. 

 

Option C:

Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner 
track additional metadata named "EarliestCommitToArchive". Strictly speaking, 
earliest commit to 

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--


[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 

We have two options to solve this.

*Option A:* 

Before archiving any replace commit, the archiver explicitly checks that all 
replaced file groups are fully deleted. 

Cons: This might need FileSystemView polling, which could be costly. 

*Option B:*

The cleaner also tracks additional metadata named "fully cleaned up file 
groups" at the end of clean planning and in the completed clean commit metadata. 

Archival, instead of polling the FileSystemView (which might be costly), can 
then check the clean commit metadata for this list of file groups and deduce 
whether all file groups replaced by X.rc are fully deleted. 

Pros: 

Since the clean planner polls the file system view anyway and already has all 
the file group info, no additional work should be required to deduce the "fully 
cleaned up file groups"; it just needs to record the additional metadata. 
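Option B could be wired together roughly as sketched below. The CleanCommitMetadata record and the fullyCleanedFileGroups field are assumed names for illustration, not Hudi's actual clean metadata schema:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of Option B (names are assumptions, not Hudi's schema).
public class CleanMetadataSketch {

    // What the clean planner would additionally record in completed
    // clean commit metadata.
    record CleanCommitMetadata(String instant, Set<String> fullyCleanedFileGroups) {}

    // Archival side: instead of polling the FileSystemView, union the
    // "fully cleaned" sets from past clean commit metadata and check that
    // every file group replaced by the replace commit is covered.
    static boolean allReplacedGroupsCleaned(Set<String> replacedFileGroups,
                                            List<CleanCommitMetadata> cleanHistory) {
        Set<String> cleaned = new HashSet<>();
        for (CleanCommitMetadata m : cleanHistory) {
            cleaned.addAll(m.fullyCleanedFileGroups());
        }
        return cleaned.containsAll(replacedFileGroups);
    }

    public static void main(String[] args) {
        List<CleanCommitMetadata> history = List.of(
            new CleanCommitMetadata("t8", Set.of("fg-1")),
            new CleanCommitMetadata("t10", Set.of("fg-2")));
        // t4.rc replaced fg-1 and fg-2; both fully cleaned -> safe to archive t4.rc.
        System.out.println(allReplacedGroupsCleaned(Set.of("fg-1", "fg-2"), history)); // true
        // fg-3 (savepointed earlier) was never fully cleaned -> hold archival.
        System.out.println(allReplacedGroupsCleaned(Set.of("fg-3"), history)); // false
    }
}
```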
