[ 
https://issues.apache.org/jira/browse/HUDI-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2458:
--------------------------------------
    Description: 
Relax compaction in metadata being fenced based on inflight requests in data 
table.

Compaction in the metadata table is triggered only if there are no inflight 
requests in the data table. This can cause a liveness problem: in very large 
deployments, compaction or clustering could always be in progress, so metadata 
compaction would never get a chance to run. So, we should try to see how we 
can relax this constraint.

 

Proposal to remove this dependency:

With the recent addition of the spurious deletes config, we can actually 
remove this dependency. 

As of now, we have 3 interlinked dependencies.

- Compaction in the metadata table may not kick in if there are any inflight 
operations in the data table. 

- A rollback, when applied to the metadata table, depends on the last 
compaction instant in the metadata table. We might even throw an exception if 
the instant to roll back is < the latest metadata compaction instant time. 

- Archival in the data table is fenced by the latest compaction in the 
metadata table. 
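
To make the three dependencies above concrete, here is a rough sketch in 
Python. This is not Hudi code: the `Timelines` class and all function names 
are hypothetical, instant times are modelled as plain comparable strings, and 
the exact archival rule (only instants older than the latest metadata 
compaction can be archived) is my assumption about what "fenced" means here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Timelines:
    """Toy model of the two timelines (hypothetical, not a Hudi class)."""
    # Pending (inflight) instants in the data table.
    data_inflight: List[str] = field(default_factory=list)
    # Instant time of the latest compaction in the metadata table, if any.
    latest_metadata_compaction: Optional[str] = None


def can_compact_metadata(t: Timelines) -> bool:
    # Dependency 1: skip metadata compaction while anything is inflight
    # in the data table.
    return len(t.data_inflight) == 0


def check_rollback(t: Timelines, instant_to_rollback: str) -> None:
    # Dependency 2: a rollback of an instant older than the latest metadata
    # compaction may be rejected with an exception.
    latest = t.latest_metadata_compaction
    if latest is not None and instant_to_rollback < latest:
        raise ValueError(
            f"cannot roll back {instant_to_rollback}: older than latest "
            f"metadata compaction {latest}"
        )


def can_archive_data_instant(t: Timelines, instant: str) -> bool:
    # Dependency 3 (assumed form): only data table instants older than the
    # latest metadata compaction are eligible for archival.
    latest = t.latest_metadata_compaction
    return latest is not None and instant < latest
```

In this model, a single entry that never leaves `data_inflight` keeps 
`can_compact_metadata` returning `False`, so `latest_metadata_compaction` 
stops advancing and archival in the data table stalls with it.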

 

So, in case the data timeline has any dangling inflight operation (let's say 
someone tried clustering, killed it midway, and never attempted it again), 
metadata compaction will never kick in again. I still need to check what 
archival does for such inflight operations in the data table. 

 

So, with the spurious deletes support which we added recently, all of this can 
be much simplified. 

Whenever we want to apply a rollback commit, we don't need to take different 
actions based on whether the commit being rolled back is already committed to 
the metadata table or not. Just go ahead and apply the rollback. Merging of 
metadata payload records will take care of this. If the commit was already 
synced, the final merged payload may not have any spurious deletes. If the 
commit being rolled back was never committed to metadata, the final merged 
payload may have some spurious deletes, which we can ignore. 
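
A minimal sketch of the merge idea, in Python. This is not the actual 
metadata payload implementation: `apply_rollback` is a hypothetical helper, 
and a partition's file listing is modelled as a plain dict of file name to 
size. The point is only that a delete for a file that was never listed is 
collected and ignored instead of being treated as an error.

```python
from typing import Dict, Iterable, Set, Tuple


def apply_rollback(
    files: Dict[str, int], rolled_back_files: Iterable[str]
) -> Tuple[Dict[str, int], Set[str]]:
    """Merge the deletes from a rollback into a partition's file listing.

    Returns the merged listing and the set of spurious deletes, i.e. deletes
    for files that were never present in the listing.
    """
    merged = dict(files)  # do not mutate the caller's listing
    spurious: Set[str] = set()
    for name in rolled_back_files:
        if name in merged:
            del merged[name]      # commit was synced: remove its file
        else:
            spurious.add(name)    # commit was never synced: ignore the delete
    return merged, spurious
```

For example, rolling back a commit that wrote `f2.parquet` gives the same 
final listing in both cases: starting from `{"f1.parquet": 100, 
"f2.parquet": 200}` (commit already synced) there are no spurious deletes, 
and starting from `{"f1.parquet": 100}` (commit never synced) the delete of 
`f2.parquet` is reported as spurious and the listing is left unchanged.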

With this, compaction in the metadata table does not need to have any 
dependency on inflight operations in the data table. 

And we can loosen up the dependency of archival in the data table on metadata 
table compaction as well. 

So, in summary, all 3 dependencies quoted above will be moot if we go with 
this approach. 

In particular, our logic to apply rollback metadata to the metadata table will 
become a lot simpler and easier to reason about. 

  was:
Relax compaction in metadata being fenced based on inflight requests in data 
table.

 

Compaction in metadata is triggered only if there are no inflight requests in 
data table. This might cause liveness problem since for very large deployments, 
we could either have compaction or clustering always in progress. So, we should 
try to see how we can relax this constraint.


> Relax compaction in metadata being fenced based on inflight requests in data 
> table
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-2458
>                 URL: https://issues.apache.org/jira/browse/HUDI-2458
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.11.0
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
