[GitHub] [hudi] codope commented on pull request #3142: [HUDI-1483] Support async clustering for deltastreamer and Spark streaming

GitBox Tue, 13 Jul 2021 09:18:18 -0700


codope commented on pull request #3142:
URL: https://github.com/apache/hudi/pull/3142#issuecomment-879223675



   > Hi @codope, I just want to know whether this async clustering function can
   > handle the following scenario without losing data:
   >
   > There are 3 small file groups named fg1, fg2 and fg3, containing file
   > slice1, file slice2 and file slice3 respectively.
   >
   > When the async scheduler **has started to make a clustering plan but has not
   > finished**, there is an inflight or requested commit for fg1 which will
   > create file slice11 based on file slice1. In other words, **file slice11 is
   > being created but is not committed**. I believe this scenario is similar to
   > multi writers.
   >
   > What will this async clustering function do?
   > Will the clustering plan contain file slice1? If it does, I think the
   > new data in file slice11 will be lost.
   >
   > Looking forward to your reply, thanks a lot.
   
   @zhangyue19921010 It depends on when, during clustering planning, file
slice11 is created. If it is created before
`ClusteringPlanStrategy#getFileSlicesEligibleForClustering` is invoked, the
clustering plan will not contain file slice1. So, just as with multi writers,
there is a race condition here. However, during the actual clustering, the
default (and currently only) strategy is to reject updates. So, it will throw
an exception after seeing that there is a file group with an update (in this
case fg1). That file group should get picked up in the next run of clustering.
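
To make the two cases concrete, here is a toy model of the behaviour described above. It is only a sketch in Python, not Hudi code: `Timeline`, `eligible_file_slices`, `execute_clustering` and `ClusteringUpdateRejected` are hypothetical names, and the model assumes that planning skips file groups with a pending write and that the reject-updates check runs when clustering executes. Which component actually raises the exception in Hudi is not modelled here.

```python
from dataclasses import dataclass, field


class ClusteringUpdateRejected(Exception):
    """Hypothetical error: a file group in the plan has received an update."""


@dataclass
class Timeline:
    # file group id -> latest committed file slice
    committed: dict = field(default_factory=dict)
    # file group ids that have a requested or inflight write
    pending: set = field(default_factory=set)


def eligible_file_slices(timeline):
    """Planning step: snapshot slices of file groups with no pending write."""
    return {
        fg: file_slice
        for fg, file_slice in timeline.committed.items()
        if fg not in timeline.pending
    }


def execute_clustering(plan, timeline):
    """Execution step: reject the run if any planned file group changed."""
    for fg, planned_slice in plan.items():
        changed = timeline.committed.get(fg) != planned_slice
        if fg in timeline.pending or changed:
            raise ClusteringUpdateRejected(f"file group {fg} has an update")
    return sorted(plan)  # file groups that were clustered


timeline = Timeline(
    committed={"fg1": "slice1", "fg2": "slice2", "fg3": "slice3"}
)

# Case A: the write to fg1 is already pending when planning takes its snapshot,
# so fg1 (and therefore file slice1) is left out of the plan.
timeline.pending.add("fg1")
plan_a = eligible_file_slices(timeline)

# Case B: planning runs first and only then does the write to fg1 start.
# The plan contains file slice1, and the run is rejected at execution time.
timeline.pending.clear()
plan_b = eligible_file_slices(timeline)
timeline.pending.add("fg1")
try:
    execute_clustering(plan_b, timeline)
    outcome = "clustered"
except ClusteringUpdateRejected:
    outcome = "rejected"
```

In case B nothing is rewritten, so the data going into file slice11 is not lost; the rejected file groups are simply left for a later clustering run.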


