vinothchandar commented on code in PR #18259: URL: https://github.com/apache/hudi/pull/18259#discussion_r2862112031
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -201,6 +201,17 @@ To identify these references, we have three options:

 **Option 1 will be implemented in milestone 1.**

+**Implementation Details**:
+
+The main assumption for out-of-line, managed blobs is that they will be used only once. This implies that a blob will not be referenced by multiple rows in the dataset. Similarly, once a row is updated to point to a new blob, the old blob will no longer be referenced.
+
+The cleaner plan will remain the same, but during cleaner execution we will search for blobs that are no longer referenced by iterating through the files being removed and creating a dataset of the managed blob references contained in those files. Then we will create a dataset of the remaining blob references and use the `HoodieEngineContext` to left-join it with the removed blob references to identify the unreferenced blobs. These unreferenced blobs will then be deleted from storage.
+The blob deletion must therefore happen before removing the files marked for deletion. If the cleaner crashes during execution, we should be able to re-run the plan in an idempotent manner. To account for this, we can skip any files that are already deleted when searching for dereferenced blobs.
+
+If global updates are enabled for the table, we will need to search through all the file slices, since data can move between partitions. If global updates are not enabled, we can limit the search with the following optimizations:
+- For files that are being removed but have a newer file slice in the file group, we can limit the search to files within the same file group.
+- For files that are being removed and do not have a newer file slice in the file group (this will occur during replace commits & clustering), we will need to inspect all the retained files in the partition that were created after the removed file slice, since data can move between file groups within the same partition.
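The left-join described in the quoted RFC text can be sketched as a simple anti-join: any blob reference found in the files being cleaned that has no match among the references retained by live file slices is safe to delete. The sketch below is illustrative only; the function and variable names (`find_unreferenced_blobs`, `removed`, `remaining`) are hypothetical and do not correspond to Hudi's actual `HoodieEngineContext` API.

```python
# Hypothetical sketch of the blob-reclamation step during cleaner execution.
# In Hudi this would be a distributed left join via HoodieEngineContext; here
# the same logic is modeled with an in-memory set for clarity.

def find_unreferenced_blobs(removed_blob_refs, remaining_blob_refs):
    """Return the blob refs from the cleaned files that no retained
    file slice still references (i.e. the left-join misses)."""
    remaining = set(remaining_blob_refs)
    return [ref for ref in removed_blob_refs if ref not in remaining]

# Blob paths referenced by the file slices being removed by this clean:
removed = ["blobs/a.bin", "blobs/b.bin", "blobs/c.bin"]
# Blob paths still referenced by the retained file slices:
remaining = ["blobs/b.bin"]

unreferenced = find_unreferenced_blobs(removed, remaining)
# unreferenced == ["blobs/a.bin", "blobs/c.bin"]; these must be deleted
# from storage *before* the cleaned files themselves are removed.
```

Per the RFC text, re-running this step after a crash stays idempotent because already-deleted cleaned files are simply skipped when rebuilding `removed`.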
Review Comment:
   Same as above: the problematic race, except R is a concurrent replacecommit that cleaning execution does not see.

##########
rfc/rfc-100/rfc-100.md:
##########
@@ -201,6 +201,17 @@ To identify these references, we have three options:

 **Option 1 will be implemented in milestone 1.**

+**Implementation Details**:
+
+The main assumption for out-of-line, managed blobs is that they will be used only once. This implies that a blob will not be referenced by multiple rows in the dataset. Similarly, once a row is updated to point to a new blob, the old blob will no longer be referenced.
+
+The cleaner plan will remain the same, but during cleaner execution we will search for blobs that are no longer referenced by iterating through the files being removed and creating a dataset of the managed blob references contained in those files. Then we will create a dataset of the remaining blob references and use the `HoodieEngineContext` to left-join it with the removed blob references to identify the unreferenced blobs. These unreferenced blobs will then be deleted from storage.

Review Comment:
   yeah, agree. if it's all correct, then we should limit the view, so it's deterministic and isolated.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
