[ https://issues.apache.org/jira/browse/HUDI-7518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7518:
----------------------------
    Description: 
When there are repeated deletes of the same file in the partition file list in 
the files partition of the metadata table (MDT), the current 
HoodieMetadataPayload merging logic drops such a deletion. As a result, a file 
that has been deleted from the file system, and should also be removed from the 
MDT file listing, is still left in the MDT, because of the following logic:
{code:java}
  private Map<String, HoodieMetadataFileInfo> combineFileSystemMetadata(HoodieMetadataPayload previousRecord) {
    Map<String, HoodieMetadataFileInfo> combinedFileInfo = new HashMap<>();

    // First, add all files listed in the previous record
    if (previousRecord.filesystemMetadata != null) {
      combinedFileInfo.putAll(previousRecord.filesystemMetadata);
    }

    // Second, merge in the files listed in the new record
    if (filesystemMetadata != null) {
      validatePayload(type, filesystemMetadata);

      filesystemMetadata.forEach((key, fileInfo) -> {
        combinedFileInfo.merge(key, fileInfo,
            // Returning null from the remapping function removes the mapping,
            // so a delete merged on top of an earlier delete of the same file
            // erases the deletion marker instead of keeping it.
            (oldFileInfo, newFileInfo) ->
                newFileInfo.getIsDeleted()
                    ? null
                    : new HoodieMetadataFileInfo(Math.max(newFileInfo.getSize(), oldFileInfo.getSize()), false));
      });
    }
{code}
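To see why this drops the deletion, note that java.util.Map#merge removes the 
mapping when the remapping function returns null. Below is a minimal, 
self-contained sketch (using a simplified stand-in for HoodieMetadataFileInfo, 
not the actual Hudi classes) of what happens when a delete record is merged on 
top of an earlier delete of the same file:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class RepeatedDeleteMergeSketch {
  // Simplified stand-in for HoodieMetadataFileInfo, for illustration only.
  record FileInfo(long size, boolean isDeleted) {}

  public static void main(String[] args) {
    Map<String, FileInfo> combined = new HashMap<>();

    // Previous record: the file is already marked as deleted.
    combined.put("data-file.parquet", new FileInfo(0, true));

    // New record: a repeated delete of the same file.
    FileInfo repeatedDelete = new FileInfo(0, true);

    // Same remapping function as in combineFileSystemMetadata: because the new
    // entry is a delete, the function returns null and Map#merge removes the
    // mapping entirely, losing the deletion marker.
    combined.merge("data-file.parquet", repeatedDelete,
        (oldInfo, newInfo) -> newInfo.isDeleted()
            ? null
            : new FileInfo(Math.max(newInfo.size(), oldInfo.size()), false));

    System.out.println(combined); // prints {} -- no deletion marker survives
  }
}
{code}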
Here's a concrete example of how this bug causes the ingestion to fail (see the 
sketch after the steps):

(1) A data file and its file group are replaced by clustering. The data file is 
still on the file system and in the MDT file listing.

(2) A cleaner plan is generated to delete the data file.

(3) The cleaner plan is executed for the first time and fails before committing 
due to a Spark job shutdown.

(4) The ingestion continues
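Assuming the usual merge order (the two delete records for the same key are 
combined with each other first, and the result is then merged onto the base 
file listing), here is a hedged, self-contained sketch of how the steps above 
leave the deleted file in the MDT listing. The FileInfo type and combine() 
helper are simplified stand-ins, not the actual Hudi classes:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class CleanerRetryScenarioSketch {
  // Simplified stand-in for HoodieMetadataFileInfo, for illustration only.
  record FileInfo(long size, boolean isDeleted) {}

  // Simplified version of the combineFileSystemMetadata logic shown above:
  // entries of the newer payload are merged on top of the previous payload.
  static Map<String, FileInfo> combine(Map<String, FileInfo> previous, Map<String, FileInfo> newer) {
    Map<String, FileInfo> combined = new HashMap<>(previous);
    newer.forEach((key, info) -> combined.merge(key, info,
        (oldInfo, newInfo) -> newInfo.isDeleted()
            ? null
            : new FileInfo(Math.max(newInfo.size(), oldInfo.size()), false)));
    return combined;
  }

  public static void main(String[] args) {
    // (1) After clustering, the replaced data file is still listed in the MDT.
    Map<String, FileInfo> baseListing = Map.of("replaced-data-file.parquet", new FileInfo(1024, false));

    // (3) The first cleaner attempt records a delete and fails before commit;
    //     the retried attempt records the same delete again.
    Map<String, FileInfo> firstDelete = Map.of("replaced-data-file.parquet", new FileInfo(0, true));
    Map<String, FileInfo> secondDelete = Map.of("replaced-data-file.parquet", new FileInfo(0, true));

    // Merging the two deletes with each other removes the mapping, so no
    // deletion marker is left to apply to the base listing.
    Map<String, FileInfo> mergedDeletes = combine(firstDelete, secondDelete); // {}

    // Merging the (now empty) delete payload onto the base listing leaves the
    // file in the MDT even though it no longer exists on storage, so a later
    // read that trusts the MDT listing can fail on the missing file.
    Map<String, FileInfo> finalListing = combine(baseListing, mergedDeletes);
    System.out.println(finalListing); // still contains replaced-data-file.parquet
  }
}
{code}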

> Fix HoodieMetadataPayload merging logic around repeated deletes
> ---------------------------------------------------------------
>
>                 Key: HUDI-7518
>                 URL: https://issues.apache.org/jira/browse/HUDI-7518
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.15.0, 1.0.0
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
