[ https://issues.apache.org/jira/browse/NIFI-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083460#comment-18083460 ]

ASF subversion and git services commented on NIFI-15969:
--------------------------------------------------------

Commit 74c07276a71bae160da571058eebd5c4196f7c09 in nifi's branch 
refs/heads/main from Rakesh Kumar Singh
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=74c07276a71 ]

NIFI-15969 Fixed PutS3Object multipart upload data corruption for concurrent 
FlowFiles with same S3 key

Previously the multipart upload state was tracked using only the processor
identifier, bucket name, and object key. When two FlowFiles with the same name
were uploaded concurrently to the same bucket, they shared the same state
tracking key, causing parts from different uploads to be interleaved and
resulting in a corrupt S3 object.

Included the FlowFile UUID in the state tracking key so each FlowFile maintains
its own independent multipart upload state. Retries of the same FlowFile retain
the same UUID and continue to benefit from state resumption. A FlowFile with a
new UUID starts a fresh upload rather than inheriting stale state.
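
As a rough illustration of the change described above (not the actual
PutS3Object code), the sketch below builds a state tracking key with and
without the FlowFile UUID. The class name, method names, and the "/" separator
are all hypothetical and chosen only for this example; the real key format in
NiFi may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class MultipartStateKeySketch {

    // Hypothetical pre-fix key: processor identifier, bucket name, and object key only.
    static String legacyKey(String processorId, String bucket, String objectKey) {
        return String.join("/", processorId, bucket, objectKey);
    }

    // Hypothetical post-fix key: the same fields plus the FlowFile UUID.
    static String perFlowFileKey(String processorId, String bucket, String objectKey,
                                 String flowFileUuid) {
        return String.join("/", processorId, bucket, objectKey, flowFileUuid);
    }

    // Simulates two FlowFiles with the same object key registering multipart state
    // and returns how many distinct state entries exist afterwards.
    static int distinctEntries(boolean includeUuid) {
        String processorId = "processor-1";   // illustrative values only
        String bucket = "example-bucket";
        String objectKey = "report.csv";
        String[] flowFileUuids = {
            "11111111-1111-1111-1111-111111111111",
            "22222222-2222-2222-2222-222222222222"
        };

        Map<String, String> state = new HashMap<>();
        for (String uuid : flowFileUuids) {
            String key = includeUuid
                    ? perFlowFileKey(processorId, bucket, objectKey, uuid)
                    : legacyKey(processorId, bucket, objectKey);
            // The second put overwrites the first when both FlowFiles map to the same key.
            state.put(key, "upload-state-for-" + uuid);
        }
        return state.size();
    }

    public static void main(String[] args) {
        System.out.println("entries without UUID: " + distinctEntries(false)); // 1
        System.out.println("entries with UUID: " + distinctEntries(true));     // 2
    }
}
```

Without the UUID, both FlowFiles resolve to one entry and would share upload
state; with it, each FlowFile gets its own entry, while a retry that keeps the
same UUID still resolves to the entry it created earlier.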

This closes #11279.

Signed-off-by: Peter Turcsanyi <[email protected]>


> PutS3Object can corrupt data when two files with the same name are 
> simultaneously uploaded with multipart
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-15969
>                 URL: https://issues.apache.org/jira/browse/NIFI-15969
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Paul Kelly
>            Assignee: Rakesh Kumar Singh
>            Priority: Major
>             Fix For: 2.9.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a very rare edge case, but is something we have seen happen a handful 
> of times over the years.  It happened again recently and I was able to review 
> logs to identify the cause.
> PutS3Object keeps track of its multipart state based only on the bucket name 
> and object key.  If you try to simultaneously upload two files to a bucket 
> and those files have the same name, the various parts will get mixed together 
> and the data that ultimately ends up in S3 is corrupt.
> For this to happen, the files have to have the same name, use the same 
> tracking directory (either because it's the same node with local storage or 
> because it's using a network storage across different nodes), be large enough 
> that they get uploaded with multipart, and be large enough that they are both 
> uploading at the same time.  Because of how the multipart state is tracked, 
> it doesn't matter if a single PutS3Object processor is scheduled with 
> multiple threads, or if two different PutS3Object processors on the same NiFi 
> node happen to upload files with the same names to the same bucket.
> I know this is rare, but there are valid uses for sending data with the same 
> name and expecting two versions to end up in the bucket.  We see it when we 
> use NiFi to download data from a system where we do not control the file 
> names, and store those results in a versioned S3 bucket.
> For a versioned bucket, you should end up with two different versions of 
> the object, one for each upload.  For a non-versioned bucket, whichever 
> upload finishes last should replace the first object.  Instead, we end up 
> with one corrupt object containing parts of both uploads, regardless of 
> versioning.
> I think it would make sense to add the flow file's uuid into the state 
> tracking so that the state for two different flow files cannot be mixed 
> together.  If one flow file needs to retry the upload, most of the time it 
> will have the same uuid and PutS3Object can restore the state as it does now. 
> If it has a new uuid, the upload will start again from the beginning with a 
> fresh state, which is better than ending up with a corrupt file in the 
> mentioned scenario.
> There are ways to work around this, of course, such as appending the uuid to 
> the key name when uploading the file, but this is not ideal for when we don't 
> control the downstream system that ingests from the S3 bucket.  Likewise, we 
> could also schedule PutS3Object to only run one thread, but this has 
> throughput implications, and it still doesn't fix the issue of having two 
> PutS3Objects uploading files with the same name at the same time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
