[ 
https://issues.apache.org/jira/browse/NIFI-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085295#comment-18085295
 ] 

Paul Kelly commented on NIFI-15969:
-----------------------------------

Thank you [~rakesh03] and [~turcsanyip] for your quick help with this.  I just 
built an updated nar based on this PR and confirmed that it fixes the issue 
with a single processor and multiple threads.  That was the big one for us, so 
thank you for fixing it.  I'll review the other scenario where two different 
processors are involved and open a new issue if there is still another problem. 
 It's likely that I misinterpreted something from one of my tests.

> PutS3Object can corrupt data when two files with the same name are 
> simultaneously uploaded with multipart
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-15969
>                 URL: https://issues.apache.org/jira/browse/NIFI-15969
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Paul Kelly
>            Assignee: Rakesh Kumar Singh
>            Priority: Major
>             Fix For: 2.10.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is a very rare edge case, but is something we have seen happen a handful 
> of times over the years.  It happened again recently and I was able to review 
> logs to identify the cause.
> PutS3Object keeps track of its multipart state based only on the bucket name 
> and object key.  If you try to simultaneously upload two files to a bucket 
> and those files have the same name, the various parts will get mixed together 
> and the data that ultimately ends up in S3 is corrupt.
> For this to happen, the files have to have the same name, use the same 
> tracking directory (either because it's the same node with local storage or 
> because it's using network storage shared across different nodes), be large enough 
> that they get uploaded with multipart, and be large enough that they are both 
> uploading at the same time.  Because of how the multipart state is tracked, 
> it doesn't matter if a single PutS3Object processor is scheduled with 
> multiple threads, or if two different PutS3Object processors on the same NiFi 
> node happen to upload files with the same names to the same bucket.
> I know this is rare, but there are valid uses for sending data with the same 
> name and expecting two versions to end up in the bucket.  We see it when we 
> use NiFi to download data from a system where we do not control the file 
> names, and store those results in a versioned S3 bucket.
> For a versioned bucket, ultimately what should happen is you should end up 
> with two different versions of the object, one for each upload.  For a 
> non-versioned bucket, whichever upload finishes last would replace the first 
> object.  What is happening is we end up with one corrupt object containing 
> parts of both uploads regardless of the versioning.
> I think it would make sense to add the flow file's uuid into the state 
> tracking so that the state for two different flow files cannot be mixed 
> together.  If one flow file needs to retry the upload, most of the time it 
> will have the same uuid and PutS3Object can restore the state as it does now. 
>  If it has a new uuid, the upload will start again from the beginning with a 
> fresh state, which is better than ending up with a corrupt file in the 
> mentioned scenario.
> There are ways to work around this, of course, such as appending the uuid to 
> the key name when uploading the file, but this is not ideal for when we don't 
> control the downstream system that ingests from the S3 bucket.  Likewise, we 
> could also schedule PutS3Object to only run one thread, but this has 
> throughput implications, and it still doesn't fix the issue of having two 
> PutS3Objects uploading files with the same name at the same time.
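
The state-tracking collision described in the quoted issue, and the proposed fix of adding the flow file's uuid to the tracking key, can be illustrated with a minimal sketch. This is not NiFi's actual code: the class name, method names, and key format below are hypothetical and chosen only to show why a key built from bucket and object key alone lets two concurrent uploads share one state entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the NiFi code base.
final class MultipartStateSketch {

    // Key built from bucket name and object key only, as the issue describes.
    static String keyWithoutUuid(String bucket, String objectKey) {
        return bucket + "/" + objectKey;
    }

    // Proposed variant: the flow file uuid is part of the key (assumed format).
    static String keyWithUuid(String bucket, String objectKey, String flowFileUuid) {
        return bucket + "/" + objectKey + "/" + flowFileUuid;
    }

    // Counts how many separate state entries two concurrent uploads would get.
    static int distinctStates(String keyA, String keyB) {
        Map<String, String> tracked = new HashMap<>();
        tracked.put(keyA, "state of upload A");
        tracked.put(keyB, "state of upload B"); // overwrites A when the keys are equal
        return tracked.size();
    }

    public static void main(String[] args) {
        String bucket = "example-bucket";   // assumed sample values
        String objectKey = "report.csv";

        // Same bucket and key: both uploads map to a single shared entry.
        System.out.println(distinctStates(
                keyWithoutUuid(bucket, objectKey),
                keyWithoutUuid(bucket, objectKey)));

        // Different flow file uuids: each upload keeps its own entry.
        System.out.println(distinctStates(
                keyWithUuid(bucket, objectKey, "uuid-1"),
                keyWithUuid(bucket, objectKey, "uuid-2")));
    }
}
```

With the uuid in the key, a retry of the same flow file still finds its earlier state, while a different flow file with the same name starts from a fresh entry, which matches the behavior the description asks for.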



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
