[ https://issues.apache.org/jira/browse/NIFI-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Turcsanyi resolved NIFI-15969.
------------------------------------
Fix Version/s: 2.10.0 (was: 2.9.0)
Resolution: Fixed
> PutS3Object can corrupt data when two files with the same name are
> simultaneously uploaded with multipart
> ---------------------------------------------------------------------------------------------------------
>
> Key: NIFI-15969
> URL: https://issues.apache.org/jira/browse/NIFI-15969
> Project: Apache NiFi
> Issue Type: Bug
> Reporter: Paul Kelly
> Assignee: Rakesh Kumar Singh
> Priority: Major
> Fix For: 2.10.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This is a very rare edge case, but is something we have seen happen a handful
> of times over the years. It happened again recently and I was able to review
> logs to identify the cause.
> PutS3Object keeps track of its multipart state based only on the bucket name
> and object key. If you try to simultaneously upload two files to a bucket
> and those files have the same name, the various parts will get mixed together
> and the data that ultimately ends up in S3 is corrupt.
> For this to happen, the files have to have the same name, use the same
> tracking directory (either because they are on the same node with local
> storage or because different nodes share network storage), and be large
> enough that they are uploaded with multipart and that both uploads are in
> progress at the same time. Because of how the multipart state is tracked,
> it doesn't matter whether a single PutS3Object processor is scheduled with
> multiple threads or two different PutS3Object processors on the same NiFi
> node happen to upload files with the same names to the same bucket.
> I know this is rare, but there are valid uses for sending data with the same
> name and expecting two versions to end up in the bucket. We see it when we
> use NiFi to download data from a system where we do not control the file
> names, and store those results in a versioned S3 bucket.
> For a versioned bucket, you should ultimately end up with two different
> versions of the object, one for each upload. For a non-versioned bucket,
> whichever upload finishes last should replace the first object. What
> actually happens is that we end up with one corrupt object containing
> parts of both uploads, regardless of versioning.
> I think it would make sense to add the flow file's uuid into the state
> tracking so that the state for two different flow files cannot be mixed
> together. If one flow file needs to retry the upload, most of the time it
> will have the same uuid and PutS3Object can restore the state as it does now.
> If it has a new uuid, the upload will start again from the beginning with a
> fresh state, which is better than ending up with a corrupt file in the
> mentioned scenario.
> There are ways to work around this, of course, such as appending the uuid to
> the key name when uploading the file, but this is not ideal when we don't
> control the downstream system that ingests from the S3 bucket. Likewise, we
> could schedule PutS3Object to run only one thread, but this has
> throughput implications, and it still doesn't fix the issue of two
> PutS3Object processors uploading files with the same name at the same time.
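
The collision described above, and the effect of the proposed uuid-based
key, can be sketched in a few lines of Java. This is only an illustration
under stated assumptions: it is not the PutS3Object implementation and not
necessarily how the fix was made. The class name, method names, key layout,
bucket name, object key and uuids are all hypothetical, and an in-memory map
stands in for the shared tracking directory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the NiFi code base.
final class MultipartStateKeySketch {

    // Key scheme as described in the report: bucket name and object key only.
    static String keyByObject(String bucket, String objectKey) {
        return bucket + "/" + objectKey;
    }

    // Proposed scheme: the flow file's uuid is also part of the state key.
    static String keyByObjectAndUuid(String bucket, String objectKey, String flowFileUuid) {
        return bucket + "/" + objectKey + "/" + flowFileUuid;
    }

    // Appends one uploaded part to the state stored under the given key.
    static void recordPart(Map<String, List<String>> store, String stateKey, String part) {
        store.computeIfAbsent(stateKey, k -> new ArrayList<>()).add(part);
    }

    // Interleaves two uploads of the same object and returns the resulting
    // state store (assumed stand-in for the shared tracking directory).
    static Map<String, List<String>> simulate(boolean includeUuid) {
        Map<String, List<String>> store = new LinkedHashMap<>();
        String bucket = "example-bucket";      // assumed bucket name
        String objectKey = "report.csv";       // assumed object key
        String[] uuids = {"uuid-a", "uuid-b"}; // assumed flow file uuids
        for (int part = 1; part <= 2; part++) {
            for (String uuid : uuids) {
                String stateKey = includeUuid
                        ? keyByObjectAndUuid(bucket, objectKey, uuid)
                        : keyByObject(bucket, objectKey);
                recordPart(store, stateKey, uuid + ":part" + part);
            }
        }
        return store;
    }

    public static void main(String[] args) {
        System.out.println("bucket/key only: " + simulate(false));
        System.out.println("with uuid:       " + simulate(true));
    }
}
```

With the bucket/key scheme both uploads write into a single state entry, so
their parts are interleaved. With the uuid in the key, each flow file keeps
its own entry, and a retry that carries the same uuid would still find its
earlier state.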
--
This message was sent by Atlassian Jira
(v8.20.10#820010)