[
https://issues.apache.org/jira/browse/NIFI-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083469#comment-18083469
]
Peter Turcsanyi commented on NIFI-15969:
----------------------------------------
[~pkelly.nifi] The fix has been merged into the Apache NiFi codebase. Could you
please test it when it is available in the next release?
* same processor with multiple threads uploading the same object key: This
should be fixed now.
* different processors uploading the same object key: Each processor has its
own persistence file, with the processor UUID as the filename, so separate
processors should not have interfered with each other even before this fix.
Please double-check this scenario, because there may be another issue that
this fix does not address.
* same processor in cluster using the same network directory as Temporary
Directory Multipart State: This setup must be avoided. Even though the entry
names in the file are now unique, concurrent modification of the file can still
cause corruption, since read/write synchronization happens only within a single
JVM.
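
For illustration, here is a minimal sketch of why unique entry names keep
concurrent uploads of the same object key apart. It is not the merged code: the
class MultipartStateSketch, its method names, the "/" separator, and the
in-memory map standing in for the persistence file are all assumptions made for
this example, and the actual entry format in the fix may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; this is NOT the PutS3Object implementation.
final class MultipartStateSketch {

    // In-memory stand-in for the entries of a per-processor persistence file.
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    // Entry name built from bucket and object key only:
    // two concurrent uploads of the same key share one entry.
    static String legacyEntryName(String bucket, String key) {
        return bucket + "/" + key;
    }

    // Assumed unique entry name: adding the flow file UUID gives each
    // concurrent upload of the same key its own entry.
    static String uniqueEntryName(String bucket, String key, String flowFileUuid) {
        return bucket + "/" + key + "/" + flowFileUuid;
    }

    void save(String entryName, String uploadState) {
        entries.put(entryName, uploadState);
    }

    String load(String entryName) {
        return entries.get(entryName);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        MultipartStateSketch legacy = new MultipartStateSketch();
        legacy.save(legacyEntryName("my-bucket", "data.bin"), "state of upload A");
        legacy.save(legacyEntryName("my-bucket", "data.bin"), "state of upload B");
        System.out.println("legacy entries: " + legacy.size());

        MultipartStateSketch unique = new MultipartStateSketch();
        unique.save(uniqueEntryName("my-bucket", "data.bin", "uuid-a"), "state of upload A");
        unique.save(uniqueEntryName("my-bucket", "data.bin", "uuid-b"), "state of upload B");
        System.out.println("unique entries: " + unique.size());
    }
}
```

With bucket-and-key names the second save overwrites the first, which is how
two uploads can end up sharing state. With the UUID in the name each upload
keeps its own entry. Note that this only separates the entries; it does nothing
for the clustered shared-directory case above, where several JVMs write the
same file.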
> PutS3Object can corrupt data when two files with the same name are
> simultaneously uploaded with multipart
> ---------------------------------------------------------------------------------------------------------
>
> Key: NIFI-15969
> URL: https://issues.apache.org/jira/browse/NIFI-15969
> Project: Apache NiFi
> Issue Type: Bug
> Reporter: Paul Kelly
> Assignee: Rakesh Kumar Singh
> Priority: Major
> Fix For: 2.10.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This is a very rare edge case, but is something we have seen happen a handful
> of times over the years. It happened again recently and I was able to review
> logs to identify the cause.
> PutS3Object keeps track of its multipart state based only on the bucket name
> and object key. If you try to simultaneously upload two files to a bucket
> and those files have the same name, the various parts will get mixed together
> and the data that ultimately ends up in S3 is corrupt.
> For this to happen, the files have to have the same name, use the same
> tracking directory (either because it's the same node with local storage or
> because it's using network storage shared across different nodes), be large enough
> that they get uploaded with multipart, and be large enough that they are both
> uploading at the same time. Because of how the multipart state is tracked,
> it doesn't matter if a single PutS3Object processor is scheduled with
> multiple threads, or if two different PutS3Object processors on the same NiFi
> node happen to upload files with the same names to the same bucket.
> I know this is rare, but there are valid uses for sending data with the same
> name and expecting two versions to end up in the bucket. We see it when we
> use NiFi to download data from a system where we do not control the file
> names, and store those results in a versioned S3 bucket.
> For a versioned bucket, ultimately what should happen is you should end up
> with two different versions of the object, one for each upload. For a
> non-versioned bucket, whichever upload finishes last would replace the first
> object. What is happening is we end up with one corrupt object containing
> parts of both uploads regardless of the versioning.
> I think it would make sense to add the flow file's uuid into the state
> tracking so that the state for two different flow files cannot be mixed
> together. If one flow file needs to retry the upload, most of the time it
> will have the same uuid and PutS3Object can restore the state as it does now.
> If it has a new uuid, the upload will start again from the beginning with a
> fresh state, which is better than ending up with a corrupt file in the
> mentioned scenario.
> There are ways to work around this, of course, such as appending the uuid to
> the key name when uploading the file, but this is not ideal for when we don't
> control the downstream system that ingests from the S3 bucket. Likewise, we
> could also schedule PutS3Object to only run one thread, but this has
> throughput implications, and it still doesn't fix the issue of having two
> PutS3Objects uploading files with the same name at the same time.