Paul Kelly created NIFI-15969:
---------------------------------
Summary: PutS3Object can corrupt data when two files with the same
name are simultaneously uploaded with multipart
Key: NIFI-15969
URL: https://issues.apache.org/jira/browse/NIFI-15969
Project: Apache NiFi
Issue Type: Bug
Reporter: Paul Kelly
Fix For: 2.9.0
This is a very rare edge case, but we have seen it happen a handful of times
over the years. It happened again recently, and I was able to review the logs
to identify the cause.
PutS3Object keeps track of its multipart state based only on the bucket name
and object key. If you simultaneously upload two files to a bucket and those
files have the same name, their parts get mixed together and the data that
ultimately ends up in S3 is corrupt.
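A rough sketch of how that collision plays out. This is a simplified model and
not NiFi's actual code: the class name, the in-memory map, and the
"bucket/objectKey" key format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the problem described above; NOT NiFi's actual code.
class SharedMultipartStateSketch {
    // Hypothetical state store: one entry per "bucket/objectKey".
    static final Map<String, List<String>> STATE = new HashMap<>();

    // The state key ignores which flow file is being uploaded.
    static String stateKey(String bucket, String objectKey) {
        return bucket + "/" + objectKey;
    }

    // Record one uploaded part under the state entry for this bucket and key.
    static void recordPart(String bucket, String objectKey, String partLabel) {
        STATE.computeIfAbsent(stateKey(bucket, objectKey), k -> new ArrayList<>())
             .add(partLabel);
    }

    public static void main(String[] args) {
        // Two different files with the same name, uploading at the same time.
        recordPart("my-bucket", "report.csv", "fileA-part1");
        recordPart("my-bucket", "report.csv", "fileB-part1");
        recordPart("my-bucket", "report.csv", "fileA-part2");
        // Both files' parts are now tracked in a single shared list.
        System.out.println(STATE.get(stateKey("my-bucket", "report.csv")));
    }
}
```

Because both uploads resolve to the same state entry, nothing in the tracked
state distinguishes the parts of one file from the parts of the other.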
For this to happen, the files have to have the same name, use the same tracking
directory (either because they are on the same node with local storage or
because network storage is shared across nodes), be large enough that they get
uploaded with multipart, and be large enough that both uploads are in progress
at the same time. Because of how the multipart state is tracked, it doesn't
matter whether a single PutS3Object processor is scheduled with multiple
threads or two different PutS3Object processors on the same NiFi node happen to
upload files with the same names to the same bucket.
I know this is rare, but there are valid uses for sending data with the same
name and expecting two versions to end up in the bucket. We see it when we use
NiFi to download data from a system where we do not control the file names, and
store those results in a versioned S3 bucket.
For a versioned bucket, you should ultimately end up with two different
versions of the object, one for each upload. For a non-versioned bucket,
whichever upload finishes last should replace the first object. What actually
happens is that we end up with one corrupt object containing parts of both
uploads, regardless of versioning.
I think it would make sense to add the flow file's uuid into the state tracking
so that the state for two different flow files cannot be mixed together. If
one flow file needs to retry the upload, most of the time it will have the same
uuid and PutS3Object can restore the state as it does now. If it has a new
uuid, the upload will start again from the beginning with a fresh state, which
is better than ending up with a corrupt file in the mentioned scenario.
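A minimal sketch of the idea, under the same simplifying assumptions as the
issue description. The class name, the in-memory map, and the
"bucket/objectKey/uuid" key format are hypothetical and are not taken from the
PutS3Object source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration of the proposed change only; NOT NiFi's actual code.
class UuidScopedStateSketch {
    // Hypothetical state store: one entry per bucket, object key, and uuid.
    static final Map<String, List<String>> STATE = new HashMap<>();

    // Including the flow file's uuid keeps each upload's state separate.
    static String stateKey(String bucket, String objectKey, String flowFileUuid) {
        return bucket + "/" + objectKey + "/" + flowFileUuid;
    }

    // Record one uploaded part under the state entry for this flow file.
    static void recordPart(String bucket, String objectKey,
                           String flowFileUuid, String partLabel) {
        STATE.computeIfAbsent(stateKey(bucket, objectKey, flowFileUuid),
                              k -> new ArrayList<>())
             .add(partLabel);
    }

    // Parts already recorded for this flow file; empty if it has no state yet.
    static List<String> partsFor(String bucket, String objectKey, String flowFileUuid) {
        return STATE.getOrDefault(stateKey(bucket, objectKey, flowFileUuid), List.of());
    }

    public static void main(String[] args) {
        // Same bucket and name, but two different flow files.
        recordPart("my-bucket", "report.csv", "uuid-a", "part1");
        recordPart("my-bucket", "report.csv", "uuid-b", "part1");
        // A retry with the same uuid resumes from that flow file's own state.
        recordPart("my-bucket", "report.csv", "uuid-a", "part2");
        System.out.println(partsFor("my-bucket", "report.csv", "uuid-a"));
        System.out.println(partsFor("my-bucket", "report.csv", "uuid-b"));
    }
}
```

With this key, a retry that keeps its uuid finds its earlier parts, while a
flow file with a new uuid finds no state and starts from the beginning.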
There are ways to work around this, of course, such as appending the uuid to
the key name when uploading the file, but this is not ideal when we don't
control the downstream system that ingests from the S3 bucket. Likewise, we
could schedule PutS3Object to run only one thread, but this has throughput
implications, and it still doesn't fix the issue of two PutS3Object processors
uploading files with the same name at the same time.