[
https://issues.apache.org/jira/browse/NIFI-11971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Serhii Nesterov updated NIFI-11971:
-----------------------------------
Attachment: image-2023-08-20-19-42-37-029.png
> FlowFile content is corrupted across the whole NiFi instance throughout
> ProcessSession::write with omitting writing any byte to OutputStream
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-11971
> URL: https://issues.apache.org/jira/browse/NIFI-11971
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.23.0, 1.23.1
> Reporter: Serhii Nesterov
> Priority: Critical
> Attachments: image-2023-08-20-19-31-16-598.png,
> image-2023-08-20-19-37-43-772.png, image-2023-08-20-19-38-03-391.png,
> image-2023-08-20-19-42-37-029.png
>
>
> One of the scenarios for ProcessSession::write was broken after recent code
> refactoring within the following pull request:
> [https://github.com/apache/nifi/pull/7363/files]
> The issue is located in StandardContentClaimWriteCache.java in the
> write(final ContentClaim claim) method that returns an OutputStream used in
> the OutputStreamCallback interface to let NiFi processors write flowfile
> content through the ProcessSession::write method.
> If a processor calls session.write but does not write any data to the output
> stream, then none of the write methods in the OutputStream is invoked, hence
> the length of the content claim is not recomputed which means the length will
> have the default value that is equal to -1. Because of the latest refactoring
> changes that are based on creating a new content claim on each
> ProcessSession::write invocation the following formula gives the wrong result:
> previous offset + previous length = new offset.
> For example, if the previous offset was 1000 and nothing was written to the
> stream (length is -1), then 1000 + (-1) will give us 999 which means that the
> offset is shifted back by one, hence the next content will have an extra
> character from the previous content at the beginning and will lose the last
> character at the end, and all other FlowFiles anywhere in NiFi will be
> corrupted by this defect until the NiFi instance is restarted.
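
The arithmetic described above can be sketched in isolation. This is a minimal model, not the NiFi implementation: the class `OffsetDrift`, its methods, and the sample bytes are hypothetical, and the sketch assumes the data is physically appended at the true end of the shared resource while the recorded offset is computed from the unset length of -1.

```java
import java.nio.charset.StandardCharsets;

public class OffsetDrift {
    // Length recorded for a claim whose OutputStream never received a write call.
    static final long UNSET_LENGTH = -1L;

    // Formula quoted in the report: previous offset + previous length = new offset.
    static long nextOffset(long previousOffset, long previousLength) {
        return previousOffset + previousLength;
    }

    // Reads `length` bytes of the shared resource starting at the recorded offset.
    static String readClaim(byte[] resource, long offset, int length) {
        return new String(resource, (int) offset, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Example from the report: offset 1000, nothing written.
        System.out.println(nextOffset(1000L, UNSET_LENGTH)); // prints 999

        // Hypothetical shared resource: 7 bytes of archive data followed by "hello".
        byte[] resource = "ZIPDATAhello".getBytes(StandardCharsets.US_ASCII);

        long emptyOffset = nextOffset(0L, 7L);                       // empty file claim starts at 7
        long recordedOffset = nextOffset(emptyOffset, UNSET_LENGTH); // 6 instead of 7

        // "hello" really starts at 7, but the claim records offset 6 and length 5.
        System.out.println(readClaim(resource, recordedOffset, 5)); // prints Ahell
    }
}
```

The second line of output shows both symptoms at once: one stray byte from the preceding content at the front, and the final byte of the real content missing.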
> The following steps can be taken to reproduce the issue (critical in our
> commercial project):
> * Create an empty text file (“a.txt”);
> * Create a text file with any text (“b.txt”);
> * Package these files into a .zip archive;
> * Put it into a file system on Azure Cloud (we use ADLS Gen2);
> * Read the zip file and unpack its content on the NiFi Canvas using
> FetchAzureDataLakeStorage and UnpackContent processors;
> * Start a flow with the GenerateFlowFile processor. See the results. The
> empty file must be extracted before the non-empty file, otherwise the issue
> won’t reproduce. You’ll see that the second FlowFile content will be
> corrupted – the first character is an unreadable character from the zip
> archive (last character of the content with zip) fetched with
> FetchAzureDataLakeStorage and the last character will be lost. Starting from
> this point, NiFi cannot be used at all because any other processors will lead
> to FlowFile content corruption across the entire NiFi instance due to the
> shifted offset.
> A sample canvas:
> !image-2023-08-20-19-31-16-598.png|width=969,height=492!
>
> Important note: the issue is not reproducible if the empty file is the last
> file to be extracted (the length will be reset when the processor completes),
> or if you do not call session.write() when a file has 0 bytes (in case you
> create your own processor).
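
For a custom processor, the second condition in the note above amounts to a guard before the write call. The sketch below only illustrates that guard: `FakeSession` and `StreamCallback` are hypothetical stand-ins for ProcessSession and OutputStreamCallback, not the NiFi API, and skipping the call is the reporter's observation rather than a confirmed fix.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class SkipEmptyWrite {
    // Hypothetical stand-in for the callback handed to ProcessSession::write.
    interface StreamCallback {
        void process(OutputStream out) throws IOException;
    }

    // Hypothetical stand-in for a session; it only counts write() invocations.
    static class FakeSession {
        int writeCalls = 0;
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();

        void write(StreamCallback callback) {
            writeCalls++;
            try {
                callback.process(sink);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Guard: open the stream only when there is at least one byte to write.
    static void writeIfNonEmpty(FakeSession session, byte[] payload) {
        if (payload.length == 0) {
            return; // no write call, so no claim with an unset length is created here
        }
        session.write(out -> out.write(payload));
    }

    public static void main(String[] args) {
        FakeSession session = new FakeSession();
        writeIfNonEmpty(session, new byte[0]);               // skipped
        writeIfNonEmpty(session, new byte[] {'a', 'b', 'c'}); // written
        System.out.println(session.writeCalls);  // prints 1
        System.out.println(session.sink.size()); // prints 3
    }
}
```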
> The offsets for the above picture look as follows:
> !image-2023-08-20-19-37-43-772.png|width=961,height=32!
> !image-2023-08-20-19-38-03-391.png|width=960,height=35!
> 1524 is the offset after FetchAzureDataLakeStorage and UnpackContent for the
> empty file. Instead of 0, a length of -1 is kept and used for the next file,
> which is why the next offset is 1523 (1524 + (-1) = 1523).