[ 
https://issues.apache.org/jira/browse/NIFI-11971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serhii Nesterov updated NIFI-11971:
-----------------------------------
    Attachment: image-2023-08-21-13-21-31-091.png

> FlowFile content is corrupted across the whole NiFi instance throughout 
> ProcessSession::write with omitting writing any byte to OutputStream
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-11971
>                 URL: https://issues.apache.org/jira/browse/NIFI-11971
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.23.0, 1.23.1
>            Reporter: Serhii Nesterov
>            Assignee: Mark Payne
>            Priority: Blocker
>              Labels: corruption
>             Fix For: 1.24.0
>
>         Attachments: image-2023-08-20-19-31-16-598.png, 
> image-2023-08-20-19-37-43-772.png, image-2023-08-20-19-38-03-391.png, 
> image-2023-08-20-19-42-37-029.png, image-2023-08-20-19-43-03-697.png, 
> image-2023-08-20-20-01-50-445.png, image-2023-08-21-13-21-31-091.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> One of the scenarios for ProcessSession::write was broken after recent code 
> refactoring within the following pull request: 
> [https://github.com/apache/nifi/pull/7363/files]
> The issue is located in _StandardContentClaimWriteCache.java_, in the 
> _write(final ContentClaim claim)_ method, which returns the _OutputStream_ 
> passed to the _OutputStreamCallback_ interface so that NiFi processors can 
> write FlowFile content through the _ProcessSession::write_ method.
> If a processor calls _session.write_ but does not write any data to the 
> output stream, then none of the write methods of the _OutputStream_ is 
> invoked, so the length of the content claim is never recomputed and keeps 
> its default value of *-1*. Because the refactored code creates a new 
> content claim on each _ProcessSession::write_ invocation, the following 
> formula gives the wrong result:
> {code:java}
> previous offset + previous length = new offset.{code}
> or as in the codebase:
> {code:java}
> scc = new StandardContentClaim(scc.getResourceClaim(), scc.getOffset() + 
> scc.getLength());{code}
> For example, if the previous offset was 1000 and nothing was written to the 
> stream (length is -1), then 1000 + (-1) gives 999, so the offset is shifted 
> back by one. The next content therefore gains an extra character from the 
> previous content at the beginning and loses its last character at the end, 
> and all other FlowFiles anywhere in NiFi are corrupted by this defect until 
> the NiFi instance is restarted.
> The following steps can be taken to reproduce the issue (critical in our 
> commercial project):
>  * Create an empty text file (“a.txt”);
>  * Create a text file with any text (“b.txt”);
>  * Package these files into a .zip archive;
>  * Put it into a file system on Azure Cloud (we use ADLS Gen2);
>  * Read the zip file and unpack its content on the NiFi Canvas using the 
> _FetchAzureDataLakeStorage_ and _UnpackContent_ processors;
>  * Start a flow with the _GenerateFlowFile_ processor and check the results. 
> The empty file must be extracted before the non-empty file, otherwise the 
> issue won’t reproduce. The content of the second FlowFile will be 
> corrupted: its first character is an unreadable character from the zip 
> archive fetched with _FetchAzureDataLakeStorage_ (the last character of the 
> zip content), and its last character is lost. From this point on, NiFi 
> cannot be used at all, because any other processor will corrupt FlowFile 
> content across the entire NiFi instance due to the shifted offset.
> A sample canvas:
> !image-2023-08-20-19-31-16-598.png|width=969,height=492!
>  
> Important note: the issue is not reproducible if the empty file is the last 
> file to be extracted (the length will be reset when the processor 
> completes), or if you do not call _session.write()_ when a file has 0 bytes 
> (for example, in a custom processor written with such logic).
> The offsets for the above picture look as follows (#1 - after 
> fetching and unpacking an empty file, #2 - before unpacking the second file): 
>  !image-2023-08-20-19-37-43-772.png|width=961,height=32!
> !image-2023-08-20-19-38-03-391.png|width=960,height=35!
> 1524 - after FetchAzureDataLakeStorage and UnpackContent for the empty file. 
> The length *-1* is kept instead of *0* and used for the next file, which 
> is why the next offset is equal to 1523 (*1524 + (-1) = 1523*).
> If your file contains the text "Hello world", then after downloading the 
> unpacked file from NiFi you'll see (the first character here is a space):
> !image-2023-08-20-20-01-50-445.png!
> Different processors will report various errors because of the corrupted 
> content, especially for the JSON format and queries:
> !image-2023-08-20-19-42-37-029.png!
> !image-2023-08-20-19-43-03-697.png!
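
The offset arithmetic from the description above can be illustrated with a small standalone sketch. It does not use any NiFi classes: the class name, the method names, and the byte array standing in for a shared resource claim are all hypothetical, and the guarded variant only shows one possible way to avoid the shift (treating an unset length of -1 as 0 bytes); it is not necessarily how the issue is fixed in NiFi.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in NiFi.
public class ClaimOffsetSketch {

    // Default length of a content claim that was never written to, per the report.
    static final long UNSET_LENGTH = -1L;

    // Mirrors the arithmetic quoted in the report: previous offset + previous length.
    static long nextOffsetAsReported(long previousOffset, long previousLength) {
        return previousOffset + previousLength;
    }

    // Assumed guard: treat an unset (negative) length as zero bytes written.
    static long nextOffsetGuarded(long previousOffset, long previousLength) {
        return previousOffset + Math.max(0L, previousLength);
    }

    // Reads `length` bytes starting at `offset` from a buffer that stands in
    // for the shared resource claim.
    static String read(byte[] resource, long offset, int length) {
        return new String(resource, (int) offset, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "ZIP!" stands in for the fetched archive bytes; "Hello world" is the second file.
        byte[] resource = "ZIP!Hello world".getBytes(StandardCharsets.US_ASCII);
        long emptyClaimOffset = 4L;                 // the empty file's claim starts after "ZIP!"
        int secondLength = "Hello world".length();  // 11 bytes

        long shifted = nextOffsetAsReported(emptyClaimOffset, UNSET_LENGTH);
        long guarded = nextOffsetGuarded(emptyClaimOffset, UNSET_LENGTH);

        // With the shifted offset the content starts one byte early and ends one byte short.
        System.out.println("[" + read(resource, shifted, secondLength) + "]");
        System.out.println("[" + read(resource, guarded, secondLength) + "]");
    }
}
```

With the values from the report, nextOffsetAsReported(1524, -1) yields 1523, matching the offsets shown in the screenshots, while the guarded variant keeps the offset at 1524.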



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
