[ 
https://issues.apache.org/jira/browse/NIFI-12750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Villard updated NIFI-12750:
----------------------------------
    Fix Version/s: 2.0.0-M4

> ExecuteStreamCommand incorrectly decodes error stream
> -----------------------------------------------------
>
>                 Key: NIFI-12750
>                 URL: https://issues.apache.org/jira/browse/NIFI-12750
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.25.0, 2.0.0-M2
>         Environment: any
>            Reporter: René Zeidler
>            Assignee: Jim Steinebrey
>            Priority: Major
>              Labels: ExecuteStreamCommand, encoding, iso-8859-1, utf-8
>             Fix For: 2.0.0-M4, 1.27.0
>
>         Attachments: ExecuteStreamCommand_Encoding_Bug.json, encodingTest.sh, 
> image-2024-02-07-15-14-08-518.png, image-2024-02-07-15-14-54-841.png, 
> image-2024-02-07-15-20-11-684.png
>
>
> h1. Summary
> The ExecuteStreamCommand processor stores everything the invoked command 
> writes to the error stream (stderr) into the FlowFile attribute 
> {{execution.error}}.
> When converting the bytes from the stream to a String, it interprets each 
> individual byte as a Unicode codepoint. When reading only single bytes this 
> effectively results in ISO-8859-1 (Latin-1).
> Instead, it should use the system default encoding (like it already does for 
> writing stdout if Output Destination Attribute is set) or use a configurable 
> encoding (for both stdout and stderr).
> h1. Details
> When reading/writing FlowFiles, NiFi always uses raw bytes, so encoding 
> issues are the responsibility of the flow designer, and NiFi has the 
> ConvertCharacterSet processor to deal with those issues.
> When writing to attributes, the API uses Java String objects, which are 
> encoding agnostic (they represent Unicode codepoints, not bytes). Therefore, 
> processors receiving bytes have to interpret them using an encoding.
> The ExecuteStreamCommand processor writes the output of the command (stdout) 
> to the Output Destination Attribute (if set). To do that, it converts bytes 
> into a String using the system default encoding* by calling {{new String}} 
> without an encoding argument:
> [https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L499]
> When converting stderr to a String to write into the {{execution.error}} 
> attribute, it uses this weird algorithm:
> [https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L507-L517]
> It reads individual bytes from the error stream (as {{int}}s) and casts 
> them to {{char}}s. What Java does in this case is interpret the integer 
> as a Unicode code point. For single bytes, this matches the ISO-8859-1 
> encoding. Instead, it should use the same decoding method as for stdout.
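The difference between the two decoding approaches described under Details can be reproduced outside NiFi. The following is a minimal sketch, not the processor's actual code: the class and method names are made up for illustration, and the byte array stands in for what the command writes to stderr, assuming the command emits UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StderrDecodingDemo {

    // Imitates the byte-by-byte approach: every byte becomes one char.
    // InputStream.read() returns the byte as an int in 0..255, which is
    // what (b & 0xFF) yields here.
    public static String decodeByCast(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes the whole byte sequence at once with an explicit charset,
    // so multi-byte sequences are combined into single characters.
    public static String decodeWithCharset(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "ÄÖÜäöüß" written with Unicode escapes so that the encoding of
        // this source file does not matter.
        String original = "\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF";
        byte[] stderrBytes = original.getBytes(StandardCharsets.UTF_8);

        String cast = decodeByCast(stderrBytes);
        String decoded = decodeWithCharset(stderrBytes, StandardCharsets.UTF_8);

        System.out.println("cast length: " + cast.length());
        System.out.println("decoded length: " + decoded.length());
        System.out.println("cast equals original: " + cast.equals(original));
        System.out.println("decoded equals original: " + decoded.equals(original));
        System.out.println("cast equals ISO-8859-1 decoding: "
                + cast.equals(new String(stderrBytes, StandardCharsets.ISO_8859_1)));
    }
}
```

Each of the seven test characters occupies two bytes in UTF-8, so the cast variant produces a 14-character string, identical to an ISO-8859-1 decoding of the same bytes, while the charset-aware variant returns the original seven characters.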
> h1. Reproduction steps
> These steps are for a Linux environment, but can be adapted with a different 
> executable for Windows.
>  # Create the file /opt/nifi/data/encodingTest.sh (attached) with the 
> following contents and make it executable:
> {quote}{{#!/bin/bash}}
> {{echo "|out static: ÄÖÜäöüß"}}
> {{echo "|error static: ÄÖÜäöüß" >&2}}
> {{echo "|out arg: $1"}}
> {{echo "|error arg: $1" >&2}}
> {{echo "|out arg hexdump:"}}
> {{printf '%s' "$1" | od -A x -t x1z -v}}
> {{echo "|error arg hexdump:" >&2}}
> {{printf '%s' "$1" | od -A x -t x1z -v >&2}}{quote}
> The script writes identical data to both stdout and stderr. It contains 
> non-ASCII characters to make the encoding issues visible.
>  # Import the attached flow or create it manually:
> !image-2024-02-07-15-14-08-518.png|width=324,height=373!!image-2024-02-07-15-14-54-841.png|width=326,height=120!
>  # Run the GenerateFlowFile processor once and observe the attributes of the 
> FlowFile in the final queue:
> !image-2024-02-07-15-20-11-684.png|width=523,height=195!
> The output attribute (stdout) is correctly decoded. The execution.error 
> attribute (stderr) contains garbled text (UTF-8 bytes interpreted as 
> ISO-8859-1 and reencoded in UTF-8).
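Because the faulty decoding maps every byte to exactly one char in the range U+0000 to U+00FF, the garbled attribute value still carries the original bytes and can be recovered. The following is a hedged sketch of such a workaround, assuming the command wrote UTF-8 to stderr; the class and method names are hypothetical and not part of NiFi.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    // Re-encodes the garbled string as ISO-8859-1 to get the original bytes
    // back, then decodes those bytes as UTF-8. This only works when every
    // char in the input is <= U+00FF, which holds for strings built by
    // casting single bytes to chars.
    public static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of U+00C4 (0xC3 0x84) after being cast to chars one by one.
        String garbled = "\u00C3\u0084";
        String repaired = repair(garbled);
        System.out.println("repaired is U+00C4: " + repaired.equals("\u00C4"));
    }
}
```

Pure ASCII text passes through unchanged, since ASCII bytes mean the same thing in both encodings.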
> h1. *On the system default encoding
> The system default encoding is a property of the JVM. It is UTF-8 on Linux, 
> but Windows-1252 (or a different code page depending on locale) in Windows 
> environments. It can be overridden using the {{file.encoding}} JVM arg on 
> startup.
> Relying on the system default encoding is dangerous and can lead to subtle 
> bugs, like the ones I previously reported (NIFI-12669 and NIFI-12670).
> In this case, it might make sense to use the system default encoding, as it 
> concerns data passed between NiFi and another process that runs on the host 
> system. Also, the ProcessBuilder class used to create the process always 
> passes arguments in the system default encoding, and there doesn't seem to be 
> a way to change that. This behavior should probably be documented.
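To check which encoding a given JVM actually uses, and to illustrate the "configurable encoding" option mentioned in the Summary, here is a small sketch. The `resolve` helper and its fallback rule are assumptions made for illustration, not an existing NiFi property or API. The `native.encoding` system property is only present on newer JDKs, so it may print as null.

```java
import java.nio.charset.Charset;

public class DefaultCharsetReport {

    // Hypothetical helper: returns the charset named by a configuration value,
    // or the JVM default when nothing is configured.
    public static Charset resolve(String configuredName) {
        if (configuredName == null || configuredName.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        // Throws an unchecked exception if the name is illegal or unsupported.
        return Charset.forName(configuredName.trim());
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset(): " + Charset.defaultCharset());
        System.out.println("file.encoding: " + System.getProperty("file.encoding"));
        System.out.println("native.encoding: " + System.getProperty("native.encoding"));
        System.out.println("resolve(null): " + resolve(null));
        System.out.println("resolve(\"ISO-8859-1\"): " + resolve("ISO-8859-1"));
    }
}
```

Running it once normally and once with `-Dfile.encoding=ISO-8859-1` shows whether the override takes effect on the JVM in question.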



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
