[ https://issues.apache.org/jira/browse/NIFI-12750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt Burgess updated NIFI-12750:
--------------------------------
    Fix Version/s: 1.26.1
                   2.0.0-M4
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> ExecuteStreamCommand incorrectly decodes error stream
> -----------------------------------------------------
>
>                 Key: NIFI-12750
>                 URL: https://issues.apache.org/jira/browse/NIFI-12750
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.25.0, 2.0.0-M2
>         Environment: any
>            Reporter: René Zeidler
>            Assignee: Jim Steinebrey
>            Priority: Major
>              Labels: ExecuteStreamCommand, encoding, iso-8859-1, utf-8
>             Fix For: 1.26.1, 2.0.0-M4
>
>         Attachments: ExecuteStreamCommand_Encoding_Bug.json, encodingTest.sh,
> image-2024-02-07-15-14-08-518.png, image-2024-02-07-15-14-54-841.png,
> image-2024-02-07-15-20-11-684.png
>
>
> h1. Summary
> The ExecuteStreamCommand processor stores everything the invoked command
> writes to the error stream (stderr) into the FlowFile attribute
> {{execution.error}}.
> When converting the bytes from the stream to a String, it interprets each
> individual byte as a Unicode codepoint. When reading only single bytes this
> effectively results in ISO-8859-1 (Latin-1).
> Instead, it should use the system default encoding (like it already does for
> writing stdout if Output Destination Attribute is set) or use a configurable
> encoding (for both stdout and stderr).
> h1. Details
> When reading/writing FlowFiles, NiFi always uses raw bytes, so encoding
> issues are the responsibility of the flow designer, and NiFi has the
> ConvertCharacterSet processor to deal with those issues.
> When writing to attributes, the API uses Java String objects, which are
> encoding agnostic (they represent Unicode codepoints, not bytes). Therefore,
> processors receiving bytes have to interpret them using an encoding.
> The ExecuteStreamCommand processor writes the output of the command (stdout)
> to the Output Destination Attribute (if set).
> To do that, it converts the bytes into a String using the system default
> encoding* by calling {{new String}} without an encoding argument:
> [https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L499]
> When converting stderr to a String to write into the {{execution.error}}
> attribute, it uses this weird algorithm:
> [https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L507-L517]
> It reads individual bytes from the error stream (as {{int}}s) and casts
> them to {{char}}s. Java interprets each such integer as a Unicode code
> point, which for single bytes matches the ISO-8859-1 encoding. Instead, it
> should use the same decoding method as for stdout.
> h1. Reproduction steps
> These steps are for a Linux environment, but can be adapted with a different
> executable for Windows.
> # Create the file /opt/nifi/data/encodingTest.sh (attached) with the
> following contents and make it executable:
> {quote}{{#!/bin/bash}}
> {{echo "|out static: ÄÖÜäöüß"}}
> {{echo "|error static: ÄÖÜäöüß" >&2}}
> {{echo "|out arg: $1"}}
> {{echo "|error arg: $1" >&2}}
> {{echo "|out arg hexdump:"}}
> {{printf '%s' "$1" | od -A x -t x1z -v}}
> {{echo "|error arg hexdump:" >&2}}
> {{printf '%s' "$1" | od -A x -t x1z -v >&2}}{quote}The script writes
> identical data to both stdout and stderr. It contains non-ASCII characters to
> make the encoding issues visible.
> # Import the attached flow or create it manually:
> !image-2024-02-07-15-14-08-518.png|width=324,height=373!!image-2024-02-07-15-14-54-841.png|width=326,height=120!
> # Run the GenerateFlowFile processor once and observe the attributes of the
> FlowFile in the final queue:
> !image-2024-02-07-15-20-11-684.png|width=523,height=195!
> The output attribute (stdout) is correctly decoded. The execution.error
> attribute (stderr) contains garbled text (UTF-8 bytes interpreted as
> ISO-8859-1 and re-encoded in UTF-8).
> h1. *On the system default encoding
> The system default encoding is a property of the JVM. It is typically UTF-8
> on Linux, but Windows-1252 (or a different code page depending on locale) in
> Windows environments. It can be overridden using the {{file.encoding}} JVM
> argument on startup.
> Relying on the system default encoding is dangerous and can lead to subtle
> bugs, like the ones I previously reported (NIFI-12669 and NIFI-12670).
> In this case, it might make sense to use the system default encoding, as it
> concerns data passed between NiFi and another process that runs on the host
> system. Also, the ProcessBuilder class used to create the process always
> passes arguments in the system default encoding, and there doesn't seem to be
> a way to change that. This behavior should probably be documented.
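
The two decoding paths described under Details can be contrasted in a small standalone sketch. This is not the NiFi source: the class and method names are hypothetical, the sample bytes stand in for what a command might write to stderr, and {{InputStream.readAllBytes}} assumes Java 9 or later. It only illustrates why casting each byte to a {{char}} behaves like ISO-8859-1 decoding, while decoding the whole byte array with an explicit charset preserves multi-byte UTF-8 characters.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of NiFi.
class StderrDecodingSketch {

    // Mimics the reported behavior: every byte read from the stream is cast
    // to a char, which is equivalent to decoding the stream as ISO-8859-1.
    static String decodeByteByByte(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // One possible alternative: read all bytes first, then decode them with
    // an explicit charset (the system default could be passed in here).
    static String decodeWithCharset(InputStream in, Charset charset) throws IOException {
        return new String(in.readAllBytes(), charset);
    }

    public static void main(String[] args) throws IOException {
        // The umlaut test string from the reproduction script, written with
        // Unicode escapes so the source file encoding does not matter.
        String expected = "\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF";
        byte[] stderrBytes = expected.getBytes(StandardCharsets.UTF_8);

        String garbled = decodeByteByByte(new ByteArrayInputStream(stderrBytes));
        String decoded = decodeWithCharset(new ByteArrayInputStream(stderrBytes),
                StandardCharsets.UTF_8);

        System.out.println("byte-by-byte matches input: " + garbled.equals(expected));
        System.out.println("charset decode matches input: " + decoded.equals(expected));
        System.out.println("JVM default charset: " + Charset.defaultCharset());
    }
}
```

Passing {{Charset.defaultCharset()}} instead of a fixed charset would correspond to the system default encoding option suggested in the Summary; a configurable processor property would be the other option mentioned there.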