[ 
https://issues.apache.org/jira/browse/NIFI-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846339#comment-17846339
 ] 

ASF subversion and git services commented on NIFI-12669:
--------------------------------------------------------

Commit 47101f760e60f2a93e524c2adb01ebe29ebb754d in nifi's branch 
refs/heads/support/nifi-1.x from Jim Steinebrey
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=47101f760e ]

NIFI-12669 Fix EvaluateXQuery processor which incorrectly encodes result 
attributes in certain case

Signed-off-by: Matt Burgess <mattyb...@apache.org>


> EvaluateXQuery processor incorrectly encodes result attributes
> --------------------------------------------------------------
>
>                 Key: NIFI-12669
>                 URL: https://issues.apache.org/jira/browse/NIFI-12669
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Configuration, Extensions
>         Environment: JVM with non-UTF-8 default encoding (e.g. default 
> Windows installation)
>            Reporter: René Zeidler
>            Assignee: Jim Steinebrey
>            Priority: Major
>              Labels: encoding, utf8, windows, xml
>         Attachments: EvaluateXQuery_Encoding_Bug.json, 
> image-2024-01-25-10-24-17-005.png, image-2024-01-25-10-31-35-200.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h2. Environment
> This issue affects environments where the JVM default encoding is not 
> {{UTF-8}}. Standard Java installations on Windows are affected, as they 
> usually use the default encoding {{windows-1252}}. To reproduce the issue 
> on Linux, change the default encoding to {{windows-1252}} by adding the 
> following line to your {{bootstrap.conf}}:
> {quote}{{java.arg.21=-Dfile.encoding=windows-1252}}
> {quote}
> h2. Summary
> The EvaluateXQuery processor incorrectly encodes result values when storing 
> them in attributes. This causes non-ASCII characters to be garbled.
> Example:
> !image-2024-01-25-10-24-17-005.png!
> h2. Steps to reproduce
>  # Make sure NiFi runs with a non-UTF-8 default encoding, see "Environment"
>  # Create a GenerateFlowFile processor with the following content:
> {quote}{{<?xml version="1.0" encoding="UTF-8"?>}}
> {{<myRoot>}}
> {{  <myData>This text contains non-ASCII characters: ÄÖÜäöüßéèóò</myData>}}
> {{</myRoot>}}
> {quote}
>  # Connect the processor to an EvaluateXQuery processor.
> Set the {{Destination}} to {{flowfile-attribute}}.
> Create a custom property {{myData}} with value {{string(/myRoot/myData)}}.
>  # Connect the outputs of the EvaluateXQuery processor to funnels to be able 
> to observe the result in the queue.
>  # Start the EvaluateXQuery processor and run the GenerateFlowFile processor 
> once.
> The flow should look similar to this:
> !image-2024-01-25-10-31-35-200.png!
> I also attached a JSON export of the example flow.
>  # Observe the attributes of the resulting FlowFile in the queue.
> h3. Expected Result
> The FlowFile should contain an attribute {{myData}} with the value {{"This 
> text contains non-ASCII characters: ÄÖÜäöüßéèóò"}}.
> h3. Actual Result
> The attribute has the value "This text contains non-ASCII characters: 
> Ã„Ã–ÃœÃ¤Ã¶Ã¼ÃŸÃ©Ã¨Ã³Ã²".
> h2. Root Cause Analysis
> EvaluateXQuery uses the method 
> [{{formatItem}}|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/EvaluateXQuery.java#L368-L372]
>  to write the query result to an attribute. This method calls 
> {{ByteArrayOutputStream}}'s 
> [toString|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/io/ByteArrayOutputStream.html#toString()]
>  method without an encoding argument, so it decodes with the default 
> charset of the environment. Bytes are always written to this output stream 
> using UTF-8 
> ([.getBytes(StandardCharsets.UTF_8)|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/EvaluateXQuery.java#L397]).
>  When the default charset is not UTF-8, the UTF-8 bytes are interpreted 
> in a different encoding when converted to a string, resulting in 
> garbled text (see above).
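
The mismatch described in the root cause analysis can be reproduced outside
NiFi. The following is a minimal sketch, not the code from the commit: the
class name EncodingDemo and the method roundTrip are hypothetical names chosen
for illustration. It assumes the JDK provides the windows-1252 charset, which
standard JDK builds do. It writes UTF-8 bytes into a ByteArrayOutputStream and
decodes the buffer once with UTF-8 and once with windows-1252, which is what a
no-argument toString() would do on a JVM whose default charset is
windows-1252.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    /**
     * Hypothetical helper: writes the text as UTF-8 bytes into a
     * ByteArrayOutputStream and decodes the buffer with the given charset.
     */
    public static String roundTrip(String text, Charset decodeCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        out.write(utf8, 0, utf8.length);
        // Decoding with an explicit charset avoids any dependency on the
        // JVM default charset.
        return new String(out.toByteArray(), decodeCharset);
    }

    public static void main(String[] args) {
        // The sample characters from the issue, written as Unicode escapes
        // so that the source file encoding does not matter.
        String original =
                "\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF\u00E9\u00E8\u00F3\u00F2";

        // Assumption: windows-1252 is available in this JDK.
        Charset windows1252 = Charset.forName("windows-1252");

        String decodedAsUtf8 = roundTrip(original, StandardCharsets.UTF_8);
        String decodedAsCp1252 = roundTrip(original, windows1252);

        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 decode matches original: "
                + original.equals(decodedAsUtf8));
        System.out.println("windows-1252 decode matches original: "
                + original.equals(decodedAsCp1252));
    }
}
```

Decoding with the same charset that was used for writing keeps the result
independent of the file.encoding setting, so this sketch behaves the same on
Windows and Linux.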



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
