RonnyRen opened a new issue, #5222:
URL: https://github.com/apache/hop/issues/5222

   ### Apache Hop version?
   
   2.12
   
   ### Java version?
   
   18
   
   ### Operating system
   
   Windows
   
   ### What happened?
   
   I used the "Get data from XML" transform to process a file that is encoded in Windows-1252 and contains a special character. The XML declaration in the file carries no encoding information. No matter which encoding I selected in the transform, the error below occurred; it only went away when I declared the encoding inside the XML file itself.
   Error:
   org.dom4j.DocumentException: Error on line 13 of document 
file:///C:/workspace/hop/windows-1252 : Invalid byte 1 of 1-byte UTF-8 sequence.
   
   I looked at the source code and I think I found the root cause.
   At the line linked below, the read function of SAXReader appears to be called incorrectly.
   
https://github.com/apache/hop/blob/98f86412756517e74ef1fcd5552b62a18d898e4a/plugins/transforms/xml/src/main/java/org/apache/hop/pipeline/transforms/xml/getxmldata/GetXmlData.java#L204
   According to the documentation, the second parameter is the systemId, not the encoding.
   
![Image](https://github.com/user-attachments/assets/3c8e4121-52cb-483e-835f-cc8d48a5e401)
   
   Instead, the setEncoding function should be called to specify the encoding of the input source before the read function is called.
   
   
![Image](https://github.com/user-attachments/assets/9fa2a019-f963-4fd5-bd2a-f0347d497c5b)
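
The proposed fix can be illustrated with a minimal sketch that uses only the JDK's built-in parser, so it runs without dom4j on the classpath. It assumes that SAXReader's setEncoding ends up setting the encoding on the underlying SAX `InputSource`; that assumption has not been checked against the Hop or dom4j sources here. The class and method names (`EncodingSketch`, `sampleBytes`, `parseText`, `tryParse`) are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch: not Hop code, and it uses the JDK parser instead of dom4j.
class EncodingSketch {

    // XML without an encoding declaration, stored as windows-1252 bytes.
    // In that charset the euro sign is the single byte 0x80, which is not valid UTF-8.
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\"?><price>10 \u20AC</price>";
        return xml.getBytes(Charset.forName("windows-1252"));
    }

    // Parses the bytes and returns the root element's text.
    // A null encoding leaves the parser to auto-detect, which means UTF-8 for this input.
    static String parseText(byte[] bytes, String encoding)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        if (encoding != null) {
            // The encoding is set on the input source; it is not passed as a systemId.
            source.setEncoding(encoding);
        }
        Document doc = builder.parse(source);
        return doc.getDocumentElement().getTextContent();
    }

    // Convenience wrapper: returns null instead of throwing when parsing fails.
    static String tryParse(byte[] bytes, String encoding) {
        try {
            return parseText(bytes, encoding);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = sampleBytes();
        System.out.println("auto-detected encoding parsed: " + (tryParse(bytes, null) != null));
        System.out.println("explicit windows-1252 parsed: " + (tryParse(bytes, "windows-1252") != null));
    }
}
```

Without an explicit encoding the parser treats the bytes as UTF-8 and rejects the document; with the encoding set on the input source, the same bytes are decoded as windows-1252. The JDK's default error handler may also print a fatal-error line to stderr for the failing case.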
   
   
   Please feel free to correct me if I got something wrong.
   
   Note: The "XML input stream (StAX)" transform works correctly with the specified encoding.
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: Transforms


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
