[ https://issues.apache.org/jira/browse/CAMEL-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245555#comment-16245555 ]
Robert Half commented on CAMEL-11846:
-------------------------------------

Hi Viral,

I have a workaround for now: I wrap the input in a BufferedInputStream so that I can reset it later (no need to open the file twice). I pass the InputStream to an XMLStreamReader, which reports the encoding after reading the XML prolog. Then I set it for Camel in the Exchange.CHARSET_NAME header:

EncodingUtil.DetectedEncodingStream detectedEncodingStream = EncodingUtil.detectEncoding(inputStream, new StaxConverter().getInputFactory());
inputStream = detectedEncodingStream.inputStream;
exchange.getIn().setHeader(Exchange.CHARSET_NAME, detectedEncodingStream.encoding);

{code:java}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.slf4j.Logger;         // SLF4J assumed for Logger/LoggerFactory
import org.slf4j.LoggerFactory;

public class EncodingUtil {

    /** Holds the rewound stream together with the encoding detected for it. */
    public static class DetectedEncodingStream {
        public InputStream inputStream;
        public String encoding;

        public DetectedEncodingStream(InputStream inputStream, String encoding) {
            this.inputStream = inputStream;
            this.encoding = encoding;
        }
    }

    private static final int MAX_REWINDABLE_STREAM_BUFFER = 2*4196;
    public static final Logger LOGGER = LoggerFactory.getLogger(EncodingUtil.class);

    public static DetectedEncodingStream detectEncoding(InputStream inputStream, XMLInputFactory xmlInputFactory) {
        final BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream, MAX_REWINDABLE_STREAM_BUFFER);
        bufferedInputStream.mark(MAX_REWINDABLE_STREAM_BUFFER);
        String encoding;
        XMLStreamReader xmlStreamReader = null;
        try {
            xmlStreamReader = xmlInputFactory.createXMLStreamReader(bufferedInputStream);
            // Read the declared encoding while the reader is still open.
            encoding = xmlStreamReader.getCharacterEncodingScheme();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        } finally {
            try {
                // Rewind so the caller can read the stream from the start.
                // reset() can fail if the parser read past the mark limit.
                bufferedInputStream.reset();
            } catch (IOException e) {
                throw new RuntimeException(e);
            } finally {
                // The reader is still null if createXMLStreamReader failed.
                if (xmlStreamReader != null) {
                    try {
                        xmlStreamReader.close();
                    } catch (XMLStreamException e) {
                        throw new RuntimeException("Failed to close XMLStreamReader", e);
                    }
                }
            }
        }
        // Fall back to UTF-8 when the prolog declares no encoding.
        if (encoding == null) {
            encoding = StandardCharsets.UTF_8.name();
        }
        return new DetectedEncodingStream(bufferedInputStream, encoding);
    }
}
{code}

> xtokenize and apply xslt to a string does not work with UTF-16BE
> -----------------------------------------------------------------
>
>                 Key: CAMEL-11846
>                 URL: https://issues.apache.org/jira/browse/CAMEL-11846
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-core
>    Affects Versions: 2.17.5
>            Reporter: Robert Half
>
> In XML, the encoding is often declared inside the <?xml ..?> declaration. In
> general, you cannot read the declaration if you don't know the encoding, but
> XML parsers can detect several encodings, which allows them to read it. With
> that information they can read the whole file without knowing the "charset"
> in the first place.
> xtokenize and xslt use XMLInputFactory#createXMLStreamReader(Reader). But by
> providing a Reader, Camel tells the parser that it already knows the
> encoding, so the XML parser does not detect it.
> Camel also sets the charset to UTF-8 if none is provided in a header.
> This makes the underlying reader fail when reading UTF-16.
> Using XMLInputFactory#createXMLStreamReader(InputStream) inside
> XMLTokenExpressionIterator works (tried in a patch). But the following xslt
> step fails again because it also uses a Reader.
> See the Stack Overflow question for reference:
> [https://stackoverflow.com/questions/46322376/apache-camel-to-handle-encoding-declared-in-xml-file]
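
The point made in the issue description, that a parser given raw bytes can work out the encoding on its own, can be illustrated with a small standalone sketch. This is not Camel code: the class name StreamVsReaderDemo and the method readRootText are made up for illustration, and the sketch assumes the StAX implementation bundled with the JDK. It encodes a document as UTF-16BE without a byte order mark and passes the bytes to XMLInputFactory#createXMLStreamReader(InputStream), with no charset supplied anywhere.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of Camel.
class StreamVsReaderDemo {

    /** Parses XML from raw bytes and returns the concatenated character data. */
    static String readRootText(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // InputStream overload: the parser inspects the bytes and the
        // XML declaration to determine the encoding itself.
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            return text.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16BE\"?>"
                + "<root>gr\u00fc\u00df</root>";
        // Encode as UTF-16BE; the parser is never told the charset.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(readRootText(new ByteArrayInputStream(bytes)));
    }
}
```

Wrapping the same bytes in an InputStreamReader with a fixed UTF-8 charset before calling createXMLStreamReader(Reader) would decode them incorrectly before the parser ever sees them, which is the failure mode the issue describes.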