Just to document this for others with the same problem: when the body is 
passed as a byte array, the bytes are correct.

    public String detectEncodingByBom(@Body byte[] body) {
        byte[] firstThreeBytes = Arrays.copyOfRange(body, 0, 3);
        log.debug("3 Bytes as Hex: " + Hex.encodeHexString(firstThreeBytes));
        ...
    }

The log output for a UTF-16LE file is "fffe3c". This is the correct BOM (FF FE), 
followed by the first byte (3C) of the first character "<".

Stephan
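
Building on the byte-array approach above, here is a minimal sketch of how the 
BOM check and the conversion to UTF-8 without BOM could look. The class name 
BomSniffer and its methods are hypothetical and not part of Camel. It assumes 
that only UTF-8 and UTF-16 BOMs occur (UTF-32 is ignored) and that a body 
without a BOM is UTF-8, which matches the encodings listed in the original 
question but may not hold for other inputs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    /** Detection result: the charset to decode with and the BOM length to skip. */
    public static final class Detected {
        public final Charset charset;
        public final int bomLength;

        Detected(Charset charset, int bomLength) {
            this.charset = charset;
            this.bomLength = bomLength;
        }
    }

    /** Inspects the leading bytes for a UTF-8 or UTF-16 BOM. */
    public static Detected detect(byte[] body) {
        if (startsWith(body, 0xEF, 0xBB, 0xBF)) {
            return new Detected(StandardCharsets.UTF_8, 3);
        }
        if (startsWith(body, 0xFE, 0xFF)) {
            return new Detected(StandardCharsets.UTF_16BE, 2);
        }
        if (startsWith(body, 0xFF, 0xFE)) {
            return new Detected(StandardCharsets.UTF_16LE, 2);
        }
        // Assumption: no BOM means UTF-8 (the "UTF-8 without BOM" case).
        return new Detected(StandardCharsets.UTF_8, 0);
    }

    /** Decodes the body with the detected charset and re-encodes it as UTF-8 without BOM. */
    public static byte[] toUtf8WithoutBom(byte[] body) {
        Detected d = detect(body);
        String text = new String(body, d.bomLength, body.length - d.bomLength, d.charset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Compares unsigned byte values, because Java bytes are signed.
    private static boolean startsWith(byte[] body, int... prefix) {
        if (body.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((body[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A bean method that receives the body as byte[] could return the result of 
toUtf8WithoutBom(body) as the new message body.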


-----Original Message-----
From: Burkard Stephan 
Sent: Thursday, 4 May 2017 16:08
To: 'users@camel.apache.org'
Subject: RE: Charset on file poller endpoint

Yes, a Bean is probably the best way to do the work. 

However, I tried injecting the exchange, getting the body as an InputStream, 
and reading the first 4 bytes from it (an InputStream is a raw byte 
representation and therefore should not have been decoded yet). When I read a 
file that is UTF-16 (big endian) encoded, I get the output "Hex: efbfbdef":

    public void determineEncoding(Exchange exchange) throws Exception {
        InputStream is = exchange.getIn().getBody(InputStream.class);
        DataInputStream dis = new DataInputStream(is);
        int fourBytes = dis.readInt();
        String hex = Integer.toHexString(fourBytes);
        log.info("Hex: " + hex);
    }

But when I read the file directly, I get the output "Hex: feff003c"

    public void testUtf16BeBom() throws Exception {
        InputStream utf16FileStream = this.getClass().getClassLoader()
                .getResourceAsStream("testfiles/XmlUtf16Be.xml");
        DataInputStream dis = new DataInputStream(utf16FileStream);
        int fourBytes = dis.readInt();
        String hex = Integer.toHexString(fourBytes);
        log.info("Hex: " + hex);
    }

The output of the direct read is correct since "feff" is the UTF-16 BE BOM, 
followed by "003c" which is the first character "<" in a 2-byte representation. 

Any idea why the output through the Camel route/Bean is wrong? Is it because 
the body has already been decoded (with the wrong encoding)?

Thanks
Stephan
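
One plausible explanation for the "efbfbdef" output, not confirmed in this 
thread: EF BF BD is the UTF-8 encoding of the replacement character U+FFFD. If 
the raw bytes were decoded to a String as UTF-8 somewhere before the bean and 
then encoded again, the BOM bytes FE and FF (which are never valid in UTF-8) 
would each turn into EF BF BD. The sketch below reproduces that effect in plain 
Java, without Camel; the class name LossyDecodeDemo is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class LossyDecodeDemo {

    /** Formats bytes as lowercase hex, two digits per byte. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes raw bytes as UTF-8 and encodes the resulting String as UTF-8 again. */
    public static byte[] roundTripAsUtf8(byte[] raw) {
        // new String(...) silently replaces malformed input with U+FFFD.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First four bytes of a UTF-16BE file starting with "<": BOM FE FF, then 00 3C.
        byte[] utf16Be = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x3C};
        // The first four bytes of the result are ef bf bd ef, as in the log above.
        System.out.println(hex(roundTripAsUtf8(utf16Be)));
    }
}
```

This would also explain why requesting the body as byte[] gives the correct 
bytes: no String conversion takes place on that path.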
  

-----Original Message-----
From: souciance [mailto:souciance.eqdam.ras...@gmail.com]
Sent: Thursday, 4 May 2017 12:13
To: users@camel.apache.org
Subject: Re: Charset on file poller endpoint

Probably the easiest way is to read the file and send the exchange to a bean.
In the bean, read the body, determine the encoding and check whether it has a 
BOM. Finally, do your conversion and put the body back on the exchange.

from("file:/myDir")
    .bean(DetermineEncoding.class, "determineEncoding")
    .to("activemq:queue:myQueue");

On Thu, May 4, 2017 at 12:01 PM, Burkard Stephan [via Camel] <
ml+s465427n5798625...@n5.nabble.com> wrote:

> Hi Camel users
>
> I read files with a Camel file poller and they can have different 
> encodings (UTF-8 with or without BOM, UTF-16). Therefore I would like 
> to determine the given encoding and convert the message body to UTF-8 
> without BOM for the further processing.
>
> How can I do this, and what exactly ends up in the message payload 
> in the exchange? Is the payload an InputStream (just bytes, no
> encoding) or has it already been converted to a String or a Reader 
> (already decoded)?
>
> And what does the "charset" option change? Does it overwrite the 
> default encoding of the operating system?
>
> from(file:/myDir)
> // can I read here the first bytes of the file?
> .to(activemq:queue:myQueue)
>
> Thanks for any hints
> Stephan