>       Hi,
> 
>       When Java reads a stream of bytes into characters and encounters a
> character outside of the encoding (e.g. not in the ISO-8859-1 character
> set) it replaces the character with a '?'. I believe this behaviour is
> configurable, but I don't know how (you might have to register your own
> converter). By the time Xerces (or Xalan) sees the character, it's too
> late. I'm not sure where you configure it, but looking at the source code,
> it's a 'substitution mode' flag - there are methods on CharToByteConverter
> (and ByteToCharConverter if you're going the other way) to set it, but I'm
> not sure how you can set it in your case. If you set it to 'false', the
> converter will throw an exception if it encounters an unmappable byte
> sequence (or character).
> 
>       Chris

Java holds every string in memory as UTF-16, and decodes bytes with the
platform default encoding unless you tell it otherwise...

So, if your XML is actually encoded in ISO-8859-1 and there's no
encoding specified in the declaration, the parser will attempt to read
it as UTF-8 (or UTF-16 if there's a BOM).
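Here is a rough sketch of that behaviour, assuming the parser bundled with
the JDK (a standalone Xerces should behave the same way, but I haven't
checked every version). The class name XmlEncodingDemo and the helper
rootText are made up for the illustration. The same ISO-8859-1 bytes are
parsed once with the encoding declared and once without:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlEncodingDemo {

    // Hypothetical helper: returns the root element's text,
    // or null if the parser rejects the byte stream.
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Malformed input surfaces as an IOException or SAXException.
            return null;
        }
    }

    public static void main(String[] args) {
        String body = "<a>caf\u00e9</a>";

        // Encoding declared and matching the actual bytes.
        byte[] declared = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);

        // Same bytes, no declaration: the parser assumes UTF-8,
        // and the lone byte 0xE9 is not valid UTF-8.
        byte[] undeclared = body.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("declared:   " + rootText(declared));
        System.out.println("undeclared: " + rootText(undeclared));
    }
}
```

With the declaration the text comes back intact; without it the parser
should refuse the document rather than guess.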

If you then proceed to handle it in Java as strings, you will be
passing around UTF-16 encoded strings of characters...

The rules are simple: use the encoding that the document is actually
encoded in, otherwise whatever is reading the bytes will have to output
'?' for bytes it has no character for.
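For what it's worth, here is a minimal sketch of the substitution
behaviour using the public java.nio.charset API rather than the internal
converter classes mentioned above. The class name EncodingDemo and its
helper methods are invented for the example. CodingErrorAction.REPORT is,
as far as I can tell, the closest public equivalent of switching
substitution off:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Lenient decode: bad input is silently replaced.
    static String decodeLenient(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    // Strict decode: bad input raises an exception instead of being replaced.
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Hypothetical convenience wrapper: true if the bytes decode cleanly.
    static boolean isValid(byte[] bytes, Charset cs) {
        try {
            decodeStrict(bytes, cs);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // In ISO-8859-1 the e-acute is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Right encoding: round-trips cleanly.
        System.out.println(decodeLenient(latin1, StandardCharsets.ISO_8859_1));

        // Wrong encoding: 0xE9 is not valid UTF-8, so it is replaced.
        System.out.println(decodeLenient(latin1, StandardCharsets.UTF_8));

        // Strict mode reports the problem instead.
        System.out.println(isValid(latin1, StandardCharsets.UTF_8));

        // Going the other way: the euro sign has no ISO-8859-1 byte,
        // so getBytes substitutes '?' (0x3F).
        byte[] euro = "\u20ac".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((char) euro[0]);
    }
}
```

Note that on the decoding side the replacement is U+FFFD rather than a
literal '?', though many consoles will print it as '?' anyway.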

Cheers
andrew
