> Hi,
>
> When Java reads a stream of bytes into characters and encounters a
> character outside of the encoding (e.g. not in the ISO-8859-1 character
> set) it replaces the character with a '?'. I believe this behaviour is
> configurable, but I don't know how (you might have to register your own
> converter). By the time Xerces (or Xalan) sees the character, it's too
> late. I'm not sure where you configure it, but looking at the source code,
> it's a 'substitution mode' flag - there are methods on CharToByteConverter
> (and ByteToCharConverter if you're going the other way) to set it, but I'm
> not sure how you can set it in your case. If you set it to 'false', the
> converter will throw an exception if it encounters an unmappable byte
> sequence (or character).
>
> Chris
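
For what it's worth, the converter classes mentioned above are internal sun.io ones. As far as I know the supported way to get the "throw instead of substitute" behaviour is java.nio.charset (Java 1.4 and later), where CodingErrorAction.REPORT plays the role of substitution mode 'false'. A rough sketch only; the class and method names (StrictDecode, decodesCleanly) are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces or Xalan.
class StrictDecode {
    // Returns true if the bytes decode in the given charset without any
    // malformed or unmappable sequence; REPORT makes the decoder throw
    // instead of silently substituting a replacement character.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the last byte is not valid UTF-8.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

This only tells you the bytes are wrong for the charset you guessed; it doesn't help once the parser has already decoded them.
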
Java handles all text internally as UTF-16, whatever encoding it was read from... so, if your XML is actually encoded in ISO-8859-1 and there's no encoding specified in the declaration, the parser will attempt to read it as UTF-8 (or UTF-16 if there's a BOM). If you then proceed to handle it in Java as strings, you will be passing around UTF-16 encoded strings of characters...

The rule is simple: declare the encoding that the document is actually encoded in, otherwise whatever is reading the bytes will have to output '?' for bytes it has no character for.

Cheers
andrew
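
P.S. A small sketch of the same point outside the parser, assuming the bytes really are ISO-8859-1. The class and method names (EncodingDemo, read) are invented for the example. Note that the standard decoders substitute U+FFFD for bytes they can't decode; that usually ends up as '?' once it's written back out in an encoding that doesn't have it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class EncodingDemo {
    // Reads every character from the bytes, decoding with the given charset.
    static String read(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        // Right encoding: the last character comes back as U+00E9.
        System.out.println(read(latin1, StandardCharsets.ISO_8859_1));
        // Wrong encoding: the last byte is not valid UTF-8 and gets substituted.
        System.out.println(read(latin1, StandardCharsets.UTF_8));
    }
}
```
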
