First off, I'll have to admit I'm using a really old version of Xerces, but I'm noticing something a little peculiar in how it handles data encodings.
The program someone here wrote basically sucks files in various encodings into Java Strings and then runs them through Xerces using a StringReader wrapped in an InputSource. Reading the bytes from the file converts them from whatever encoding they were in to UCS-2 chars using the platform default encoding, which is Latin-1. Now, if the input is *actually* UTF-8, each multi-byte sequence gets broken up and every byte is treated as an individual character, which is bad.

My questions are:

1) How is Xerces working with String input at all? Most of these documents contain the <?xml encoding="iso-8859-1"?> line at the top, which should be gating how it looks at them, but by the time the document is in a String, all of it, including the declaration line, is *actually* in UCS-2. Does Xerces try to be flexible internally when differentiating between a byte and a char? Does it essentially try to equate them?

2) If the answer to #1 is yes, would I get around the problem by adding <?xml encoding="utf-8"?> explicitly to the documents, engaging this flexibility?

3) Without an explicit encoding declaration, does Xerces default to interpreting the input as UCS-2 rather than UTF-8?

thanks
-mark
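
P.S. To make the problem concrete, here is a minimal sketch of what I believe is going on. It is not the actual program: the class and method names (EncodingDemo, viaString, viaBytes) are made up for illustration, and it assumes a modern JDK (Java 7+) with the JAXP parser bundled in it rather than our old standalone Xerces, so behaviour there may differ. As far as I can tell from the SAX InputSource documentation, when the parser is given a character stream it reads the chars directly and disregards the encoding declaration, whereas with a byte stream it decodes the bytes itself.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative, not from the real program.
class EncodingDemo {

    // Parse the source and return the text of the root element.
    static String parseText(InputSource src) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(src);
        return doc.getDocumentElement().getTextContent();
    }

    // What the existing program effectively does: decode the file bytes as
    // Latin-1 into a String, then hand the parser a StringReader.
    static String viaString(byte[] fileBytes) throws Exception {
        String text = new String(fileBytes, StandardCharsets.ISO_8859_1);
        return parseText(new InputSource(new StringReader(text)));
    }

    // Alternative: hand the parser the raw bytes and let it decode them
    // according to the XML declaration.
    static String viaBytes(byte[] fileBytes) throws Exception {
        return parseText(new InputSource(new ByteArrayInputStream(fileBytes)));
    }

    public static void main(String[] args) throws Exception {
        // A UTF-8 document whose content is a single e-acute (U+00E9),
        // which UTF-8 encodes as the two bytes 0xC3 0xA9.
        byte[] fileBytes =
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><a>\u00e9</a>"
                .getBytes(StandardCharsets.UTF_8);

        // Each byte became its own char: U+00C3 followed by U+00A9.
        System.out.println(viaString(fileBytes).length()); // prints 2

        // The parser decoded the two bytes back into one char, U+00E9.
        System.out.println(viaBytes(fileBytes).length());  // prints 1
    }
}
```

If that holds for our Xerces version too, then editing the declaration would not help as long as the input goes through a String first; passing the bytes (for example a FileInputStream) to the InputSource looks like the safer route.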
