First off, I'll have to admit I'm using a really old version of Xerces, but
I'm noticing something a little peculiar in how it handles character
encodings...

The program someone here wrote basically sucks files in various encodings
into Java Strings and then runs them through Xerces using a StringReader
wrapped in an InputSource.  The process of sucking the bytes in from the
file converts them to UCS-2 chars using the platform default encoding,
which is Latin-1 here, regardless of what encoding the file was actually in.
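
To make the setup concrete, here is a minimal sketch of what that code path
amounts to. It is not the actual program: the class and method names are
made up, Latin-1 is hard-coded to stand in for our default, and it uses the
JAXP parser bundled with the JDK rather than our old Xerces build.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StringParseSketch {
    // Hypothetical helper: decode raw file bytes the way the program
    // effectively does. ISO_8859_1 stands in for the platform default.
    static String slurp(byte[] fileBytes) {
        return new String(fileBytes, StandardCharsets.ISO_8859_1);
    }

    // Hand the already-decoded text to the parser as a character stream.
    static Document parse(String xml) throws Exception {
        InputSource source = new InputSource(new StringReader(xml));
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source);
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: a small document that really is Latin-1.
        byte[] fileBytes =
                "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><doc>caf\u00e9</doc>"
                        .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(slurp(fileBytes));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```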

Now, if the input is *actually* UTF-8, each multi-byte sequence gets broken
up and every byte is treated as an individual character, which is bad.
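
A small illustration of that breakage, as a sketch only: the class and
method names are made up, and a single accented character stands in for
real file content.

```java
import java.nio.charset.StandardCharsets;

class MojibakeSketch {
    // Decode bytes as Latin-1 regardless of what they really are.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // one char: e with an acute accent
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9
        String broken = decodeAsLatin1(utf8);

        System.out.println(utf8.length);     // prints 2
        System.out.println(broken.length()); // prints 2
        // Each byte became its own char: U+00C3 and U+00A9.
        System.out.println(Integer.toHexString(broken.charAt(0)) + " "
                + Integer.toHexString(broken.charAt(1))); // prints c3 a9
    }
}
```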

My questions are:
1) How is Xerces working with String input at all?  Most of these documents
contain the <?xml encoding="iso-8859-1"?> line at the top, which should be
gating how it looks at them, but by the time it's in a String, all of the
document, including the declaration line, is *actually* in UCS-2.  Does
Xerces try to be flexible internally when differentiating between a byte
and a char?  Does it try to equate them, essentially?

2) If the answer to #1 is yes, would I get around the problem by adding
<?xml encoding="utf-8"?> explicitly to the documents, engaging this
flexibility?

3) Without an explicit encoding declaration, does Xerces fall back to UCS-2
as the default interpretation, rather than UTF-8?
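
One way to probe questions 1 and 3 would be an experiment along these
lines. This is only a sketch under assumptions: the class and method names
are made up, and it runs against the JAXP parser bundled with the JDK, so
our old Xerces build may well behave differently. It feeds the parser a
String containing a character that iso-8859-1 cannot represent, once with
the declaration and once without, and checks whether the character comes
back untouched.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class DeclarationProbe {
    // Parse text through a character stream and return the root element's text.
    static String rootText(String xml) throws Exception {
        InputSource source = new InputSource(new StringReader(xml));
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source)
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String euro = "\u20ac"; // euro sign, not representable in iso-8859-1
        String declared =
                "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><doc>" + euro + "</doc>";
        String undeclared = "<doc>" + euro + "</doc>";

        // "true" means the chars passed through the parser unchanged,
        // i.e. the declaration did not cause any reinterpretation.
        System.out.println(euro.equals(rootText(declared)));
        System.out.println(euro.equals(rootText(undeclared)));
    }
}
```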

thanks
-mark


