Hi Gary,

Gary Gregory <garydgreg...@gmail.com> wrote on 13/08/2012 02:27:33 PM:

> Hi Michael,
>
> I’ve not caught one in the savannah either! I've not had a customer
> request for it either, that, or the request did not make it through
> our sales engineers, professional services, or tech support all the way to me.
>
> Our products are XML and buzzword compliant and I am checking my Ps
> and Qs. So, at this point, the point is rather academic as you mention.

XML parsers are only required to support UTF-8 and UTF-16. Support for any
other encoding is icing on the cake.

> I am aware of the inefficiencies involved, but our customers can
> decide how efficient they want to be for themselves, sometimes they
> have no control over the format of the documents they have to
> process with our software. For those who can control the format, I
> do not know if someone has tried UTF-32, watched it blow up and then
> switched to something.
>
> Now, out of curiosity, I do notice a
> org.apache.xerces.impl.io.UCSReader class in Xerces which is used
> from a couple of places.
>
> Is that not hooked up in all the right spots?

It is, but if presented with a UTF-32 BOM, Xerces won't hit the code path
where the UCSReader would be used, since its encoding auto-detector doesn't
recognize UTF-32 BOM byte sequences. It's probably just defaulting to UTF-8
(since it has no better guess) and then bombing out.

Even if Xerces did support UTF-32, the UCSReader might not be the right
reader to use anyway. A compliant UTF-32 reader might require more error
checking (e.g. to reject values that are not valid characters in UTF-32,
such as the surrogate code points U+D800 to U+DFFF that UTF-16 uses in
pairs to represent supplementary characters).

> Gary

> On Mon, Aug 13, 2012 at 2:07 PM, Michael Glavassevich <mrgla...@ca.ibm.com> wrote:
> Hi Gary,
>
> There haven't been any plans for UTF-32 support. It seems you're the
> first [1] (and only) one who has asked about it on the project lists.
>
> Is this just an academic question or do you have an actual need for it?
>
> I must say I've never seen a UTF-32 encoded document in the wild. In
> my opinion it's a very inefficient encoding. Always uses 32-bits to
> represent a character when the largest Unicode code point only
> requires 21-bits. UTF-8 and UTF-16 only ever use that much space for
> supplementary characters (i.e. code points greater than U+FFFF).
>
> Thanks.
>
> [1] http://xerces-j.markmail.org/search/?q=UTF-32
>
> Michael Glavassevich
> XML Technologies and WAS Development
> IBM Toronto Lab
> E-mail: mrgla...@ca.ibm.com
> E-mail: mrgla...@apache.org
>
> Gary Gregory <garydgreg...@gmail.com> wrote on 13/08/2012 01:49:46 PM:
>
> > Hi All:
> >
> > Any plans to support UTF-32 BOM?
> >
> > Currently, if I parse a UTF-32 document I get "content not expected
> > in prolog" error.
> >
> > Thank you,
> > Gary
> >
> > --
> > E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
> > JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
> > Spring Batch in Action: http://bit.ly/bqpbCK
> > Blog: http://garygregory.wordpress.com
> > Home: http://garygregory.com/
> > Tweet! http://twitter.com/GaryGregory
>
> --
> E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
> JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
> Spring Batch in Action: http://bit.ly/bqpbCK
> Blog: http://garygregory.wordpress.com
> Home: http://garygregory.com/
> Tweet! http://twitter.com/GaryGregory

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org
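
P.S. To make the two points about UTF-32 concrete, here is a minimal Java
sketch of (a) recognizing the UTF-32 BOM byte sequences alongside the UTF-8
and UTF-16 ones, and (b) decoding big-endian UTF-32 with the kind of extra
checks a strict reader might want. The class and method names (Utf32Sketch,
detectBom, decodeUtf32BE) are hypothetical and not part of Xerces, and this
is not how the Xerces auto-detector is written; it only illustrates the idea
under those assumptions.

```java
import java.io.CharConversionException;

// Hypothetical helper for illustration only; not a Xerces class.
final class Utf32Sketch {

    /** Returns the encoding implied by a leading BOM, or null if none is recognized. */
    static String detectBom(byte[] b) {
        int n = b.length;
        // UTF-32 is tested before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE).
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00
                && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) {
            return "UTF-32BE";
        }
        if (n >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                && b[2] == 0x00 && b[3] == 0x00) {
            return "UTF-32LE";
        }
        if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (n >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (n >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    /** Decodes big-endian UTF-32 starting at offset, rejecting invalid values. */
    static String decodeUtf32BE(byte[] b, int offset) throws CharConversionException {
        if ((b.length - offset) % 4 != 0) {
            throw new CharConversionException("input length is not a multiple of 4 bytes");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = offset; i < b.length; i += 4) {
            int cp = ((b[i] & 0xFF) << 24) | ((b[i + 1] & 0xFF) << 16)
                   | ((b[i + 2] & 0xFF) << 8) | (b[i + 3] & 0xFF);
            // Reject values outside the Unicode range and the surrogate range,
            // which is only meaningful inside UTF-16.
            if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
                throw new CharConversionException("invalid UTF-32 value at byte " + i);
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }
}
```

The order of the checks matters: a detector that tests for the UTF-16LE BOM
first would misread a UTF-32LE document as UTF-16LE. Since an XML document
cannot contain U+0000, treating FF FE 00 00 as UTF-32LE is safe in this
context.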