Re: Any plans to support UTF-32 BOM?

Michael Glavassevich Mon, 13 Aug 2012 13:51:58 -0700

Hi Gary,

Gary Gregory <[email protected]> wrote on 13/08/2012 02:27:33 PM:


> Hi Michael,
> 
> I've not caught one in the savannah either! I've not had a customer 
> request for it either; that, or the request did not make it through 
> our sales engineers, professional services, or tech support all the 
> way to me.
> 
> Our products are XML and buzzword compliant and I am checking my Ps 
> and Qs. So, at this point, the point is rather academic as you mention. 

XML parsers are only required to support UTF-8 and UTF-16. Support for any 
other encodings is icing on the cake.

> I am aware of the inefficiencies involved, but our customers can 
> decide how efficient they want to be for themselves; sometimes they 
> have no control over the format of the documents they have to 
> process with our software. For those who can control the format, I 
> do not know if someone has tried UTF-32, watched it blow up and then
> switched to something else.
> 
> Now, out of curiosity, I do notice a 
> org.apache.xerces.impl.io.UCSReader class in Xerces which is used 
> from a couple of places.
> 
> Is that not hooked up in all the right spots?

It is, but if presented with a UTF-32 BOM, Xerces won't hit the code path 
where the UCSReader would be used since its encoding auto-detector doesn't 
recognize UTF-32 BOM byte sequences. It's probably just defaulting to 
UTF-8 (since it has no better guess) and then bombs out.
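
As a rough illustration of the missing detection step (not the actual Xerces auto-detector), a BOM sniffer that also knows the UTF-32 byte sequences could look like the sketch below. The class and method names are hypothetical. The BOM values themselves are standard: 00 00 FE FF for UTF-32BE and FF FE 00 00 for UTF-32LE.

```java
// Hypothetical helper, not part of Xerces: guesses an encoding from a leading BOM.
final class Utf32BomSniffer {
    private Utf32BomSniffer() {}

    /** Returns an encoding name guessed from the leading BOM, or null if no BOM matches. */
    static String detect(byte[] b) {
        // UTF-32 must be tested before UTF-16, because the UTF-32LE BOM
        // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // caller falls back to its own default, e.g. UTF-8
    }

    // Compares the first bytes of b (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Treating FF FE 00 00 as UTF-32LE assumes the input is XML: a UTF-16LE document whose first character after the BOM is U+0000 would produce the same bytes, but U+0000 is not a legal XML character.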

Assuming Xerces did support UTF-32, the UCSReader might not be the right 
reader to use anyway. A compliant UTF-32 reader might require more error 
checking (e.g. to reject invalid code points, like the values that are 
reserved for surrogates in UTF-16).
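
To make the extra checking concrete, here is a minimal sketch of a strict UTF-32 decoder. It is not the UCSReader and not Xerces code; the class name, method signature, and choice of exception are assumptions made for illustration. It rejects truncated input, values above U+10FFFF, and the surrogate range U+D800 to U+DFFF.

```java
import java.io.CharConversionException;

// Hypothetical helper, not part of Xerces: decodes UTF-32 bytes with validation.
final class StrictUtf32 {
    private StrictUtf32() {}

    /** Decodes UTF-32 bytes (without a BOM) into a String, rejecting invalid values. */
    static String decode(byte[] bytes, boolean bigEndian) throws CharConversionException {
        if (bytes.length % 4 != 0) {
            throw new CharConversionException("Input length is not a multiple of 4 bytes");
        }
        StringBuilder sb = new StringBuilder(bytes.length / 4);
        for (int i = 0; i < bytes.length; i += 4) {
            int b0 = bytes[i] & 0xFF;
            int b1 = bytes[i + 1] & 0xFF;
            int b2 = bytes[i + 2] & 0xFF;
            int b3 = bytes[i + 3] & 0xFF;
            int cp = bigEndian
                    ? (b0 << 24) | (b1 << 16) | (b2 << 8) | b3
                    : (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
            // A negative value means the top bit was set, which is also out of range.
            if (cp < 0 || cp > Character.MAX_CODE_POINT) {
                throw new CharConversionException("Value out of Unicode range at byte offset " + i);
            }
            // Surrogate code points are only meaningful in UTF-16 and are illegal in UTF-32.
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                throw new CharConversionException("Surrogate code point at byte offset " + i);
            }
            // Supplementary code points become a surrogate pair in the Java String.
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }
}
```

This sketch only enforces UTF-32 well-formedness. Checking that each character is also legal in XML would presumably remain the job of the parser itself.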

> Gary

> On Mon, Aug 13, 2012 at 2:07 PM, Michael Glavassevich 
> <[email protected]> wrote:
> Hi Gary, 
> 
> There haven't been any plans for UTF-32 support. It seems you're the
> first [1] (and only) one who has asked about it on the project lists. 
> 
> Is this just an academic question or do you have an actual need for it? 
> 
> I must say I've never seen a UTF-32 encoded document in the wild. In
> my opinion it's a very inefficient encoding. It always uses 32 bits to 
> represent a character when the largest Unicode code point only 
> requires 21 bits. UTF-8 and UTF-16 only ever use that much space for
> supplementary characters (i.e. code points greater than U+FFFF). 
> 
> Thanks. 
> 
> [1] http://xerces-j.markmail.org/search/?q=UTF-32 
> 
> Michael Glavassevich
> XML Technologies and WAS Development
> IBM Toronto Lab
> E-mail: [email protected] 
> E-mail: [email protected] 
> 
> Gary Gregory <[email protected]> wrote on 13/08/2012 01:49:46 PM:
> 
> 
> > Hi All:
> > 
> > Any plans to support UTF-32 BOM?
> > 
> > Currently, if I parse a UTF-32 document I get a "content not expected 
> > in prolog" error.
> > 
> > Thank you,
> > Gary
> > 
> > -- 
> > E-Mail: [email protected] | [email protected] 
> > JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
> > Spring Batch in Action: http://bit.ly/bqpbCK
> > Blog: http://garygregory.wordpress.com 
> > Home: http://garygregory.com/
> > Tweet! http://twitter.com/GaryGregory
> 
> 
> 
> -- 
> E-Mail: [email protected] | [email protected] 
> JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
> Spring Batch in Action: http://bit.ly/bqpbCK
> Blog: http://garygregory.wordpress.com 
> Home: http://garygregory.com/
> Tweet! http://twitter.com/GaryGregory

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: [email protected]
E-mail: [email protected]
