Re: W3C and UTF-16

Wesley J Landaker Thu, 08 Apr 2004 17:56:11 -0700

On Thursday 08 April 2004 6:35 pm, Michael B Allen wrote:
> srintuar said:
> >>The W3C claims all apps should use UTF-16 internally
> >
> > Ghastly recommendation. I'd sooner see utf-16 deprecated as a
> > unicode encoding than advise it be used anywhere where it's not
> > strictly mandatory for *backwards* compatibility.
> >
> > Do you have a link to this malfeasance? Perhaps I'm using the
> > wrong search keys...
>
> This probably states the definitive position for text handling:
>
> http://www.w3.org/TR/1999/WD-charmod-19991129/#Encodings


From that document, in section 3.3, 7th paragraph:

"If the unique encoding approach is adopted, the chosen encoding MUST be 
such that it covers the needs of the largest possible audience, 
including coverage for as many human languages as possible. In 
practice, this will most likely mean that the choice will be one of the 
standard encodings of ISO 10646/Unicode. If some measure of 
compatibility with ASCII is desired, UTF-8 (see [RFC 2279]) is most 
probably the UCS encoding of choice; on the Internet, the IETF Charset 
Policy [RFC 2277] specifies that "Protocols MUST be able to use the 
UTF-8 charset". Another UCS encoding very worthy of consideration, 
especially for APIs, is UTF-16 (see [UTF-16]). "
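
To make the "compatibility with ASCII" point concrete, here is a small 
sketch of my own (not from the W3C text). The sample string and the 
variable names are just assumptions for illustration:

```python
# Compare how UTF-8 and UTF-16 encode plain ASCII text.
text = "charset"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark

# UTF-8 leaves pure ASCII text byte-for-byte unchanged.
ascii_compatible = utf8 == text.encode("ascii")

# UTF-16 spends two bytes per character here, one of them a zero byte,
# so ASCII-only tools cannot read the result as-is.
print(ascii_compatible)
print(len(utf8), len(utf16))
```

For ASCII-only input the UTF-8 bytes are identical to the ASCII bytes, 
while the UTF-16 form is twice as long and contains NUL bytes.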

There definitely isn't a preference for UTF-16 over UTF-8 in general. 
UTF-16 seems to be recommended only for APIs, etc., where it's already 
implemented (e.g. DOM in Java):

> But even though the encoding is not clearly stated as UTF-16, the
> Document Object Model (DOM) which is basically the document tree
> inside a web browser and key to all HTML and XML processing including
> JavaScript and XSLT processing *requires* the encoding be UTF-16:
>
> http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-C74D1578

From that document, near the beginning of section 1.2.1:

"The UTF-16 encoding was chosen because of its widespread industry 
practice. Note that for both HTML and XML, the document character set 
(and therefore the notation of numeric character references) is based 
on UCS [ISO/IEC 10646]. "

... and ...

"For Java and ECMAScript, DOMString is bound to the String type because 
both languages also use UTF-16 as their encoding."
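
One practical consequence of binding DOMString to 16-bit units, as a 
rough sketch of my own (Python is used only for illustration, and 
utf16_code_units is a hypothetical helper, not part of any DOM API): a 
character outside the Basic Multilingual Plane takes two UTF-16 units, 
so a length counted in 16-bit units can differ from the number of 
characters.

```python
def utf16_code_units(s: str) -> int:
    """Return how many 16-bit units are needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2

bmp = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside it

# Python's len() counts code points; the helper counts UTF-16 units.
print(len(bmp), utf16_code_units(bmp))
print(len(astral), utf16_code_units(astral))

# The astral character is stored as a surrogate pair.
print(astral.encode("utf-16-be").hex())
```

The first character needs one unit and the second needs two, which is 
the kind of detail an API exposing UTF-16 passes on to its callers.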

UTF-16 was picked for DOM because it was initially implemented in Java. 
In the first document, again, they support this in section 3:

"Layer 1: Physical representation. This is necessary for APIs that 
expose a physical representation of string data. Example: For the [DOM] 
Level 1, UTF-16 was chosen based on current widespread implementation 
practice."

Anyway...

-- 
Wesley J. Landaker <[EMAIL PROTECTED]>
OpenPGP FP: 4135 2A3B 4726 ACC5 9094  0097 F0A9 8A4C 4CD6 E3D2
