On Thursday 08 April 2004 6:35 pm, Michael B Allen wrote:
> srintuar said:
> >>The W3C claims all apps should use UTF-16 internally
> >
> > Ghastly recommendation. I'd sooner see utf-16 deprecated as a
> > unicode encoding than advise it be used anywhere where its not
> > strictly mandatory for *backwards* compatibility.
> >
> > Do you have a link to this malfeasance? Perhaps Im using the
> > wrong search keys...
>
> This is probably states the definitive position for text handling:
>
> http://www.w3.org/TR/1999/WD-charmod-19991129/#Encodings

From that document, in section 3.3, 7th paragraph:

"If the unique encoding approach is adopted, the chosen encoding MUST be
such that it covers the needs of the largest possible audience, including
coverage for as many human languages as possible. In practice, this will
most likely mean that the choice will be one of the standard encodings of
ISO 10646/Unicode. If some measure of compatibility with ASCII is desired,
UTF-8 (see [RFC 2279]) is most probably the UCS encoding of choice; on the
Internet, the IETF Charset Policy [RFC 2277] specifies that "Protocols MUST
be able to use the UTF-8 charset". Another UCS encoding very worthy of
consideration, especially for APIs, is UTF-16 (see [UTF-16])."

There definitely isn't a preference for UTF-16 over UTF-8 in general.
UTF-16 seems to be recommended only for APIs, etc., where it's already
implemented (e.g. DOM in Java):

> But even though the encoding is not clearly stated as UTF-16, the
> Document Object Model (DOM) which is basically the document tree
> inside a web browser and key to all HTML and XML processing including
> JavaScript and XSLT processing *requires* the encoding be UTF-16:
>
> http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-C74D1578

From that document, near the beginning of section 1.2.1:

"The UTF-16 encoding was chosen because of its widespread industry
practice. Note that for both HTML and XML, the document character set (and
therefore the notation of numeric character references) is based on UCS
[ISO/IEC 10646]."

... and ...

"For Java and ECMAScript, DOMString is bound to the String type because
both languages also use UTF-16 as their encoding."

UTF-16 was picked for DOM because it was initially implemented in Java. In
the first document, again, they support this in section 3:

"Layer 1: Physical representation. This is necessary for APIs that expose a
physical representation of string data. Example: For the [DOM] Level 1,
UTF-16 was chosen based on current widespread implementation practice."

Anyway...

-- 
Wesley J. Landaker <[EMAIL PROTECTED]>
OpenPGP FP: 4135 2A3B 4726 ACC5 9094 0097 F0A9 8A4C 4CD6 E3D2