Markus isn't on es-discuss, so forwarding....

---------- Forwarded message ----------
From: Markus Scherer <markus....@gmail.com>
Date: Wed, May 18, 2011 at 22:18
Subject: Re: Full Unicode strings strawman
To: Allen Wirfs-Brock <al...@wirfs-brock.com>
Cc: Shawn Steele <shawn.ste...@microsoft.com>, Mark Davis ☕ <m...@macchiato.com>, "es-discuss@mozilla.org" <es-discuss@mozilla.org>
On Mon, May 16, 2011 at 5:07 PM, Allen Wirfs-Brock <al...@wirfs-brock.com> wrote:

> I agree that application writers will continue for the foreseeable future
> to have to know whether or not they are dealing with UTF-16 encoded data
> and/or communicating with other subsystems that expect such data. However,
> core language support for UTF-32 is a prerequisite for ever moving beyond
> UTF-16 APIs and libraries and getting back to uniform sized character
> processing.

This seems to be based on a misunderstanding. Fixed-width encodings are nice but not required. The majority of Unicode-aware code uses either UTF-8 or UTF-16, and supports the full Unicode code point range without too much trouble.

Even with UTF-32 you get "user characters" that require sequences of two or more code points (e.g., base character + diacritic, Han character + variation selector), and there is not always a composite character for such a sequence.

Windows NT uses 16-bit Unicode, started BMP-only, and has supported the full Unicode range since Windows 2000.

MacOS X uses 16-bit Unicode (coming from NeXT) and supports the full Unicode range (ever since MacOS X 10.0, I believe). Lower-level MacOS APIs use UTF-8 char* and support the full Unicode range.

ICU uses 16-bit Unicode, started BMP-only, and has supported the full range in most services since the year 2000.

Java uses 16-bit Unicode, started BMP-only, and has supported the full range since Java 5.

KDE uses 16-bit Unicode, started BMP-only, and has supported the full range for years.

Gnome uses UTF-8 and supports the full range.

JavaScript uses 16-bit Unicode and is still BMP-only, although most implementations input and render the full range. Updating its spec and implementations to upgrade compatibly, like everyone else has done, seems like the best option.
In a programming language like JavaScript that is heavy on string processing, and interfaces with the UTF-16 DOM and UTF-16 client OSes, a UTF-32 string model might be more trouble than it's worth (and possibly a performance hit).

FYI: I proposed full-Unicode support in JavaScript in 2003, a few months before the committee became practically defunct for a while.
https://sites.google.com/site/markusicu/unicode/es/unicode-2003
https://sites.google.com/site/markusicu/unicode/es/i18n-2003

Best regards,
markus (Google/ICU/Unicode)
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss