Jon Wilson wrote:
> Jason Orendorff wrote:
> > And most (but not all) Unicode string implementations use UTF-16.
> > Among languages and libraries that are very widely used, the majority
> > is overwhelming: Java, Microsoft's CLR, Python, JavaScript, Qt,
> > Xerces-C, and on and on.  (The few counterexamples use UTF-8: glib,
> > expat.  And expat can be compiled to use UTF-16.)
> If this is true, then I would expect to find relatively little mention
> of UTF-8 compared to UTF-16 on the internet.  However, the google test
> turns up *1,040,000* for *utf-16* versus *173,000,000* for *utf-8*.
> Now, of course I realize that this is a particularly crude technique for
> determining the relative popularity of UTF-8 and UTF-16, but even a very
> crude technique does not account for this much of a discrepancy.
> 173 : 1 is quite a steep ratio.

By this reckoning, UTF-8 is more popular than Unicode, which only
gets 39,000,000 hits.  Actually, according to Google, UTF-8 is more
popular than Jesus.

Incidentally, if you don't adjust for cluefulness, UTF-16 is more often
called "Unicode".  Dreadful but true, especially in the Windows and
Java worlds.  Bottom line:  nobody else thinks about this stuff but
language designers and highly clueful library designers.

The Internet Engineering Task Force (IETF) requires all Internet
protocols to identify the encoding used for character data, and to
support UTF-8 as at least one of those encodings.

As a *transmission* format, UTF-8 is much more common than UTF-16,
for good reasons--but nowhere near as common as, say, Latin-1.  In
other words, when doing I/O, a transcoding step is usually necessary
anyway.
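
The transcoding step mentioned above can be sketched in a few lines of
Python (the sample string is illustrative): text that arrives in Latin-1
has to be transcoded into the program's internal Unicode representation,
and transcoded again to emit it as UTF-8.

```python
# Illustrative sketch: Latin-1 on the wire, Unicode internally, UTF-8 out.
latin1_bytes = "café".encode("latin-1")  # bytes as received over I/O
text = latin1_bytes.decode("latin-1")    # transcode into internal Unicode
utf8_bytes = text.encode("utf-8")        # transcode again for transmission

assert latin1_bytes == b"caf\xe9"        # one byte for the é in Latin-1
assert utf8_bytes == b"caf\xc3\xa9"      # two bytes for the é in UTF-8
```

The same string needs a different byte count in each encoding, which is
why the transcode cannot be skipped regardless of which format is native
internally.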

-j

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss