On Tuesday, 20 August 2013 at 12:59:13 UTC, Andrej Mitrovic wrote:
> On 8/19/13, Ramon <s...@thanks.no> wrote:
>> Plus UTF, too. Even UTF-8, 16 (a very practical compromise in my
>> mind's eye because with 16 bits one can deal with *every* language
>> while still not wasting memory).
>
> UTF-8 can deal with every language as well. But perhaps you meant
> something else here.
>
> Anyway welcome aboard!
I think he meant that every "modern spoken/written" language fits
in the "Basic Multilingual Plane", for which each codepoint fits
in a single UTF-16 code unit (2 bytes). Multi-code-unit
encodings in UTF-16 are *very* rare.
On the other hand, if you encode Japanese into UTF-8, then you'll
spend *3* bytes per codepoint, ergo, "wasted memory".
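For illustration, a quick D sketch comparing the two encodings on a
short Japanese string (the byte counts in the comments are for this
particular string; all three kanji sit in the BMP):

import std.stdio;

void main()
{
    string  u8  = "日本語";   // UTF-8:  3 bytes per codepoint here
    wstring u16 = "日本語"w;  // UTF-16: 1 code unit (2 bytes) per BMP codepoint

    writeln(u8.length);                 // 9 bytes
    writeln(u16.length * wchar.sizeof); // 6 bytes
}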
@ Ramon:
I think that is a fallacy:
http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16
Real-world usage is *dominated* by ASCII chars. Unless you have a
very specific use case, UTF-8 will occupy *less* room than
UTF-16, even if the text contains a lot of foreign characters.
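A quick sketch of why, again in D (the string below is made up, but the
ASCII markup around the kanji is what dominates real-world data):

import std.stdio;

void main()
{
    // Mostly-ASCII text with a few foreign characters, as is typical:
    // markup, identifiers and whitespace are all ASCII.
    string  u8  = "<p>Hello, 日本語!</p>";
    wstring u16 = "<p>Hello, 日本語!</p>"w;

    writeln(u8.length);                 // 24 bytes (15 ASCII + 3*3 kanji)
    writeln(u16.length * wchar.sizeof); // 36 bytes (18 code units * 2)
}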
Furthermore, UTF-8 is pretty much the "standard". If you keep
UTF-16, you will probably end up regularly transcoding to UTF-8
to interface with char* functions.
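E.g. (a minimal sketch: std.utf.toUTF8 does the transcoding,
std.string.toStringz adds the NUL terminator):

import std.string : toStringz;
import std.utf : toUTF8;
import core.stdc.stdio : puts;

void main()
{
    wstring w = "hello"w;  // your data, kept as UTF-16
    // Every call across the char* boundary pays for a transcode + copy:
    puts(w.toUTF8().toStringz());
}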
Arguably, the "only" (IMO) use case for UTF-16 is interfacing
with Windows' UCS-2 API. But even then, there'll still be some
overhead, to make sure you don't have any dual-code-unit
(surrogate pair) encodings in your streams.
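Something along these lines (fitsInUCS2 is a hypothetical helper, not a
Phobos function; it scans for surrogate code units, which UCS-2 cannot
represent):

import std.stdio;

// Hypothetical helper: reject any wstring containing surrogate code
// units, i.e. codepoints outside the BMP that UCS-2 can't represent.
bool fitsInUCS2(wstring s)
{
    foreach (wchar c; s)                // iterate raw UTF-16 code units
        if (c >= 0xD800 && c <= 0xDFFF)
            return false;               // half of a surrogate pair
    return true;
}

void main()
{
    writeln(fitsInUCS2("日本語"w));     // true: all BMP
    writeln(fitsInUCS2("\U0001D11E"w)); // false: needs a surrogate pair
}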