On Tuesday, 20 August 2013 at 12:59:13 UTC, Andrej Mitrovic wrote:
On 8/19/13, Ramon <s...@thanks.no> wrote:
Plus UTF, too. Even UTF-8, 16 (a very practical compromise in
my mind's eye because with 16 bits one can deal with *every*
language while still not wasting memory).

UTF-8 can deal with every language as well. But perhaps you meant
something else here.

Anyway welcome aboard!

I think he meant that every "modern spoken/written" language fits in the "Basic Multilingual Plane", for which each codepoint fits in a single UTF-16 code unit (2 bytes). Multi-code-unit encodings in UTF-16 are *very* rare.

On the other hand, if you encode Japanese into UTF-8, then you'll spend *3* bytes per codepoint, ergo, "wasted memory".
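
For example, a quick sketch with D's built-in string types (string is UTF-8, wstring is UTF-16):

import std.stdio : writeln;

void main()
{
    string  s8  = "日本語";    // UTF-8:  3 bytes per codepoint for these characters
    wstring s16 = "日本語"w;   // UTF-16: 1 code unit (2 bytes) per BMP codepoint
    writeln(s8.length);   // 9 code units (9 bytes)
    writeln(s16.length);  // 3 code units (6 bytes)
}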

@ Ramon:
I think that is a fallacy:
http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16
Real-world usage is *dominated* by ASCII chars. Unless you have a very specific use case, UTF-8 will occupy *less* room than UTF-16, even if the text contains a lot of foreign characters.
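
A rough illustration: in typical mixed content the ASCII markup and identifiers dominate, so UTF-8 still comes out ahead. The byte counts below are just for this made-up snippet:

import std.conv : to;
import std.stdio : writefln;

void main()
{
    // ASCII markup around a bit of Japanese payload.
    string mixed = `<p class="note">日本語のテキスト</p>`;
    writefln("UTF-8 : %s bytes", mixed.length);                 // 44 bytes
    writefln("UTF-16: %s bytes", mixed.to!wstring.length * 2);  // 56 bytes
}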

Furthermore, UTF-8 is pretty much the "standard". If you keep your strings in UTF-16, you will probably end up regularly transcoding to UTF-8 to interface with char* functions.
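
A minimal sketch of what that looks like (the extra transcoding step and allocation at every C boundary is the point):

import core.stdc.stdio : puts;
import std.string : toStringz;
import std.utf : toUTF8;

void main()
{
    wstring w = "héllo, world"w;   // strings kept in UTF-16
    string  s = w.toUTF8();        // transcode before every C call
    puts(s.toStringz());           // char*-based API expects a UTF-8/ASCII C string
}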

Arguably, the "only" (IMO) use case for UTF-16 is interfacing with Windows' UCS-2 API. But even then, there'll still be some overhead, to make sure you don't have any dual-code-unit (surrogate pair) encodings in your streams.
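
Something along these lines (fitsInUCS2 is just a hypothetical helper name, not anything in Phobos):

import std.stdio : writeln;

// True if a UTF-16 string contains no surrogate pairs,
// i.e. it can be handed to a UCS-2-only API unchanged.
bool fitsInUCS2(wstring s)
{
    foreach (wchar u; s)                  // iterate raw UTF-16 code units
        if (u >= 0xD800 && u <= 0xDFFF)   // surrogate range
            return false;
    return true;
}

void main()
{
    writeln(fitsInUCS2("日本語"w));  // true:  BMP only
    writeln(fitsInUCS2("𝄞"w));       // false: U+1D11E needs a surrogate pair
}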
