On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented.

This is not true for UTF-8, which is not subject to "endianism".

You are correct that UTF-8 is endian-agnostic, but I don't
believe that was Sönke's point. The point is that 'ä' can be
produced in Unicode in more than one way. This program
illustrates:

import std.stdio;

void main()
{
    string a = "ä";        // precomposed form: U+00E4
    string b = "a\u0308";  // 'a' followed by U+0308 Combining Diaeresis
    writeln(a);
    writeln(b);
    writeln(cast(ubyte[])a); // raw UTF-8 code units
    writeln(cast(ubyte[])b);
}

This prints:

ä
ä
[195, 164]
[97, 204, 136]

Notice that both render as the same "character" but have different
representations. The first is the single precomposed 'ä' code
point (U+00E4), which encodes to two UTF-8 code units; the second
is the 'a' code point followed by the Combining Diaeresis code
point (U+0308), three code units in total.

In short, the string "ä" could be either 2 or 3 code units, and
either 1 or 2 code points.
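If you need to compare such strings for equality, a sketch along
these lines should work: std.uni's normalize can bring both
strings to a common normalization form (NFC composes, NFD
decomposes) before comparing.

```d
import std.uni : normalize, NFC, NFD;
import std.stdio;

void main()
{
    string a = "ä";        // precomposed form: U+00E4
    string b = "a\u0308";  // 'a' followed by U+0308 Combining Diaeresis

    // The raw byte sequences differ...
    assert(a != b);

    // ...but after normalizing to a common form they compare equal.
    assert(normalize!NFC(a) == normalize!NFC(b));
    assert(normalize!NFD(a) == normalize!NFD(b));

    writeln("equal after normalization");
}
```

NFC is usually the better choice for storage and comparison, since
it yields the shorter precomposed form where one exists.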
