On Monday, 27 December 2021 at 07:12:24 UTC, rempas wrote:
I don't understand that. Based on your calculations, the
results should have been different. Also, how are the numbers
fixed? Like you said, the amount of bytes of each encoding is
not always standard for every character. Even if they were
fixed, that would mean 2 bytes for each UTF-16 character and
4 bytes for each UTF-32 character, so the numbers still don't
make sense to me. The number of the "length" property should
have been the same for every encoding, or at least for UTF-16
and UTF-32. So are the sizes of every character fixed or not?
Your string is represented by 8 code points. The number of
code units needed to store them in memory depends on the
encoding. D supports working with 3 different encodings (the
Unicode standard defines more than these 3):
string utf8s = "Hello 😂\n";
wstring utf16s = "Hello 😂\n"w;
dstring utf32s = "Hello 😂\n"d;
Here is the canonical Unicode representation of your string:

H      e      l      l      o      (space) 😂      \n
U+0048 U+0065 U+006C U+006C U+006F U+0020  U+1F602 U+000A
Let's see how these 3 variables are represented in memory:
utf8s : 48 65 6C 6C 6F 20 F0 9F 98 82 0A
        11 chars in memory, using 11 bytes
utf16s: 0048 0065 006C 006C 006F 0020 D83D DE02 000A
        9 wchars in memory, using 18 bytes (the emoji needs a surrogate pair, D83D DE02)
utf32s: 00000048 00000065 0000006C 0000006C 0000006F 00000020 0001F602 0000000A
        8 dchars in memory, using 32 bytes
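This is exactly what the "length" property reports: it counts code units,
not characters. A minimal sketch to check the numbers above (assuming a
standard D compiler with Phobos):

import std.stdio;

void main()
{
    string  utf8s  = "Hello 😂\n";
    wstring utf16s = "Hello 😂\n"w;
    dstring utf32s = "Hello 😂\n"d;

    writeln(utf8s.length);  // 11 -> code units of 1 byte each (UTF-8)
    writeln(utf16s.length); // 9  -> code units of 2 bytes each (UTF-16)
    writeln(utf32s.length); // 8  -> code units of 4 bytes each (UTF-32)
}

So "length" is not wrong for any of them; it just answers a different
question than "how many characters are in this string".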
As you can see, the most compact form is generally UTF-8, which is why
it is the preferred encoding for Unicode.
UTF-16 is supported mainly for legacy reasons: it is used in the
Windows API and also internally in Java.
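If you ever do need one of the other encodings (for example a wstring
before calling a wide Windows API), you can transcode with std.conv.to;
a small sketch, nothing Windows-specific assumed:

import std.conv : to;

void main()
{
    string  s8  = "Hello 😂\n";
    // Transcoding changes only the code-unit representation;
    // the sequence of code points stays the same.
    wstring s16 = s8.to!wstring; // 9 wchars
    dstring s32 = s8.to!dstring; // 8 dchars
}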
UTF-32 has one advantage: a 1-to-1 mapping between code points and
array indices. In practice it is not that much of an advantage,
because code points and characters are distinct concepts. UTF-32 uses
a lot of memory for practically no benefit (when you read in the forum
about D's big auto-decoding mistake, it is linked to this).
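To make the code point vs. character distinction concrete, here is a
small sketch using Phobos's std.uni.byGrapheme (a grapheme is a
user-perceived character); the accented 'e' is my own example, not
taken from your string:

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    string s = "Hello 😂\n";
    writeln(s.length);                // 11 code units (bytes)
    writeln(s.count);                 // 8 code points
    writeln(s.byGrapheme.walkLength); // 8 graphemes; they happen to match here

    // 'e' + combining acute accent: 2 code points, but 1 character on screen.
    dstring e = "e\u0301"d;
    writeln(e.length);                // 2, even in UTF-32
    writeln(e.byGrapheme.walkLength); // 1
}

So even with UTF-32, indexing by array position does not reliably give
you "the n-th character", which is why the 1-to-1 code point mapping
buys you less than it seems.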