> This will _not_ return a trailing surrogate of a Cyrillic
> letter. It will return the second code unit of the "ä"
> character (U+00E4).
True. It's UTF-8, not UTF-16.
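A quick check of that claim (a minimal sketch; the variable name is mine): U+00E4 encodes to the two UTF-8 code units 0xC3 0xA4, so indexing a D string yields those raw bytes, not UTF-16 surrogates:

```d
void main()
{
    string s = "ä";         // UTF-8: U+00E4 is encoded as 0xC3 0xA4
    assert(s.length == 2);  // .length counts code units (bytes), not characters
    assert(s[0] == 0xC3);   // lead byte
    assert(s[1] == 0xA4);   // trailing code unit, not a UTF-16 surrogate
}
```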
> However, it could also yield the first code unit of the umlaut
> diacritic, depending on how the string is represented.
This is not true for UTF-8, which has no "endianness" to worry about.
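As an aside, the "depending on how the string is represented" point is about Unicode normalization rather than byte order: "ä" can be stored precomposed (U+00E4) or as "a" plus the combining diaeresis (U+0308), and the two forms differ even in UTF-8. A sketch using literal escapes:

```d
void main()
{
    string nfc = "\u00E4";   // precomposed 'ä': 0xC3 0xA4
    string nfd = "a\u0308";  // 'a' + combining diaeresis: 0x61 0xCC 0x88
    assert(nfc.length == 2);
    assert(nfd.length == 3);
    assert(nfc != nfd);      // byte-wise different, though visually identical
}
```

Whether D (or any language) should normalize such strings is a separate question from endianness.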
> If the string were in UTF-32, [2] could yield either the
> Cyrillic character or the umlaut diacritic.
> The .length of the UTF-32 string could be either 3 or 4.
Neither is true for UTF-32. A code point has no alternative
interpretation (except for the "endianness", which a library or
the core could take care of).
> There are multiple reasons why .length and index access are
> based on code units rather than code points or any higher-level
> representation, but one is that the complexity would suddenly
> be O(n) instead of O(1).
See my last statement below.
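The O(n)-versus-O(1) distinction shows up when counting code points over a code-unit string, e.g. with std.range.walkLength, which has to decode the whole string (a sketch, assuming a reasonably recent Phobos):

```d
import std.range : walkLength;

void main()
{
    string s = "häuser";
    assert(s.length == 7);      // code units: O(1), since 'ä' takes two bytes
    assert(s.walkLength == 6);  // code points: O(n), must decode the UTF-8
}
```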
> In-place modifications of char[] arrays also wouldn't be
> possible anymore

They would be, but

> as the size of the underlying array might have to change.
Well, that's a point; on the other hand, D is constantly creating
and throwing away new strings anyway, so this isn't much of an argument.
The current solution puts the programmer in charge of dealing
with UTF-x, whereas a more consistent implementation would put the
burden on the implementors of the libraries/core, i.e. the ones
who usually have a better understanding of Unicode than the
average programmer.
Also, implementing such semantics would not per se rule out
byte-wise access, would it?
So, how do you guys handle UTF-8 strings in D? What are your
solutions to the problems described? Does it all come down to
converting "string"s and "wstring"s to "dstring"s, manipulating
them, and converting them back to "string"s? Btw, what would that
mean in terms of speed?
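For reference, the round trip I mean looks like this with std.conv.to; as far as I can tell it costs an O(n) transcode plus an allocation in each direction (a sketch):

```d
import std.conv : to;

void main()
{
    string s = "häuser";
    dstring d = to!dstring(s);   // O(n) decode + allocation
    assert(d.length == 6);       // now one element per code point
    assert(d[1] == 'ä');         // random access behaves as expected
    string back = to!string(d);  // O(n) encode + allocation
    assert(back == s);
}
```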
There is no irony in my questions. I'm really looking for
solutions...