spir wrote:
On Thu, 11 Nov 2010 09:40:05 -0800
Andrei Alexandrescu <seewebsiteforem...@erdani.org> wrote:
string substring(string s, size_t beg, size_t end) // "logical slice" - from code point number beg to code point number end
That's not implemented and I don't think it would be useful. Usually
when I want a substring, the calculations up to that point indicate the
code _unit_ I'm at.
Yes, but a code unit does not represent a character; it represents a Unicode "abstract
character".
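import std.stdio;  // writeln
import std.string; // indexOf
void main() {
    dstring s = "\u0061\u0302\u006d\u0065"d; // 'a', combining circumflex, 'm', 'e'
    writeln(s); // "âme"
    assert(s[0..1] == "a"); // the slice cuts the accent off its base letter
    assert(s.indexOf("â") == -1); // the single-code "â" (U+00E2) is not found
}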
A "user-perceived character" (also strangely called "grapheme" in Unicode docs) can be represented by an arbitrary number of code
_units_ (up to 8 in their test data, but there is no actual limit). What a code unit represents is, say, a "scripting mark". In
"â", there are 2 of them. For legacy reasons, UCS also includes "precomposed characters", so that "â" can also be
represented by a single code. But the above form is valid; it is even arguably the base form for "â" (and most composite chars
cannot be represented by a single code).
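
To make the two spellings of "â" concrete, here is a minimal sketch. It assumes a Phobos version whose std.uni provides normalize with the NFC and NFD forms (that module is newer than this thread), and the variable names are only illustrative. It shows that the decomposed and the single-code forms differ as code point sequences but compare equal once both are normalized.

```d
import std.stdio : writeln;
import std.string : indexOf;
import std.uni : normalize, NFC, NFD;

void main() {
    // Decomposed form: 'a' followed by U+0302 COMBINING CIRCUMFLEX ACCENT.
    dstring decomposed = "\u0061\u0302\u006d\u0065"d;
    // Single-code form: U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX.
    dstring precomposed = "\u00e2\u006d\u0065"d;

    // Same text for a reader, different code point sequences.
    assert(decomposed != precomposed);
    assert(decomposed.length == 4);
    assert(precomposed.length == 3);

    // Normalizing to a common form makes the two spellings compare equal.
    assert(normalize!NFC(decomposed) == precomposed);
    assert(normalize!NFD(precomposed) == decomposed);

    // After NFC normalization, a search for the single-code "â" succeeds.
    assert(normalize!NFC(decomposed).indexOf("\u00e2"d) == 0);
    writeln(normalize!NFC(decomposed));
}
```

Normalizing both the haystack and the needle to the same form before comparing or searching is one way around the failed indexOf above. It does not help for composite characters that have no single-code form.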
OMG, this is worse than I thought O_O
I thought "ok, for UTF-8 one code unit is one byte and one 'real', visible
character is called a code point and consists of 1-4 code units" - but having
"user-perceived characters" that consist of multiple code units is sick.
Unicode has a way to tell if a sequence of code units (bytes) belongs together
or not, so identifying code points isn't too hard.
But is there a way to identify "graphemes"? Other than a list of rules like "a
sequence of the two code points <foo> and <bar> makes up one 'grapheme' <foobar>"?
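
On the question of identifying graphemes: Unicode Standard Annex #29 (Unicode Text Segmentation) defines grapheme cluster boundaries through rules over character properties, not through a list of specific code point pairs. The sketch below assumes a Phobos version whose std.uni provides byGrapheme and graphemeStride (neither existed when this thread was written). graphemeCount is a hypothetical helper name used only for illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, graphemeStride;

// Hypothetical helper: count user-perceived characters (grapheme clusters).
size_t graphemeCount(dstring s) {
    return s.byGrapheme.walkLength;
}

void main() {
    dstring s = "\u0061\u0302\u006d\u0065"d; // "âme" in decomposed form

    assert(s.length == 4);         // four code points
    assert(graphemeCount(s) == 3); // three graphemes: â, m, e

    // graphemeStride gives the length, in code units, of the grapheme
    // starting at the given index, so a slice can keep the accent attached.
    size_t n = graphemeStride(s, 0);
    assert(s[0 .. n] == "\u0061\u0302"d);

    // Walk the string one grapheme at a time.
    foreach (g; s.byGrapheme) {
        writeln(g.length, " code point(s)");
    }
}
```

Slicing at grapheme boundaries in this way avoids the s[0..1] problem from the earlier example, where the base letter was separated from its combining mark.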