spir wrote:
On Thu, 11 Nov 2010 09:40:05 -0800
Andrei Alexandrescu <seewebsiteforem...@erdani.org> wrote:

string substring(string s, size_t beg, size_t end) // "logical slice" - from code point number beg to code point number end
That's not implemented and I don't think it would be useful. Usually when I want a substring, the calculations up to that point indicate the code _unit_ I'm at.

Yes, but a code unit does not represent a character. Even a full code point only represents what Unicode calls an "abstract character".

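import std.stdio : writeln;
import std.string : indexOf;

void main() {
    // 'a' + U+0302 (combining circumflex accent), then "me"
    dstring s = "\u0061\u0302\u006d\u0065"d;
    writeln(s);     // "âme"
    assert(s[0..1] == "a");             // the first code point is a bare 'a'
    assert(s.indexOf("\u00E2") == -1);  // the precomposed "â" is not found
}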

A "user-perceived character" (also strangely called "grapheme" in Unicode docs) can be represented by an arbitrary number of code
_points_ (up to 8 in their test data, but there is no actual limit). What a code point represents is, say, a "scripting mark". In
the "â" above, there are 2 of them. For legacy reasons, UCS also includes "precomposed characters", so that "â" can also be
represented by a single code point, indeed. But the above form is valid; it's even arguably the base form for "â" (and most composite
chars cannot be represented by a single code point).
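
To make the distinction concrete, here is a minimal sketch that counts the same text both ways. It assumes a current Phobos in which std.uni provides byGrapheme and normalize; neither is part of the discussion above, so take it as an illustration and not as the API being proposed in this thread.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFC;

void main() {
    // Decomposed form: 'a' + U+0302 (combining circumflex accent), then "me".
    dstring s = "\u0061\u0302\u006d\u0065"d;

    assert(s.length == 4);                 // 4 code points (= 4 UTF-32 code units)
    assert(s.byGrapheme.walkLength == 3);  // 3 user-perceived characters

    // NFC normalization composes 'a' + U+0302 into the precomposed U+00E2.
    dstring composed = normalize!NFC(s);
    assert(composed.length == 3);
    assert(composed[0] == '\u00E2');

    // The grapheme count is the same for both forms.
    assert(composed.byGrapheme.walkLength == 3);

    writeln(composed);
}
```

Normalizing both operands to the same form before a search is one way to make the failing indexOf above succeed. For composite characters that have no precomposed code point, only grapheme-level iteration helps.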


OMG, this is worse than I thought O_O
I thought "ok, for UTF-8 one code unit is one byte and one 'real', visible character is called a code point and consists of 1-4 code units" - but having "user-perceived characters" that consist of multiple code points is sick. Unicode has a way to tell if a sequence of code units (bytes) belongs together or not, so identifying code points isn't too hard. But is there a way to identify "graphemes"? Other than a list of rules like "a sequence of the two code points <foo> and <bar> makes up one 'grapheme' <foobar>"?
