On Sun, 21 Nov 2010 19:27:06 -0600 Andrei Alexandrescu <seewebsiteforem...@erdani.org> wrote:
> > There is no easy notion of "character" in unicode. A code point is *not*
> > a character. One character can span multiple code points. I fear
> > treating dchars as "the default character unit" is repeating same kind
> > of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and
> > treating each 2-byte code unit as a character. I mean, what's the point
> > of working with the intermediary representation (code points) when it
> > doesn't represent a character?
>
> I understand the concern, and that's why I strongly support formal
> abstractions that are supported by, but largely independent from,
> representations. If graphemes are to be modeled, D is in better shape
> than other languages. What we need to do is define a range byGrapheme()
> that accepts char[], wchar[], or dchar[].

Sure, D helps a lot. I agree with abstraction levels independent of internal representation in the general case (I think it's one major aspect and advantage of ranges, isn't it?).

But it raises a serious efficiency issue in this particular case. Namely, if one deals with a text at the level of graphemes while the representation is a string of code points, then every little routine has to reconstruct the graphemes on the fly. For instance, indexing 3 times will do 3 times the job of constructing a string of graphemes (up to the given indices).

Thus, when one has to do text processing, even of the simplest kind, it is necessary to use a dedicated type (or any kind of tool using a high-level representation). (Analogous to the need to first decode code units into code points, only once, before dealing with code points -- but at a higher level.)

See also my answer to Michel's post.

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
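
To make the cost argument above concrete, here is a rough sketch. It is written in Python rather than D, only to keep it short and independent of any particular D library. All names in it (`graphemes`, `grapheme_at`, `GraphemeText`) are hypothetical, and the clustering rule is a deliberate simplification (a base code point plus any following combining marks), not the full segmentation rules of Unicode Standard Annex #29.

```python
import unicodedata


def graphemes(code_points):
    """Yield simplified grapheme clusters from a string of code points.

    Assumption: a cluster is one code point plus any combining marks
    that follow it. Real grapheme segmentation (UAX #29) has more rules.
    """
    cluster = ""
    for cp in code_points:
        if cluster and unicodedata.combining(cp) == 0:
            # A non-combining code point starts a new cluster.
            yield cluster
            cluster = cp
        else:
            cluster += cp
    if cluster:
        yield cluster


def grapheme_at(code_points, index):
    """On-the-fly indexing: re-scans from the start on every call."""
    for i, g in enumerate(graphemes(code_points)):
        if i == index:
            return g
    raise IndexError(index)


class GraphemeText:
    """Dedicated type: clusters are built once, then indexed directly."""

    def __init__(self, code_points):
        self._clusters = list(graphemes(code_points))

    def __len__(self):
        return len(self._clusters)

    def __getitem__(self, index):
        return self._clusters[index]


# "été" in decomposed form: e + U+0301, t, e + U+0301
text = "e\u0301te\u0301"
```

With this sketch, each call to `grapheme_at(text, i)` repeats the scan from the beginning of the string, whereas `GraphemeText(text)` pays for the scan once and then indexes an already-built sequence. That is the same trade-off as decoding code units into code points once, one level up.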