On Sun, 21 Nov 2010 19:27:06 -0600
Andrei Alexandrescu <seewebsiteforem...@erdani.org> wrote:

> > There is no easy notion of "character" in unicode. A code point is *not*
> > a character. One character can span multiple code points. I fear
> > treating dchars as "the default character unit" is repeating same kind
> > of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and
> > treating each 2-byte code unit as a character. I mean, what's the point
> > of working with the intermediary representation (code points) when it
> > doesn't represent a character?  
> 
> I understand the concern, and that's why I strongly support formal 
> abstractions that are supported by, but largely independent from, 
> representations. If graphemes are to be modeled, D is in better shape 
> than other languages. What we need to do is define a range byGrapheme() 
> that accepts char[], wchar[], or dchar[].

Sure, D helps a lot. I agree with abstraction levels independent of the internal 
representation in the general case (I think that is one major aspect and advantage 
of ranges, isn't it?). But it yields a huge efficiency issue in this very case: 
if one deals with text at the level of graphemes while the representation is a 
string of code points, then every little routine has to reconstruct the 
graphemes on the fly. For instance, indexing 3 times will do 3 times the job of 
constructing a string of graphemes (up to the given indices).
Thus, when one has to do text processing, even of the simplest kind, it is 
necessary to use a dedicated type (or any kind of tool using a high-level 
representation). (This is analogous to the need to first decode code units into 
code points, only once, before dealing with code points -- but at a higher level.)
See also my answer to Michel's post.
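
To make the cost concrete, here is a rough sketch in Python (chosen only for 
brevity, the point is language-independent). The names graphemes, grapheme_at 
and GraphemeText are made up for this example, and the segmentation rule is a 
deliberate simplification: a base code point plus any following combining 
marks, not the full Unicode grapheme cluster rules.

```python
import unicodedata


def graphemes(text):
    """Yield simplified clusters: a base code point plus trailing combining marks.

    Assumption: this is NOT the full Unicode segmentation algorithm, only
    enough to show where the work happens.
    """
    cluster = ""
    for cp in text:
        # A non-combining code point starts a new cluster.
        if cluster and unicodedata.combining(cp) == 0:
            yield cluster
            cluster = ""
        cluster += cp
    if cluster:
        yield cluster


def grapheme_at(text, index):
    """On-the-fly indexing: re-segments from the start on every call."""
    for i, g in enumerate(graphemes(text)):
        if i == index:
            return g
    raise IndexError(index)


class GraphemeText:
    """Hypothetical dedicated type: segments once, then indexes directly."""

    def __init__(self, text):
        self._clusters = list(graphemes(text))

    def __len__(self):
        return len(self._clusters)

    def __getitem__(self, index):
        return self._clusters[index]


s = "e\u0301te\u0301"   # "été" in decomposed form
t = GraphemeText(s)     # segmentation is paid for here, once

# len(s) is 5 (code points), len(t) is 3 (graphemes).
# grapheme_at(s, 2) walks the code points again on every call,
# while t[2] is a plain list lookup.
```

Calling grapheme_at three times repeats the segmentation three times, which is 
exactly the "3 times the job" above; with the dedicated type the work is done 
once at construction.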

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
