09-Mar-2014 21:16, Andrei Alexandrescu wrote:
> On 3/9/14, 4:34 AM, Peter Alexander wrote:
>> I think this is the main confusion: the belief that iterating by code
>> point has utility.
>>
>> If you care about normalization then neither by code unit, by code
>> point, nor by grapheme are correct (except in certain language subsets).
>
> I suspect that code point iteration is the worst as it works only with
> ASCII and perchance with ASCII single-byte extensions. Then we have code
> unit iteration that works with a larger spectrum of languages.

That was clearly meant to be the other way around: code point <--> code unit.

> One question would be how large that spectrum is. If it's larger than
> English, then that would be nice because we would've made progress.


Code points help only insofar as many (~all) high-level algorithms in Unicode are described in terms of code points. Code points have properties; code units have nothing. Code points with an assigned semantic value are "abstract characters".
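
To put that in code (a rough sketch only, assuming std.uni's isAlpha and combiningClass as documented):

import std.uni : combiningClass, isAlpha;

void main()
{
    // properties are defined per code point...
    assert(isAlpha('\u00E9'));               // é (U+00E9) is alphabetic
    assert(combiningClass('\u0301') == 230); // combining acute has class 230
    // ...whereas a lone UTF-8 code unit such as 0xC3 is just an encoding
    // fragment; no Unicode property is defined for it
}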

It's up to the programmer implementing a particular algorithm to make it work "as if" decoding really happened, either by operating directly on code units or by actually decoding and working with code points, which is simpler.
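
For example (just a sketch, using plain language features only):

void main()
{
    string s = "путь/к/файлу";

    // decode and work with code points: position of '/' in code points
    size_t cp = 0;
    foreach (dchar c; s) { if (c == '/') break; ++cp; }
    assert(cp == 4);

    // work directly on code units: same search, position in code units;
    // sound because an ASCII byte never occurs inside a UTF-8 multi-byte
    // sequence, so the outcome is "as if" we had decoded first
    size_t cu = 0;
    foreach (char u; s) { if (u == '/') break; ++cu; }
    assert(cu == 8);
}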

The current std.uni offering mostly works on code points and decodes; the crucial building block for working directly on code units is in review:

https://github.com/D-Programming-Language/phobos/pull/1685

> I don't know about normalization beyond discussions in this group, but
> as far as I understand from
> http://www.unicode.org/faq/normalization.html, normalization would be a
> one-step process, after which code point iteration would cover still
> more human languages. No? I'm pretty sure it's more complicated than
> that, so please illuminate me :o).

Technically, most apps just assume that input comes in UTF-8 in normalization form C. Others, such as browsers, strive for a uniform representation regardless of input and normalize everything they receive (oftentimes the normalization turns out to be a no-op).
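
Something like this (a sketch, assuming std.uni.normalize and the NFC alias as documented):

import std.uni : normalize, NFC;

void main()
{
    string precomposed = "caf\u00E9";   // "café" with precomposed é (U+00E9)
    string decomposed  = "cafe\u0301";  // same text as 'e' + combining acute
    assert(precomposed != decomposed);  // different code points, so != as-is
    // after bringing both to normalization form C they compare equal
    assert(normalize!NFC(decomposed) == precomposed);
}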


>> If you don't care about normalization then by code unit is just as good
>> as by code point, but you don't need to specialise everywhere in Phobos.
>>
>> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
>> but as Vladimir correctly points out: (a) by code point, this is still
>> broken in the face of normalization, and (b) are there any real
>> applications that search a string for a specific non-ASCII character?
>
> What happened to counting characters and such?

Counting chars is dubious. But, for instance, collation is defined in terms of code points. Regex pattern matching is _defined_ in terms of code points (even the mystical level 3 Unicode support of it). So there is certain merit to working at that level. But hacking it to be this way isn't the way to go.
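
And for the record, the s.all!(c => c == 'é') pitfall above plays out like this (a sketch; assumes std.algorithm.canFind and std.uni.normalize):

import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    string s = "cafe\u0301";   // "café" spelled as 'e' + combining acute
    // a code-point search for the precomposed é (U+00E9) misses it...
    assert(!s.canFind('\u00E9'));
    // ...and succeeds only once the haystack is brought to the same
    // normalization form as the needle
    assert(normalize!NFC(s).canFind('\u00E9'));
}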

The least intrusive change would be to generalize the current choice w.r.t. random-access ranges of char/wchar.

--
Dmitry Olshansky
