On 3/9/14, 4:34 AM, Peter Alexander wrote:
> I think this is the main confusion: the belief that iterating by code
> point has utility.

> If you care about normalization then neither by code unit, by code
> point, nor by grapheme are correct (except in certain language subsets).

I suspect that code unit iteration is the worst, as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code point iteration, which works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice, because we would've made progress.
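
To make the distinction concrete, here's a minimal sketch (illustrative strings; the counts are in the comments):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "é" as one precomposed code point (U+00E9) -- two UTF-8 code units.
    string s = "caf\u00E9";
    writeln(s.length);     // 5 code units (bytes, for UTF-8)
    writeln(s.walkLength); // 4 code points (front/popFront auto-decode)

    // "é" as 'e' plus combining acute (U+0301) -- two code points.
    string t = "cafe\u0301";
    writeln(t.walkLength);            // 5 code points
    writeln(t.byGrapheme.walkLength); // 4 graphemes, what the user "sees"
}

Code unit iteration mangles the 'é' in both strings; code point iteration handles the first but not the second.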

I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o).
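
To illustrate what I mean (a sketch using std.uni's normalize; the strings are contrived):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : NFC, normalize;

void main()
{
    string precomposed = "caf\u00E9";  // 'é' as one code point
    string decomposed  = "cafe\u0301"; // 'e' + combining acute accent

    assert(precomposed != decomposed);                // distinct sequences
    assert(normalize!NFC(decomposed) == precomposed); // equal after NFC

    // After NFC the code point count matches the perceived character
    // count -- but only for characters that have precomposed forms.
    writeln(normalize!NFC(decomposed).walkLength); // 4
}

The catch, as far as I can tell: not every combining sequence has a precomposed form, so NFC alone can't rescue code point iteration for every language.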

> If you don't care about normalization then by code unit is just as good
> as by code point, but you don't need to specialise everywhere in Phobos.

> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
> but as Vladimir correctly points out: (a) by code point, this is still
> broken in the face of normalization, and (b) are there any real
> applications that search a string for a specific non-ASCII character?

What happened to counting characters and such?
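
Regarding (a), a quick sketch of the breakage (contrived strings again):

import std.algorithm : canFind;
import std.stdio : writeln;

void main()
{
    string nfc = "caf\u00E9";  // 'é' precomposed (U+00E9)
    string nfd = "cafe\u0301"; // 'e' + combining acute (U+0301)

    // A code point search finds only the precomposed spelling:
    writeln(nfc.canFind('\u00E9')); // true
    writeln(nfd.canFind('\u00E9')); // false -- same text, no match
}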

> To those that think the status quo is better, can you give an example of
> a real-life use case that demonstrates this?

split(ter) comes to mind.
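
For instance, splitting on a non-ASCII delimiter works at the code point level today (a sketch; the bullet separator is an arbitrary choice):

import std.algorithm : splitter;
import std.stdio : writeln;

void main()
{
    // '•' (U+2022) is three code units in UTF-8, yet matching it works
    // because the separator is compared at the code point level.
    foreach (part; "alpha\u2022beta\u2022gamma".splitter('\u2022'))
        writeln(part); // prints alpha, beta, gamma
}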

> I do think it's probably too late to change this, but I think there is
> value in at least getting everyone on the same page.

Awesome.


Andrei
