On 3/9/14, 4:34 AM, Peter Alexander wrote:
> I think this is the main confusion: the belief that iterating by code
> point has utility.
> If you care about normalization then neither by code unit, by code
> point, nor by grapheme are correct (except in certain language subsets).
I suspect that code unit iteration is the worst, as it works only with
ASCII and perchance with ASCII single-byte extensions. Then we have code
point iteration, which works with a larger spectrum of languages. One
question would be how large that spectrum is. If it's larger than
English, that would be nice because we would've made progress.
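To make the distinction concrete, here is a small sketch in Python (chosen only because it makes the counts easy to see; nothing below is Phobos API). Byte-level (code unit) iteration is only safe for ASCII, while code point iteration also covers languages whose letters are single, precomposed code points:

```python
s = "café"                       # 'é' here is U+00E9, one code point
units = list(s.encode("utf-8"))  # UTF-8 code units (bytes)
points = list(s)                 # code points

# 'é' occupies two code units in UTF-8, so byte iteration sees 5 "characters"
assert len(units) == 5
# but it is a single code point, so code point iteration sees the right 4
assert len(points) == 4
```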
I don't know about normalization beyond discussions in this group, but
as far as I understand from
http://www.unicode.org/faq/normalization.html, normalization would be a
one-step process, after which code point iteration would cover still
more human languages. No? I'm pretty sure it's more complicated than
that, so please illuminate me :o).
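My (possibly naive) reading of that one-step idea, again sketched in Python's `unicodedata` rather than anything in Phobos: normalize once up front, and afterwards code point comparisons line up for precomposed forms:

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# As raw code point sequences the two spellings of 'é' compare unequal...
assert composed != decomposed

# ...but after a one-time NFC normalization pass they compare equal,
# and the letter collapses back to a single code point.
assert unicodedata.normalize("NFC", decomposed) == composed
assert len(unicodedata.normalize("NFC", decomposed)) == 1
```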
> If you don't care about normalization then by code unit is just as good
> as by code point, but you don't need to specialise everywhere in Phobos.
> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
> but as Vladimir correctly points out: (a) by code point, this is still
> broken in the face of normalization, and (b) are there any real
> applications that search a string for a specific non-ASCII character?
What happened to counting characters and such?
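Counting is exactly where the three levels give three different answers. A Python sketch (the grapheme count below is only approximated by skipping combining marks; real grapheme segmentation follows UAX #29 and is more involved):

```python
import unicodedata

s = "noe\u0308l"  # "noël" spelled with a decomposed 'ë' (e + combining diaeresis)

n_units = len(s.encode("utf-8"))   # code units: the combining mark takes 2 bytes
n_points = len(s)                  # code points: the mark counts as one
# crude grapheme estimate: count only code points with combining class 0
n_graphemes = sum(1 for c in s if unicodedata.combining(c) == 0)

assert (n_units, n_points, n_graphemes) == (6, 5, 4)
```

So "how many characters?" has no single answer until you pick a level, and the one users usually mean (4) is the one no built-in iteration gives directly.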
> To those that think the status quo is better, can you give an example of
> a real-life use case that demonstrates this?
split(ter) comes to mind.
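And notably, split on an ASCII separator is one case where code unit iteration is already correct: UTF-8 is self-synchronizing, so no byte of a multi-byte sequence falls below 0x80 and an ASCII delimiter can never land inside one. A quick illustration (Python again, as a neutral sandbox, not D):

```python
raw = "naïve,café,touché".encode("utf-8")

# Splitting the raw code units on the ASCII comma cannot cut through
# a multi-byte sequence, so every piece decodes cleanly.
parts = [p.decode("utf-8") for p in raw.split(b",")]
assert parts == ["naïve", "café", "touché"]
```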
> I do think it's probably too late to change this, but I think there is
> value in at least getting everyone on the same page.
Awesome.
Andrei