09-Mar-2014 21:16, Andrei Alexandrescu wrote:
> On 3/9/14, 4:34 AM, Peter Alexander wrote:
>> I think this is the main confusion: the belief that iterating by code
>> point has utility.
>>
>> If you care about normalization then neither by code unit, by code
>> point, nor by grapheme are correct (except in certain language subsets).
>
> I suspect that code point iteration is the worst as it works only with
> ASCII and perchance with ASCII single-byte extensions. Then we have code
> unit iteration that works with a larger spectrum of languages.

That was clearly meant to be the other way around: code point <--> code unit.

> One question would be how large that spectrum is. If it's larger than
> English, then that would be nice because we would've made progress.


Code points help only insofar as many (~all) high-level algorithms in Unicode are described in terms of code points. Code points have properties; code units have nothing. Code points with an assigned semantic value are "abstract characters".
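
To put that in code (a rough sketch only, assuming std.uni's isAlpha and combiningClass as documented):

import std.uni : combiningClass, isAlpha;

void main()
{
    // properties are defined per code point...
    assert(isAlpha('\u00E9'));               // é (U+00E9) is alphabetic
    assert(combiningClass('\u0301') == 230); // combining acute has class 230
    // ...whereas a lone UTF-8 code unit such as 0xC3 is just an encoding
    // fragment; no Unicode property is defined for it
}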

It's up to the programmer implementing a particular algorithm to make it work "as if" decoding really happened, either by operating directly on code units or by actually decoding and working with code points, which is simpler.
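
For example (just a sketch, using plain language features only):

void main()
{
    string s = "путь/к/файлу";

    // decode and work with code points: position of '/' in code points
    size_t cp = 0;
    foreach (dchar c; s) { if (c == '/') break; ++cp; }
    assert(cp == 4);

    // work directly on code units: same search, position in code units;
    // sound because an ASCII byte never occurs inside a UTF-8 multi-byte
    // sequence, so the outcome is "as if" we had decoded first
    size_t cu = 0;
    foreach (char u; s) { if (u == '/') break; ++cu; }
    assert(cu == 8);
}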

The current std.uni offering mostly works on code points and decodes; the crucial building block for working directly on code units is in review:

https://github.com/D-Programming-Language/phobos/pull/1685

> I don't know about normalization beyond discussions in this group, but
> as far as I understand from
> http://www.unicode.org/faq/normalization.html, normalization would be a
> one-step process, after which code point iteration would cover still
> more human languages. No? I'm pretty sure it's more complicated than
> that, so please illuminate me :o).

Technically, most apps just assume that input comes in UTF-8 in normalization form C. Others, such as browsers, strive for a uniform representation regardless of input and normalize everything they receive (oftentimes the normalization turns out to be a no-op).
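
Something like this (a sketch, assuming std.uni.normalize and the NFC alias as documented):

import std.uni : normalize, NFC;

void main()
{
    string precomposed = "caf\u00E9";   // "café" with precomposed é (U+00E9)
    string decomposed  = "cafe\u0301";  // same text as 'e' + combining acute
    assert(precomposed != decomposed);  // different code points, so != as-is
    // after bringing both to normalization form C they compare equal
    assert(normalize!NFC(decomposed) == precomposed);
}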


>> If you don't care about normalization then by code unit is just as good
>> as by code point, but you don't need to specialise everywhere in Phobos.
>>
>> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
>> but as Vladimir correctly points out: (a) by code point, this is still
>> broken in the face of normalization, and (b) are there any real
>> applications that search a string for a specific non-ASCII character?
>
> What happened to counting characters and such?

Counting chars is dubious. But, for instance, collation is defined in terms of code points. Regex pattern matching is _defined_ in terms of code points (even the mystical level 3 Unicode support of it). So there is certain merit to working at that level. But hacking it to be this way isn't the way to go.
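
And for the record, the s.all!(c => c == 'é') pitfall above plays out like this (a sketch; assumes std.algorithm.canFind and std.uni.normalize):

import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    string s = "cafe\u0301";   // "café" spelled as 'e' + combining acute
    // a code-point search for the precomposed é (U+00E9) misses it...
    assert(!s.canFind('\u00E9'));
    // ...and succeeds only once the haystack is brought to the same
    // normalization form as the needle
    assert(normalize!NFC(s).canFind('\u00E9'));
}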

The least intrusive change would be to generalize the current choice w.r.t. random-access ranges of char/wchar.

--
Dmitry Olshansky
