On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
> IMO, the "normalization" argument is overrated. I've yet to encounter a real-world case of normalization: only hand-written counter-examples. Not saying it doesn't exist, just that:
> 1. It occurs only in special cases that the program should be aware of beforehand.
> 2. It can arguably be taken care of eagerly, or in a special pass.

> As for "the belief that iterating by code point has utility", I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail.

We don't "handle" code points. When have you ever wanted to handle a combining character separately from the character it combines with?

You are just thinking of a subset of languages and locales.

Normalization is an issue any time you have a user enter text into your program and you then want to search for that text. I hope we can agree this isn't a rare occurrence.
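
To make that concrete, here's a minimal D sketch (variable names made up, using Phobos's std.uni.normalize and std.algorithm.canFind): the search only succeeds once both sides are put into the same normalization form.

import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    // User types "café" as 'e' followed by U+0301 (combining acute accent).
    string userInput = "cafe\u0301";
    // The document stores the precomposed form U+00E9.
    string document = "le caf\u00E9 est ouvert";

    // Neither the code units nor the code points line up, so the search fails...
    assert(!document.canFind(userInput));

    // ...until both sides are normalized to the same form (NFC here).
    assert(normalize!NFC(document).canFind(normalize!NFC(userInput)));
}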


AFAIK, the only exception is stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character?
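
A quick sketch of point (a), reusing that predicate: by code point it accepts the precomposed form but rejects the canonically equivalent decomposed one.

import std.algorithm : all;

void main()
{
    // Precomposed U+00E9: decoding by code point makes the predicate pass.
    string composed = "\u00E9\u00E9";
    assert(composed.all!(c => c == 'é'));

    // The same text as 'e' + U+0301 (combining acute): the predicate now
    // sees different code points and fails, even though the rendered text
    // is identical. Decoding to code points alone does not solve this.
    string decomposed = "e\u0301e\u0301";
    assert(!decomposed.all!(c => c == 'é'));
}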

> But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to.

Searching, equality testing, copying, sorting, hashing, splitting, joining...

I can't think of a single use-case for searching for a non-ASCII code point. You can search for strings, but searching by code unit is just as good (and fast by default).
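
Roughly, something like this (a sketch using std.string.representation to get at the code units): since UTF-8 is self-synchronizing, a substring match can never start in the middle of a multi-byte sequence.

import std.algorithm : find;
import std.string : representation;

void main()
{
    string haystack = "日本語のテキスト";
    string needle = "テキスト";

    // Code-unit search: plain byte comparison, no decoding.
    auto hit = haystack.representation.find(needle.representation);
    assert(hit.length != 0);

    // The decoding version finds the same match; it just does more work.
    assert(haystack.find(needle).length != 0);
}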


> AFAIK, the most common algorithm, "case-insensitive search", *must* decode.

But it must also normalize and take locales into account, so going by code point is insufficient (unless you are willing to ignore languages like Turkish). See the Turkish I:

http://en.wikipedia.org/wiki/Turkish_I

Sure, if you just want to ignore normalization and several languages, then by code point is just fine... but that's the point: by code point is incorrect in general.
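
For illustration, a small sketch with std.uni.icmp, which applies the default, locale-free Unicode case folding:

import std.uni : icmp;

void main()
{
    // Default Unicode case folding: 'I' matches 'i'...
    assert(icmp("I", "i") == 0);

    // ...and 'I' does not match dotless 'ı' (U+0131).
    assert(icmp("I", "\u0131") != 0);

    // Under Turkish rules it is the other way around: 'I' pairs with 'ı',
    // and 'İ' (U+0130) pairs with 'i'. No amount of per-code-point folding
    // can express that without locale information.
}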


> There may still be cases where it does not work as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating by code units.

> To turn it the other way around: *what* are you guys doing that doesn't require decoding, and where performance is such a killer?

Searching, equality testing, copying, sorting, hashing, splitting, joining...

The performance thing can be fixed in the library, but my concern is that (a) it takes a significant amount of code to do so, and (b) it complicates implementations. There are many, many algorithms in Phobos that are special-cased for strings, and I don't think it needs to be that way.
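
To sketch what I mean for one of those (splitting): with an ASCII delimiter the generic splitter can run straight over the code units, because an ASCII byte can never occur inside a multi-byte UTF-8 sequence. No decoding, no string special case.

import std.algorithm : splitter;
import std.array : array;
import std.string : representation;

void main()
{
    string csv = "札幌,東京,大阪";

    // Generic splitter over the ubyte representation of the string.
    auto fields = csv.representation.splitter(cast(ubyte) ',').array;

    assert(fields.length == 3);
    assert(fields[0] == "札幌".representation);
}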


>> To those who think the status quo is better, can you give an example of a real-life use case that demonstrates this?

> I do not know of a single bug report regarding buggy Phobos code that used front/popFront. Not a single one (AFAIK).

> On the other hand, there are plenty of bugs caused by attempts to not decode strings, or by decoding them incorrectly. They are being corrected on a continuous basis.

Can you provide a link to a bug?

Also, you haven't answered the question :-) Can you give a real-life example of a case where code point decoding was necessary and code units wouldn't have sufficed?

You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account.
