On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
> IMO, the "normalization" argument is overrated. I've yet to encounter a real-world case of normalization: only hand-written counter-examples. Not saying it doesn't exist, just that:
> 1. It occurs only in special cases that the program should be aware of beforehand.
> 2. It can arguably be taken care of eagerly, or in a special pass.

> As for "the belief that iterating by code point has utility", I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail.

We don't "handle" code points. When have you ever wanted to handle a combining character separately from the character it combines with?

You are just thinking of a subset of languages and locales.

Normalization is an issue any time you have a user enter text into your program and you then want to search for that text. I hope we can agree this isn't a rare occurrence.
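
To make that concrete, here's a minimal D sketch (variable names made up, using Phobos's std.uni.normalize and std.algorithm.canFind): the search only succeeds once both sides are put into the same normalization form.

import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    // User types "café" as 'e' followed by U+0301 (combining acute accent).
    string userInput = "cafe\u0301";
    // The document stores the precomposed form U+00E9.
    string document = "le caf\u00E9 est ouvert";

    // Neither the code units nor the code points line up, so the search fails...
    assert(!document.canFind(userInput));

    // ...until both sides are normalized to the same form (NFC here).
    assert(normalize!NFC(document).canFind(normalize!NFC(userInput)));
}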


AFAIK, the only exception is stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character?
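
A quick sketch of point (a), reusing that predicate: by code point it accepts the precomposed form but rejects the canonically equivalent decomposed one.

import std.algorithm : all;

void main()
{
    // Precomposed U+00E9: decoding by code point makes the predicate pass.
    string composed = "\u00E9\u00E9";
    assert(composed.all!(c => c == 'é'));

    // The same text as 'e' + U+0301 (combining acute): the predicate now
    // sees different code points and fails, even though the rendered text
    // is identical. Decoding to code points alone does not solve this.
    string decomposed = "e\u0301e\u0301";
    assert(!decomposed.all!(c => c == 'é'));
}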

> But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to.

Searching, equality testing, copying, sorting, hashing, splitting, joining...

I can't think of a single use-case for searching for a non-ASCII code point. You can search for strings, but searching by code unit is just as good (and fast by default).
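
Roughly, something like this (a sketch using std.string.representation to get at the code units): since UTF-8 is self-synchronizing, a substring match can never start in the middle of a multi-byte sequence.

import std.algorithm : find;
import std.string : representation;

void main()
{
    string haystack = "日本語のテキスト";
    string needle = "テキスト";

    // Code-unit search: plain byte comparison, no decoding.
    auto hit = haystack.representation.find(needle.representation);
    assert(hit.length != 0);

    // The decoding version finds the same match; it just does more work.
    assert(haystack.find(needle).length != 0);
}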


> AFAIK, the most common algorithm, "case-insensitive search", *must* decode.

But it must also normalize and take locales into account, so going by code point is insufficient (unless you are willing to ignore languages like Turkish). See the Turkish I:

http://en.wikipedia.org/wiki/Turkish_I

Sure, if you just want to ignore normalization and several languages, then by code point is just fine... but that's the point: by code point is incorrect in general.
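
For illustration, a small sketch with std.uni.icmp, which applies the default, locale-free Unicode case folding:

import std.uni : icmp;

void main()
{
    // Default Unicode case folding: 'I' matches 'i'...
    assert(icmp("I", "i") == 0);

    // ...and 'I' does not match dotless 'ı' (U+0131).
    assert(icmp("I", "\u0131") != 0);

    // Under Turkish rules it is the other way around: 'I' pairs with 'ı',
    // and 'İ' (U+0130) pairs with 'i'. No amount of per-code-point folding
    // can express that without locale information.
}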


> There may still be cases where it does not work as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating by code units.

> To turn it the other way around: *what* are you guys doing that doesn't require decoding, and where performance is such a killer?

Searching, equality testing, copying, sorting, hashing, splitting, joining...

The performance thing can be fixed in the library, but my concern is that (a) it takes a significant amount of code to do so, and (b) it complicates implementations. There are many, many algorithms in Phobos that are special-cased for strings, and I don't think it needs to be that way.
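
To sketch what I mean for one of those (splitting): with an ASCII delimiter the generic splitter can run straight over the code units, because an ASCII byte can never occur inside a multi-byte UTF-8 sequence. No decoding, no string special case.

import std.algorithm : splitter;
import std.array : array;
import std.string : representation;

void main()
{
    string csv = "札幌,東京,大阪";

    // Generic splitter over the ubyte representation of the string.
    auto fields = csv.representation.splitter(cast(ubyte) ',').array;

    assert(fields.length == 3);
    assert(fields[0] == "札幌".representation);
}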


>> To those who think the status quo is better, can you give an example of a real-life use case that demonstrates this?

> I do not know of a single bug report regarding buggy Phobos code that used front/popFront. Not a single one (AFAIK).

> On the other hand, there are plenty of bugs caused by attempts to not decode strings, or by decoding them incorrectly. They are being corrected on a continuous basis.

Can you provide a link to a bug?

Also, you haven't answered the question :-) Can you give a real-life example of a case where code point decoding was necessary and code units wouldn't have sufficed?

You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account.
