On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
> IMO, the "normalization" argument is overrated. I've yet to
> encounter a real-world case of normalization: only hand-written
> counter-examples. Not saying it doesn't exist, just that:
> 1. It occurs only in special cases that the program should be
> aware of beforehand.
> 2. It can arguably be taken care of eagerly, or in a special pass.
> As for "the belief that iterating by code point has utility," I
> have to strongly disagree. Unicode is composed of code points,
> and that is what we handle. The fact that it can be encoded and
> stored as UTF is an implementation detail.
We don't "handle" code points. (When have you ever wanted to
handle a combining character separately from the character it
combines with?)
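To illustrate the combining-character point (in Python rather than D, purely because the semantics are language-independent; the string here is a made-up example): iterating by code point still splits what a user perceives as one character.

```python
# "é" in decomposed (NFD) form is TWO code points:
# 'e' followed by U+0301 COMBINING ACUTE ACCENT.
s = "e\u0301"

# Code point iteration hands you the combining accent on its own,
# detached from the base character it modifies.
print(len(s))                          # 2
print([hex(ord(c)) for c in s])        # ['0x65', '0x301']
```

So even at code point granularity, "one element of the iteration" is not "one character" in any user-visible sense.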
You are just thinking of a subset of languages and locales.
Normalization is an issue any time you have a user enter text
into your program and you then want to search for that text. I
hope we can agree this isn't a rare occurrence.
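A minimal Python sketch of the search-for-user-input scenario (the strings are hypothetical): text typed by a user may arrive in decomposed form while the stored text is precomposed, so neither equality nor substring search works until both sides are normalized.

```python
import unicodedata

typed = "cafe\u0301"   # user input: "café" in decomposed (NFD) form
stored = "caf\u00e9"   # stored text: "café" in precomposed (NFC) form

# They render identically, but compare unequal by code point:
print(typed == stored)          # False
print(typed in stored)          # False -- substring search fails too

# Normalizing both sides to the same form fixes the comparison:
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(typed) == nfc(stored))  # True
```

This is exactly why "search text the user entered" is not a corner case: the two byte sequences are equally valid spellings of the same word.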
AFAIK, there is only one exception, stuff like s.all!(c => c
== 'é'), but as Vladimir correctly points out: (a) by code
point, this is still broken in the face of normalization, and
(b) are there any real applications that search a string for a
specific non-ASCII character?
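A rough Python analogue of the `s.all!(c => c == 'é')` style of per-code-point search (hypothetical data), showing point (a): scanning code points for the precomposed 'é' finds nothing when the text happens to be in decomposed form.

```python
s = "cafe\u0301"   # "café" with the accent as a separate combining mark

# Per-code-point comparison against precomposed U+00E9 never matches,
# even though the text visibly contains an 'é':
print(any(c == "\u00e9" for c in s))   # False
print("\u00e9" in s)                   # False
```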
> But *what* other kinds of algorithms are there? AFAIK, the
> *only* type of algorithm that doesn't need decoding is
> searching, and you know what? std.algorithm.find does it
> perfectly well. This trickles into most other algorithms too:
> split, splitter or findAmong don't decode if they don't have to.
Searching, equality testing, copying, sorting, hashing,
splitting, joining...
I can't think of a single use-case for searching for a non-ASCII
code point. You can search for strings, but searching by code
unit is just as good (and fast by default).
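To see why code unit (byte-level) substring search is just as good, here is a Python sketch using raw UTF-8 bytes (Python stands in for D here; the strings are made up). UTF-8 is self-synchronizing: a valid encoded sequence can only match at a real character boundary, so the decoded and undecoded searches agree.

```python
haystack = "caf\u00e9 au lait"
needle = "caf\u00e9"

# Decoded (code point) search:
print(needle in haystack)                                   # True

# Undecoded (UTF-8 code unit) search gives the same answer,
# without ever decoding a single code point:
hb, nb = haystack.encode("utf-8"), needle.encode("utf-8")
print(nb in hb)          # True
print(hb.find(nb))       # 0 -- match at byte offset 0
```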
> AFAIK, the most common algorithm, "case insensitive search",
> *must* decode.
But it must also normalize and take locales into account, so by
code point is insufficient (unless you are willing to ignore
languages like Turkish). See the Turkish I:
http://en.wikipedia.org/wiki/Turkish_I
Sure, if you just want to ignore normalization and several
languages, then by code point is just fine... but that's the
point: by code point is incorrect in general.
> There may still be cases where it does not work as intended in
> the face of normalization, but it is still leaps and bounds
> better than what we get iterating by code unit.
> To turn it the other way around, *what* are you guys doing that
> doesn't require decoding, and where performance is such a
> killer?
Searching, equality testing, copying, sorting, hashing,
splitting, joining...
The performance thing can be fixed in the library, but my concern
is that (a) it takes a significant amount of code to do so, and
(b) it complicates implementations. There are many, many
algorithms in Phobos that are special-cased for strings, and I
don't think it needs to be that way.
To those that think the status quo is better, can you give an
example of a real-life use case that demonstrates this?
> I do not know of a single bug report regarding buggy Phobos
> code that used front/popFront. Not *a single one* (AFAIK).
> On the other hand, there are plenty of bug reports about
> attempting to not decode strings, or incorrectly decoding
> strings. They are being corrected on a continuous basis.
Can you provide a link to a bug?
Also, you haven't answered the question :-) Can you give a
real-life example of a case where code point decoding was
necessary and code units wouldn't have sufficed?
You have mentioned case-insensitive searching, but I think I've
adequately demonstrated that this doesn't work in general by code
point: you need to normalize and take locales into account.