On Sunday, 9 March 2014 at 11:34:31 UTC, Peter Alexander wrote:
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win.

I'd be tempted to not ask "how do we back out", but rather, "how can we take this further"? I'd love to ditch the whole "char"/"dchar" thing altogether, and work with graphemes. But that would be massive involvement.

Why do you think it is better?

Let's be clear here: if you are searching/iterating/comparing by code point then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either.

I think this is the main confusion: the belief that iterating by code point has utility.

If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets).

If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos.
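To make the three levels concrete, here is a minimal sketch that counts the two spellings of "é" by code unit, by code point and by grapheme. The helper names are mine, not Phobos API, and it assumes a Phobos recent enough to ship std.uni.byGrapheme and std.uni.normalize.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Illustrative helpers (hypothetical names, not part of Phobos).
size_t codeUnits(string s)  { return s.length; }                // UTF-8 bytes
size_t codePoints(string s) { return s.walkLength; }            // strings iterate as dchar
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

void main()
{
    string precomposed = "\u00E9";   // é as a single code point
    string decomposed  = "e\u0301";  // e followed by a combining acute accent

    assert(codeUnits(precomposed)  == 2 && codeUnits(decomposed)  == 3);
    assert(codePoints(precomposed) == 1 && codePoints(decomposed) == 2);
    assert(graphemes(precomposed)  == 1 && graphemes(decomposed)  == 1);

    // The two strings differ at every level until they are normalized.
    assert(precomposed != decomposed);
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
}
```

The grapheme counts agree, but the strings only compare equal once both have been normalized to the same form.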

IMO, the "normalization" argument is overrated. I've yet to encounter a real-world case of normalization: only hand-written counter-examples. Not saying it doesn't exist, just that:
1. It occurs only in special cases that the program should be aware of beforehand.
2. Arguably, it should be taken care of eagerly, or in a special pass.

As for "the belief that iterating by code point has utility." I have to strongly disagree. Unicode is composed of codepoints, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail.

As for the grapheme thing, I'm not actually so sure about it myself, so don't take it too seriously.

AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character?
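For reference, a small sketch of point (a). The helper allEAcute is a made-up name, and I am assuming std.uni.normalize is available. The decoded comparison works on the precomposed spelling, fails on the decomposed one, and works again after an NFC pass.

```d
import std.algorithm.searching : all;
import std.uni : normalize, NFC;

// Hypothetical helper: true if every decoded code point is U+00E9 (é).
bool allEAcute(string s)
{
    return s.all!(c => c == '\u00E9');
}

void main()
{
    string precomposed = "\u00E9\u00E9\u00E9";    // "ééé", one code point each
    string decomposed  = "e\u0301e\u0301e\u0301"; // same text, e + combining accent

    assert(allEAcute(precomposed));
    assert(!allEAcute(decomposed));               // broken by the other spelling
    assert(allEAcute(normalize!NFC(decomposed))); // fixed by normalizing first
}
```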

But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to.
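A quick sketch of why searching can skip decoding: UTF-8 is self-synchronizing, so a valid needle can only match on code point boundaries, and a search over the raw code units lands on the same spot as a search over the string. The wrapper names are mine. The snippet only shows that the results agree; it does not prove what Phobos does internally.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.algorithm.searching : find;
import std.string : representation;

// Hypothetical wrappers so that both searches are easy to compare.
string findDecoded(string haystack, string needle)
{
    return haystack.find(needle);
}

immutable(ubyte)[] findRaw(string haystack, string needle)
{
    return haystack.representation.find(needle.representation);
}

void main()
{
    string text = "naïve café au lait";

    assert(findDecoded(text, "café") == "café au lait");
    // Searching the raw bytes finds the same position.
    assert(findRaw(text, "café") == "café au lait".representation);

    // Splitting on an ASCII separator leaves the non-ASCII pieces intact.
    assert("a,é,b".splitter(',').equal(["a", "é", "b"]));
}
```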

AFAIK, the most common algorithm "case insensitive search" *must* decode.
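To illustrate, a rough sketch comparing a code-unit-only case fold with one that decodes. Both helper names are hypothetical. The lowercase-then-search approach is a simplification of real case folding, which is what std.uni.icmp does for whole-string comparison.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : map;
import std.algorithm.searching : canFind;
import std.ascii : asciiToLower = toLower;
import std.uni : icmp, toLower;
import std.utf : byCodeUnit;

// Hypothetical: compares code units after folding ASCII letters only.
bool asciiOnlyEqual(string a, string b)
{
    return equal(a.byCodeUnit.map!(c => asciiToLower(c)),
                 b.byCodeUnit.map!(c => asciiToLower(c)));
}

// Hypothetical: lowercases both sides with std.uni, which has to decode.
bool containsIgnoreCase(string haystack, string needle)
{
    return haystack.toLower.canFind(needle.toLower);
}

void main()
{
    assert(asciiOnlyEqual("ECOLE", "ecole"));
    assert(!asciiOnlyEqual("ÉCOLE", "école"));   // the bytes of É and é differ
    assert(icmp("ÉCOLE", "école") == 0);         // decoding comparison succeeds
    assert(containsIgnoreCase("Visit the ÉCOLE today", "école"));
}
```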

There may still be cases where it is still not working as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating with codeunits.

To turn it the other way around, *what* are you guys doing, that doesn't require decoding, and where performance is such a killer?

To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this?

I do not know of a single bug report regarding buggy Phobos code that used front/popFront. Not_a_single_one (AFAIK).

On the other hand, there are plenty of cases of bugs for attempting to not decode strings, or incorrectly decoding strings. They are being corrected on a continuous basis.

Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Even Walter pointed out that such code should work. *Maybe* it is still wrong in regards to graphemes and normalization, but at *least*, the result is not a corrupted UTF-8 stream.
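Here is a sketch of what goes wrong, with made-up helper names. It sorts the same string once by decoded code points and once by raw code units. The second result is no longer valid UTF-8, which std.utf.validate reports by throwing.

```d
import std.algorithm.sorting : sort;
import std.array : array;
import std.conv : to;
import std.exception : assertThrown;
import std.string : representation;
import std.utf : UTFException, validate;

// Hypothetical: decode to dchar[], sort, re-encode as UTF-8.
string sortByCodePoint(string s)
{
    auto points = s.array;   // a narrow string becomes dchar[] here
    points.sort();
    return points.to!string;
}

// Hypothetical: sort the raw UTF-8 bytes with no regard for sequences.
char[] sortByCodeUnit(string s)
{
    auto units = s.dup.representation;   // ubyte[] view of a mutable copy
    units.sort();
    return cast(char[]) units;
}

void main()
{
    assert(sortByCodePoint("héllo") == "hlloé");

    // Pure ASCII survives a code-unit sort...
    assert(sortByCodeUnit("dcba") == "abcd");

    // ...but the two bytes of é get separated and reordered.
    auto broken = sortByCodeUnit("héllo");
    assertThrown!UTFException(validate(broken));
}
```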

Walter keeps grinding on about "myCharArray.put('é')" not working, but I'm not sure he realizes how dangerous it would actually be to allow such a thing to work.

In particular, in all these cases, a simple call to "representation" will deactivate the feature, giving you the tools you want.
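For anyone following along, a minimal sketch of the opt-out (helper names are mine). The same string iterates as dchar by default and as raw ubyte once it goes through std.string.representation.

```d
import std.range : ElementType, walkLength;
import std.string : representation;

// By default a string is a range of decoded dchar.
static assert(is(ElementType!string == dchar));
// representation exposes the same memory as plain bytes.
static assert(is(typeof("".representation) == immutable(ubyte)[]));

// Hypothetical helpers for comparing the two views.
size_t countDecoded(string s) { return s.walkLength; }
size_t countRaw(string s)     { return s.representation.walkLength; }

void main()
{
    string s = "café";
    assert(countDecoded(s) == 4);   // c, a, f, é
    assert(countRaw(s) == 5);       // é takes two UTF-8 code units
}
```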

I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page.

Me too. I do see the value in being able to do decode-less iteration. I just think the *default* behavior has the advantage of being correct *most* of the time, and definitely much more correct than without decoding.

I think opt-out of decoding is just a much much much saner approach to string handling.
