On Wednesday, 21 October 2015 at 14:13:43 UTC, Shriramana Sharma wrote:
John Colvin wrote:

But this is false, no? Since ElementType!string is char and not dchar?

No. char[], wchar[] and dchar[] all have ElementType dchar. Strings are special for ranges. It's a bad mistake, but it is what it is and apparently won't be changed.

Why is it a mistake? That seems a very sane thing, although somewhat quirky. Since ElementType is a range primitive, and apparently iterating through a string as a range produces each semantically meaningful Unicode character rather than each UTF-8 or UTF-16 code unit, it does make sense to do this.

LOL. This could open up a huge discussion if you're not careful. A code point is not necessarily a full character. Operating on individual code units is generally wrong, because you frequently need multiple code units to get a full code point. Similarly, to get a full character - what's called a grapheme - you sometimes need multiple code points. To make matters even worse, the same grapheme can often be represented by different combinations of code points (e.g. an accented e can be represented as a single code point, or as the code point for e followed by the code point for the combining accent - and depending on the Unicode normalization form being used, the order of those code points can differ).
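
To see it concretely, here's a quick sketch (untested, just counting with walkLength and std.uni.byGrapheme):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // "é" as a single, precomposed code point (U+00E9)
        string precomposed = "\u00E9";
        // "é" as 'e' followed by a combining acute accent (U+0301)
        string combining = "e\u0301";

        // counting code points - what an autodecoded range gives you
        writeln(precomposed.walkLength); // 1
        writeln(combining.walkLength);   // 2 - two code points, one character

        // counting graphemes - both are a single character
        writeln(precomposed.byGrapheme.walkLength); // 1
        writeln(combining.byGrapheme.walkLength);   // 1
    }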

So, operating at the code point level does _not_ actually make your program correct. It gets you closer, but you're still operating on pieces of characters - and it's arguably more pernicious, because more of the common characters "just work" while still not ensuring that all of them work, making it harder to catch when you screw it up.

However, operating at the grapheme level is incredibly expensive. In fact, operating at the code point level is often unnecessarily expensive. So, if you care about efficiency, you want to be operating at the code unit level as much as possible. And because most string code doesn't actually need to operate on individual characters, operating at the code unit level is actually frequently plenty (especially if your strings have had their code points normalized so that the same characters will always result in the same sequence of code units).
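
e.g. if your strings have been normalized up front, a plain, code unit level comparison is all you need for equality. Roughly (untested, assuming NFC via std.uni.normalize):

    import std.stdio : writeln;
    import std.uni : NFC, normalize;

    void main()
    {
        string a = "\u00E9";  // precomposed é
        string b = "e\u0301"; // e + combining acute accent

        // the raw code units differ, so a plain comparison fails
        writeln(a == b); // false

        // normalize both to NFC, and comparing code units is enough
        writeln(normalize!NFC(a) == normalize!NFC(b)); // true
    }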

So, what we have with Phobos is neither fast nor correct. It's constantly decoding code points when it's completely unnecessary (Phobos has to special case its algorithms for strings all over the place to avoid unnecessary decoding). And because ranges deal at the code point level by default, they're not correct. Really, code should either be operating at the code unit level or the grapheme level. You're getting the worst of both worlds when operating at the code point level.

Rather, what's really needed is for the programmer to know enough about Unicode to know when they should be operating on code units or graphemes (or occasionally code points) and then explicitly do that - which is why we have std.utf.byCodeUnit/byChar/byWchar/byDchar and std.uni.byCodePoint/byGrapheme. But as soon as you use those, you lose out on the specializations that operate on arrays, as well as any other code that specifically operates on arrays - even when you just want to operate on a char[] as a range of char.
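
For example, something like this (untested) picks each level explicitly and gets a different answer at each one:

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        // "résumé" with both é's as precomposed code points
        string s = "r\u00E9sum\u00E9";

        writeln(s.byCodeUnit.walkLength); // 8 - UTF-8 code units
        writeln(s.walkLength);            // 6 - code points (autodecoded)
        writeln(s.byGrapheme.walkLength); // 6 - graphemes
    }

And because byCodeUnit hands back a wrapper range rather than a char[], code written specifically for arrays no longer matches it.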

The reality of the matter is that _most_ algorithms would work just fine with treating char[] as a range of char so long as they do explicit decoding when necessary (and it often wouldn't be necessary), but instead, we're constantly autodecoding, because that's what front and popFront do for arrays of char or wchar.
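
i.e. something as simple as this (untested) shows the decoding happening:

    import std.range.primitives : ElementType, front;
    import std.stdio : writeln;

    void main()
    {
        string s = "hello";

        // the array's elements are immutable(char), but the range primitives
        // decode on the fly, so front hands you a dchar
        static assert(is(ElementType!string == dchar));
        static assert(is(typeof(s.front) == dchar));

        writeln(s.front); // 'h', but as a dchar
    }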

When Andrei came up with the current scheme, he didn't know about graphemes. He thought that code points were always full characters. And if that were the case, the way Phobos works would make sense. It might be slower by default, but it would be correct, and you could special-case on strings to operate on them more efficiently if you needed the extra efficiency. However, because code points are _not_ necessarily full characters, we're taking an efficiency hit without getting full correctness. Instead, we're getting the illusion of correctness. It's like how Andrei explained in TDPL that UTF-16 is worse than UTF-8, because it's harder to catch when you screw up and chop a character in half. Only, it turns out that that applies to UTF-32 as well.

- Jonathan M Davis
