On Wednesday, 21 October 2015 at 14:13:43 UTC, Shriramana Sharma
wrote:
> John Colvin wrote:
>>> But this is false, no? Since ElementType!string is char and
>>> not dchar?
>> No. char[], wchar[] and dchar[] all have ElementType dchar.
>> Strings are special for ranges. It's a bad mistake, but it is
>> what it is and apparently won't be changed.
> Why is it a mistake? That seems like a very sane thing, although
> somewhat quirky. Since ElementType is a range primitive, and
> apparently iterating through a string as a range will produce
> each semantically meaningful Unicode character rather than each
> UTF-8 or UTF-16 code unit, it does make sense to do this.
LOL. This could open up a huge discussion if you're not careful.
A code point is not necessarily a full character. Operating on
individual code units is generally wrong, because you frequently
need multiple code units to get a full code point. Similarly, to
get a full character - what's called a grapheme - you sometimes
need multiple code points. To make matters even worse, the same
grapheme can often be represented by different combinations of
code points (e.g. an accented e can be represented as a single
code point or it could be represented with the code point for e
and the code point for the accent - and depending on the Unicode
normalization form being used, the order of those code points
could differ).
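
Here's a quick sketch of that with the two representations of an
accented e (just an illustration, nothing beyond std.uni and
std.range): the code unit and code point counts disagree, and only
the grapheme count matches what a user would call one character.

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : NFC, byGrapheme, normalize;

void main()
{
    string precomposed = "\u00E9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    string combining   = "e\u0301"; // 'e' plus U+0301 COMBINING ACUTE ACCENT

    writeln(precomposed.length);                // 2 - UTF-8 code units
    writeln(combining.length);                  // 3 - UTF-8 code units
    writeln(precomposed.walkLength);            // 1 - code points (autodecoded)
    writeln(combining.walkLength);              // 2 - code points
    writeln(precomposed.byGrapheme.walkLength); // 1 - graphemes
    writeln(combining.byGrapheme.walkLength);   // 1 - graphemes

    // After NFC normalization, the two are identical at the code unit level.
    assert(normalize!NFC(combining) == precomposed);
}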
So, operating at the code point level does _not_ actually make
your program correct. It gets you closer, but you're still
operating on pieces of characters - and it's arguably more
pernicious, because more of the common characters "just work"
while still not ensuring that all of them work, making it harder
to catch when you screw it up.
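
A small sketch of what "operating on pieces of characters" looks
like in practice (the combining-accent literal is just for
illustration):

import std.range : front, retro;
import std.stdio : writeln;

void main()
{
    string s = "e\u0301llo"; // displays as "éllo", accent stored separately

    // front yields a complete code point, but only part of the first
    // character that the user actually sees:
    writeln(s.front); // 'e', without its accent

    // Reversing by code point detaches the accent from its base, and
    // in the result it combines with an 'l' instead:
    writeln(s.retro); // o, l, l, U+0301, e
}

Both lines compile and run without complaint; the breakage only
shows up in the output.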
However, operating at the grapheme level is incredibly expensive.
In fact, operating at the code point level is often unnecessarily
expensive. So, if you care about efficiency, you want to be
operating at the code unit level as much as possible. And because
most string code doesn't actually need to operate on individual
characters, operating at the code unit level is actually
frequently plenty (especially if your strings have had their code
points normalized so that the same characters will always result
in the same sequence of code units).
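
As a sketch of that (assuming both inputs have already been
normalized - NFC here), a substring search can stay entirely at the
code unit level:

import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.uni : NFC, normalize;
import std.utf : byCodeUnit;

void main()
{
    auto haystack = normalize!NFC("cafe\u0301 au lait");
    auto needle   = normalize!NFC("caf\u00E9");

    // No decoding anywhere: UTF-8 is self-synchronizing, so a valid
    // needle can only match at code point boundaries anyway.
    writeln(haystack.byCodeUnit.canFind(needle.byCodeUnit)); // true
}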
So, what we have with Phobos is neither fast nor correct. It's
constantly decoding code points when it's completely unnecessary
(Phobos has to special case its algorithms for strings all over
the place to avoid unnecessary decoding). And because ranges deal
at the code point level by default, they're not correct. Really,
code should either be operating at the code unit level or the
grapheme level. You're getting the worst of both worlds when
operating at the code point level.
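
To see in code what "ranges deal at the code point level by
default" means (byCodeUnit is here purely for contrast):

import std.range : ElementEncodingType, ElementType, front;
import std.utf : byCodeUnit;

void main()
{
    // A char[] used as a range autodecodes: its element type is dchar,
    // even though what's actually stored is chars.
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementEncodingType!string == immutable char));

    string s = "\u00E9tat"; // "état"
    assert(s.front == '\u00E9');        // front decodes two UTF-8 code units
    assert(s.byCodeUnit.front == 0xC3); // the first raw code unit, no decoding
}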
Rather, what's really needed is for the programmer to know enough
about Unicode to know when they should be operating on code
units or graphemes (or occasionally code points) and then
explicitly do that - which is why we have
std.utf.byCodeUnit/byChar/byWchar/byDchar and
std.uni.byCodePoint/byGrapheme. But as soon as you use those, you
lose out on the specializations that operate on arrays as well as
any other code that specifically operates on arrays - even when
you just want to operate on a char[] as a range of char.
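
A quick sketch of explicitly picking the level with those
functions:

import std.array : array;
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byCodePoint, byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "e\u0301"; // one user-perceived character

    writeln(s.byCodeUnit.walkLength); // 3 - UTF-8 code units
    writeln(s.byDchar.walkLength);    // 2 - code points
    writeln(s.byGrapheme.walkLength); // 1 - grapheme

    // Grapheme-level work can be lazily converted back to code points:
    writeln(s.byGrapheme.byCodePoint.array); // a dchar[] with the original two code points
}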
The reality of the matter is that _most_ algorithms would work
just fine treating char[] as a range of char so long as they
do explicit decoding when necessary (and it often wouldn't be
necessary), but instead, we're constantly autodecoding, because
that's what front and popFront do for arrays of char or wchar.
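
As a rough sketch of that, here's a hypothetical word counter (not
Phobos code) that walks a char[] as code units and only decodes
when it actually hits a non-ASCII code unit:

import std.uni : isWhite;
import std.utf : decode;

size_t countWords(const(char)[] s)
{
    size_t words;
    bool inWord;
    for (size_t i = 0; i < s.length; )
    {
        bool white;
        if (s[i] < 0x80) // ASCII: the code unit _is_ the code point
        {
            white = s[i] == ' ' || s[i] == '\t' ||
                    s[i] == '\n' || s[i] == '\r';
            ++i;
        }
        else // start of a multi-byte sequence: decode explicitly
        {
            white = isWhite(decode(s, i)); // decode advances i
        }

        if (white)
            inWord = false;
        else if (!inWord)
        {
            inWord = true;
            ++words;
        }
    }
    return words;
}

void main()
{
    // U+00A0 (no-break space) is whitespace too, so this is 3 words.
    assert(countWords("caf\u00E9 au\u00A0lait") == 3);
}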
When Andrei came up with the current scheme, he didn't know about
graphemes. He thought that code points were always full
characters. And if that were the case, the way Phobos works would
make sense. It might be slower by default, but it would be
correct, and you could special-case on strings to operate on them
more efficiently if you needed the extra efficiency. However,
because code points are _not_ necessarily full characters, we're
taking an efficiency hit without getting full correctness.
Instead, we're getting the illusion of correctness. It's like how
Andrei explained in TDPL that UTF-16 is worse than UTF-8, because
it's harder to catch when you screw up and chop a character in
half. Only, it turns out that that applies to UTF-32 as well.
- Jonathan M Davis