On Sunday, 28 September 2014 at 14:38:57 UTC, H. S. Teoh via Digitalmars-d wrote:
On Sun, Sep 28, 2014 at 12:06:16PM +0000, Uranuz via Digitalmars-d wrote:
On Sunday, 28 September 2014 at 00:13:59 UTC, Andrei Alexandrescu wrote:
>On 9/27/14, 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
>>If we can get Andrei on board, I'm all for killing off autodecoding.
>
>That's rather vague; it's unclear what would replace it. -- Andrei
I believe that removing autodecoding will make things even worse. As far as I understand, if we remove it from the front() function that operates on narrow strings, then front() will return just a byte of a char. I believe that processing narrow strings by "user-perceived chars" (graphemes) is the more common use case.
[...]
Unfortunately this is not what autodecoding does today. Today's autodecoding only segments strings into code *points*, which are not the same thing as graphemes. For example, combining diacritics are normally not considered separate characters from the user's POV, but they *are* separate codepoints from their base character. The only reason today's autodecoding is even remotely considered "correct" from an intuitive POV is because most Western character sets happen to use only precomposed characters rather than combining diacritic sequences. If you were processing, say, Korean text, the present autodecoding .front would *not* give you what you might imagine is a "single character"; it would only be halves of Korean graphemes. Which, from a user's POV, would suffer from the same issues as dealing with individual bytes in a UTF-8 stream -- any mistake on the program's part in handling these half-units will cause "corruption" of the text (not corruption in the same sense as an improperly segmented UTF-8 byte stream, but in the sense that the wrong glyphs will be displayed on the screen -- from the user's POV these two are basically the same thing).
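To make the "wrong glyphs" failure mode concrete, here is a small sketch (the string is my own illustrative example; it uses byGrapheme and byCodePoint from std.uni as they exist in current Phobos) showing how reversing by code point detaches a combining accent from its base letter, while reversing by grapheme keeps them together:

```d
import std.array : array;
import std.conv : to;
import std.range : retro;
import std.uni : byGrapheme, byCodePoint;

void main()
{
    // "áb" spelled as: 'a' + combining acute accent (U+0301) + 'b'
    string s = "a\u0301b";

    // Code-point reversal moves the accent onto the wrong base letter:
    // the result renders as "b́a" instead of "bá".
    string wrong = s.retro.to!string;
    assert(wrong == "b\u0301a");

    // Grapheme-aware reversal keeps each accent with its base letter.
    string right = s.byGrapheme.array.retro.byCodePoint.to!string;
    assert(right == "ba\u0301");
}
```

Both results are "valid" UTF-8, which is exactly the point: the corruption is invisible to the decoder but obvious to the user.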
You might then be tempted to say, well, let's make .front return graphemes instead. That will solve the "single intuitive character" issue, but the performance will be FAR worse than what it is today.
So basically, what we have today is neither efficient nor complete, but a halfway solution that mostly works for Western character sets and is incomplete for others. We're paying efficiency for only a partial benefit. Is it worth the cost?
I think the correct solution is not for Phobos to decide for the application at what level of abstraction a string ought to be processed. Rather, let the user decide. If they're just dealing with opaque blocks of text, decoding or segmenting by grapheme is completely unnecessary -- they should just operate on byte ranges as opaque data, using byCodeUnit. If they need to work with Unicode codepoints, let them use byCodePoint. If they need to work with individual user-perceived characters (i.e., graphemes), let them use byGrapheme.
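For illustration, here is roughly what the three levels look like on a combining-diacritic sequence (names as in current Phobos: std.utf.byCodeUnit and std.uni.byGrapheme; iterating a plain string gives the autodecoded code points):

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "é" spelled as base letter 'e' + combining acute accent (U+0301)
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3); // UTF-8 code units: 'e' + 2-byte accent
    assert(s.walkLength == 2);            // autodecoded code points (today's default)
    assert(s.byGrapheme.walkLength == 1); // user-perceived characters
}
```

Three different answers to "how long is this string?" -- which one is right depends entirely on what the application is doing, which is exactly why the caller should choose.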
This is why I proposed the deprecation path of making it illegal to pass raw strings to Phobos algorithms -- the caller should specify what level of abstraction they want to work with: byCodeUnit, byCodePoint, or byGrapheme. The standard library's job is to empower the D programmer by giving him the choice, not to shove a predetermined solution down his throat.
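As a sketch of what such a call site might look like (the deprecation itself is only a proposal; the range and algorithm names are today's Phobos ones):

```d
import std.algorithm.searching : canFind;
import std.utf : byCodeUnit;

void main()
{
    string s = "hello, мир";

    // The caller explicitly picks the code-unit level: substring search on
    // raw UTF-8 needs no decoding, and is still correct because UTF-8 is
    // self-synchronizing -- a multibyte needle can't match mid-sequence.
    assert(s.byCodeUnit.canFind("мир".byCodeUnit));
}
```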
T
I totally agree with all of that.
It's one of those cases where correct by default is far too slow (that would have to be graphemes), but fast by default is far too broken. Better to force an explicit choice.
There is no magic bullet for Unicode in a systems language such as D. The programmer must be aware of it and make choices about how to treat it.