On Sunday, 28 September 2014 at 14:38:57 UTC, H. S. Teoh via Digitalmars-d wrote:
On Sun, Sep 28, 2014 at 12:06:16PM +0000, Uranuz via Digitalmars-d wrote:
On Sunday, 28 September 2014 at 00:13:59 UTC, Andrei Alexandrescu wrote:
>On 9/27/14, 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
>>If we can get Andrei on board, I'm all for killing off autodecoding.
>
>That's rather vague; it's unclear what would replace it. -- Andrei
I believe that removing autodecoding will make things even worse. As far as I understand, if we remove it from the front() function that operates on narrow strings, then front() will return just a byte of a char. I believe that processing narrow strings by "user-perceived chars" (graphemes) is the more common use case.
[...]
Unfortunately this is not what autodecoding does today. Today's autodecoding only segments strings into code *points*, which are not the same thing as graphemes. For example, combining diacritics are normally not considered separate characters from the user's POV, but they *are* separate codepoints from their base character. The only reason today's autodecoding is even remotely considered "correct" from an intuitive POV is because most Western character sets happen to use only precomposed characters rather than combining diacritic sequences. If you were processing, say, Korean text, the present autodecoding .front would *not* give you what you might imagine is a "single character"; it would only be halves of Korean graphemes. Which, from a user's POV, would suffer from the same issues as dealing with individual bytes in a UTF-8 stream -- any mistake on the program's part in handling these half-units will cause "corruption" of the text (not corruption in the same sense as an improperly segmented UTF-8 byte stream, but in the sense that the wrong glyphs will be displayed on the screen -- from the user's POV these two are basically the same thing).
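To make the "wrong glyphs" failure mode concrete, here is a small sketch (the string is my own illustrative example; it uses byGrapheme and byCodePoint from std.uni as they exist in current Phobos) showing how reversing by code point detaches a combining accent from its base letter, while reversing by grapheme keeps them together:

```d
import std.array : array;
import std.conv : to;
import std.range : retro;
import std.uni : byGrapheme, byCodePoint;

void main()
{
    // "áb" spelled as: 'a' + combining acute accent (U+0301) + 'b'
    string s = "a\u0301b";

    // Code-point reversal moves the accent onto the wrong base letter:
    // the result renders as "b́a" instead of "bá".
    string wrong = s.retro.to!string;
    assert(wrong == "b\u0301a");

    // Grapheme-aware reversal keeps each accent with its base letter.
    string right = s.byGrapheme.array.retro.byCodePoint.to!string;
    assert(right == "ba\u0301");
}
```

Both results are "valid" UTF-8, which is exactly the point: the corruption is invisible to the decoder but obvious to the user.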
You might then be tempted to say, well, let's make .front return graphemes instead. That will solve the "single intuitive character" issue, but the performance will be FAR worse than what it is today.
So basically, what we have today is neither efficient nor complete, but a halfway solution that mostly works for Western character sets and is incomplete for others. We're paying efficiency for only a partial benefit. Is it worth the cost?
I think the correct solution is not for Phobos to decide for the application at what level of abstraction a string ought to be processed. Rather, let the user decide. If they're just dealing with opaque blocks of text, decoding or segmenting by grapheme is completely unnecessary -- they should just operate on byte ranges as opaque data, using byCodeUnit. If they need to work with Unicode codepoints, let them use byCodePoint. If they need to work with individual user-perceived characters (i.e., graphemes), let them use byGrapheme.
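For illustration, here is roughly what the three levels look like on a combining-diacritic sequence (names as in current Phobos: std.utf.byCodeUnit and std.uni.byGrapheme; iterating a plain string gives the autodecoded code points):

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "é" spelled as base letter 'e' + combining acute accent (U+0301)
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3); // UTF-8 code units: 'e' + 2-byte accent
    assert(s.walkLength == 2);            // autodecoded code points (today's default)
    assert(s.byGrapheme.walkLength == 1); // user-perceived characters
}
```

Three different answers to "how long is this string?" -- which one is right depends entirely on what the application is doing, which is exactly why the caller should choose.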
This is why I proposed the deprecation path of making it illegal to pass raw strings to Phobos algorithms -- the caller should specify what level of abstraction they want to work with: byCodeUnit, byCodePoint, or byGrapheme. The standard library's job is to empower the D programmer by giving him the choice, not to shove a predetermined solution down his throat.
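As a sketch of what such a call site might look like (the deprecation itself is only a proposal; the range and algorithm names are today's Phobos ones):

```d
import std.algorithm.searching : canFind;
import std.utf : byCodeUnit;

void main()
{
    string s = "hello, мир";

    // The caller explicitly picks the code-unit level: substring search on
    // raw UTF-8 needs no decoding, and is still correct because UTF-8 is
    // self-synchronizing -- a multibyte needle can't match mid-sequence.
    assert(s.byCodeUnit.canFind("мир".byCodeUnit));
}
```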
T
I totally agree with all of that.
It's one of those cases where correct by default is far too slow (that would have to be graphemes), but fast by default is far too broken. Better to force an explicit choice.
There is no magic bullet for Unicode in a systems language such as D. The programmer must be aware of it and make choices about how to treat it.