On Thursday, 6 September 2018 at 16:44:11 UTC, H. S. Teoh wrote:
On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via Digitalmars-d wrote:
On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> // D
> auto a = "á";     // precomposed: U+00E1
> auto b = "á";     // decomposed: 'a' + U+0301 (combining acute)
> auto c = "\u200B"; // U+200B ZERO WIDTH SPACE
> auto x = a ~ c ~ a;
> auto y = b ~ c ~ b;
> 
> writeln(a.length); // 2 wtf
> writeln(b.length); // 3 wtf
> writeln(x.length); // 7 wtf
> writeln(y.length); // 9 wtf
[...]

This is an unfair comparison. In the Swift version you used .count, but here you used .length, which is the length of the array, NOT the number of characters or whatever you expect it to be. You should rather use .count and specify exactly what you want to count, e.g., byCodePoint or byGrapheme.
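
For instance, here is a minimal D sketch of the difference (just an illustration, using Phobos' std.uni/std.utf, with the string built as a base letter plus a combining accent):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    auto s = "a\u0301";               // "á" as 'a' + U+0301 COMBINING ACUTE ACCENT
    writeln(s.length);                // 3 - UTF-8 code units in the array
    writeln(s.byCodeUnit.walkLength); // 3 - code units, counted explicitly
    writeln(s.byDchar.walkLength);    // 2 - code points
    writeln(s.byGrapheme.walkLength); // 1 - graphemes, i.e. user-perceived characters
}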

I suspect the Swift version will give you unexpected results if you did something like compare "á" to "a\u301", for example (which, in case it isn't obvious, are visually identical to each other, and as far as an end user is concerned, should only count as 1 grapheme).
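
In D, for instance, the two spellings only compare equal after normalization; a quick sketch using std.uni.normalize:

import std.stdio : writeln;
import std.uni : normalize;

void main()
{
    auto precomposed = "\u00E1";  // "á" as a single code point
    auto decomposed  = "a\u0301"; // "á" as 'a' + combining acute
    writeln(precomposed == decomposed);                       // false - different code units
    writeln(normalize(precomposed) == normalize(decomposed)); // true - both are NFC afterwards
}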

Not even normalization will help you if you have a string like "a\u301\u302": in that case, the *only* correct way to count the number of visual characters is byGrapheme, and I highly doubt Swift's .count will give you the correct answer in that case. (I expect that Swift's .count will count code points, as is the usual default in many languages, which is unfortunately wrong when you're thinking about visual characters, which are called graphemes in Unicode parlance.)
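
A sketch of that case in D: NFC still leaves two code points, but byGrapheme sees a single character.

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize;

void main()
{
    auto s = normalize("a\u0301\u0302"); // 'a' + combining acute + combining circumflex
    writeln(s.walkLength);               // 2 - NFC composes only "a\u0301" into U+00E1
    writeln(s.byGrapheme.walkLength);    // 1 - still one visual character
}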

And even in your given example, what should .count return when there's a zero-width character? If you're counting the number of visual places taken by the string (e.g., you're trying to align output in a fixed-width terminal), then *both* versions of your code are wrong, because zero-width characters do not occupy any space when displayed. If you're counting the number of code points, though, e.g., to allocate the right buffer size to convert to dstring, then you want to count the zero-width character as 1 rather than 0. And that's not to mention double-width characters, which should count as 2 if you're outputting to a fixed-width terminal.
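
To make the zero-width case concrete, a small D sketch: every count sees the character, and none of them tells you it takes no columns.

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    auto z = "\u200B";                // U+200B ZERO WIDTH SPACE
    writeln(z.length);                // 3 - UTF-8 code units
    writeln(z.walkLength);            // 1 - code point (what you'd count for a dstring buffer)
    writeln(z.byGrapheme.walkLength); // 1 - grapheme, yet it occupies zero columns on screen
}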

Again I say, you need to know how Unicode works. Otherwise you can easily deceive yourself to think that your code (both in D and in Swift and in any other language) is correct, when in fact it will fail miserably when it receives input that you didn't think of. Unicode is NOT ASCII, and you CANNOT assume there's a 1-to-1 mapping between "characters" and display length. Or 1-to-1 mapping between any of the various concepts of string "length", in fact.

In ASCII, array length == number of code points == number of graphemes == display width.

In Unicode, array length != number of code points != number of graphemes != display width.
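
A concrete example (a sketch, using a double-width CJK character):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    auto s = "\u4E16";                // "世", a double-width CJK character
    writeln(s.length);                // 3 - code units
    writeln(s.walkLength);            // 1 - code point
    writeln(s.byGrapheme.walkLength); // 1 - grapheme
    // None of these is 2, the number of columns it occupies in a fixed-width terminal.
}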

Code written by anyone who does not understand this is WRONG, because you will inevitably end up using the wrong value for the wrong thing: e.g., array length for number of code points, or number of code points for display length. Not even .byGrapheme will save you here; you *need* to understand that zero-width and double-width characters exist, and what they imply for display width. You *need* to understand the difference between code points and graphemes. There is no single default that will work in every case, because there are DIFFERENT CORRECT ANSWERS depending on what your code is trying to accomplish. Pretending that you can just brush all this detail under the rug of a single number is just deceiving yourself, and will inevitably result in wrong code that will fail to handle Unicode input correctly.


T

It's a totally fair comparison. .count in Swift is the equivalent of .length in D; it's what you use to get the size of an array, etc. For strings they've essentially implemented it as D's string.byGrapheme.walkLength, so it's intuitively correct (and yes, slower). If you don't want that default, you can also specify which "view" over the characters you want. E.g.

let a = "á̂"              // "a" + combining acute + combining circumflex
a.count                  // 1 <-- yes, exactly as expected
a.unicodeScalars.count   // 3
a.utf8.count             // 5
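
For comparison, roughly how you get the same three numbers in D today (a sketch, with the string spelled out as explicit escapes):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    auto a = "a\u0301\u0302";          // the same "á̂" as above
    writeln(a.byGrapheme.walkLength);  // 1 - what Swift's .count gives you
    writeln(a.walkLength);             // 3 - code points, like a.unicodeScalars.count
    writeln(a.length);                 // 5 - UTF-8 code units, like a.utf8.count
}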

I don't really see any issues with a zero-width character. If you want to deal with screen width (i.e. pixel space), that's not the same as how many characters are in a string, and it doesn't matter whether you go byGrapheme or byCodePoint or byCodeUnit, because none of those represent a single column on screen. A zero-width character is 0 *width* but it's still *one* character. There's no .length/size/count in any language (that I've heard of) that'll give you your screen space from its string type. You query the font API for that, as it depends on font size, kerning, style and face.

And again, I agree you need to know how Unicode works; I don't argue that at all. I'm just saying that having the default be incorrect for application logic is silly, and when people have to write things like string.representation.normalize.byGrapheme or whatever just to search for a character in a string *correctly*... well, just, ARGH!
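
To make that concrete, a small sketch of why the ceremony is needed just to find "the same" character (normalization is the minimum; fully correct matching would still want a grapheme-level walk on top):

import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.uni : normalize;

void main()
{
    auto haystack = "cafe\u0301";   // "café" with a decomposed é
    auto needle   = "caf\u00E9";    // "café" with a precomposed é
    writeln(haystack.canFind(needle));                       // false - compares code points as-is
    writeln(normalize(haystack).canFind(normalize(needle))); // true - both NFC first
}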

D makes the code-point case the default, and hence that becomes the simplest to use. But unfortunately, the only thing I can think of that requires a code-point representation is dealing specifically with Unicode algorithms themselves (normalization, etc.). Here's a good read on code points: https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

tl;dr: application logic does not need or want to deal with code points. For speed, code units work, and for correctness, graphemes work.

Yes, you will fail miserably when you receive input you did not expect. That's always true; that's why we have APIs that make it easier or harder to fail. Expecting people to be Unicode experts before using Unicode is also unreasonable - or rather, it just makes failing much easier. I sit next to one of the guys who worked on Unicode in Qt and he couldn't explain the difference between a grapheme and an extended grapheme cluster... I'm not saying I can, btw... I'm just saying Unicode is frikkin hard. And we don't need APIs making it harder to get right - which is exactly what non-correct-by-default APIs do.

To boil it down to one sentence: I think it's silly to have a string type that is advertised as Unicode but optimized for latin1 (...ish), because people will use it for Unicode and get incorrect results from its naturally intuitive usage.

Cheers,
- Ali
