On Thursday, 2 June 2016 at 20:52:29 UTC, ag0aep6g wrote:
On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
By whom? The "support level 1" folks yonder at the Unicode standard? :o)
-- Andrei

Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?

The level 2 support description noted that it should be opt-in because it's slow. Arguably it should be easier to operate on code units if you know it's safe to do so, but always working on code units or always working on graphemes by default seems to be either too broken too often or too slow too often.

Now one can argue either for consistency with code units (because then we can treat char[] and friends as a slice) or for correctness with graphemes, but really the more I think about it, the more I think there is no good default and you need to learn Unicode anyway. The only sad parts here are that 1) we hijacked an array type for strings, which sucks, and 2) we don't have an API that is actually good at teaching the user what it does and doesn't do.

The consequence of 1 is that generic code that also wants to deal with strings will want to special-case them to get rid of auto-decoding; the consequence of 2 is that we will have tons of not-actually-correct string handling. I would assume that almost all string handling code out in the wild is broken anyway (in code I have encountered I have never seen attempts to normalize or do other things before or after comparisons, searching, etc.), unless of course YOU or one of your colleagues wrote it :o) Consider how often code checks the length of a string in Java or C# to validate that it is no longer than X characters. That check is wrong, because length() in Java and .Length in C# return the number of UTF-16 code units.
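
To make the Java case concrete, here is a minimal sketch, not taken from any real codebase. The class and method names (StringLengthDemo, codeUnits, codePoints, graphemes, equalAfterNfc) are made up for illustration; the calls themselves (String.length, String.codePointCount, java.text.BreakIterator, java.text.Normalizer) are standard JDK. Note that BreakIterator only approximates grapheme clusters, and how well it handles things like emoji sequences depends on the JDK version.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.Locale;

public class StringLengthDemo {
    // What String.length() reports: the number of UTF-16 code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Approximate number of user-perceived characters, using the
    // JDK's character BreakIterator (an approximation of graphemes).
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Compares two strings after normalizing both to NFC.
    static boolean equalAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // precomposed e with acute
        String decomposed = "e\u0301";  // e followed by combining acute
        String clef = "\uD834\uDD1E";   // U+1D11E, outside the BMP

        System.out.println(codeUnits(decomposed));   // 2
        System.out.println(codePoints(decomposed));  // 2
        System.out.println(graphemes(decomposed));   // 1

        System.out.println(codeUnits(clef));         // 2
        System.out.println(codePoints(clef));        // 1

        System.out.println(composed.equals(decomposed));        // false
        System.out.println(equalAfterNfc(composed, decomposed)); // true
    }
}
```

So a "no longer than X characters" check based on length() rejects or accepts the same visible text depending on how it happens to be encoded, and a plain equals() treats two renderings of the same letter as different unless someone normalized first.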

So really, as bad and alarming as "incorrect string handling" by default seems, in practice it has not prevented people from writing working (internationalized!) applications in other languages that get used way more than D. One could say we should do it better than them, but I would be inclined to believe that RCStr provides our opportunity to do so. Having char[] be what it is is an annoying wart, and maybe at some point we can deprecate/remove that behaviour, but for now I'd rather see if RCStr is viable than attempt to change the semantics of all string handling code in D.
