On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu
wrote:
> Searching for characters in strings would be difficult to deem
> inappropriate.
The notion of "character" exists only in certain writing systems.
Searching by character is thus a flawed practice, and I think it
should not be encouraged, as it will only make writing truly
international software more difficult. A more correct approach is
searching for a certain substring. If non-exact matching is needed
(normalization, case insensitivity, etc.), then the appropriate
solution is to use the Unicode algorithms.
If you look at the situation from this point of view, single code
points become merely an implementation detail.
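To make the point concrete, here is a minimal sketch (in Python rather than D, purely because the Unicode behavior is language-independent): two canonically equivalent strings can use different code point sequences, so a code-point-level search misses a match that the Unicode normalization algorithm finds.

```python
import unicodedata

hay = "re\u0301sume\u0301"       # "résumé" with combining accents
needle = "r\u00e9sum\u00e9"      # "résumé" with precomposed U+00E9

# A naive code-point search fails: the strings are canonically
# equivalent but their code point sequences differ.
assert needle not in hay

# The Unicode way: normalize both sides first (NFC here).
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(needle) in nfc(hay)
```

The same comparison done after normalization treats the code point representation as exactly what it is: an implementation detail.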
> 1. All algorithms would by default operate on strings at
> char/wchar level (i.e. code unit). That would cause the usual
> issues and confusions I was aware of from C++. Certain
> algorithms would require specialization and/or the user using
> byDchar for correctness.
As previously discussed, "correctness" here is conditional. I
would not use that word; it is another extreme.
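The code-unit-level confusion mentioned in the quote, and the sense in which code-point-level processing is only conditionally "correct", can both be sketched as follows (Python is used for illustration; the same applies to operating on D's char[] before decoding):

```python
s = "h\u00e9llo"            # "héllo"
units = s.encode("utf-8")   # code units (bytes, in UTF-8)

# Reversing at the code-unit level splits the two-byte sequence
# for "é", producing invalid UTF-8.
try:
    units[::-1].decode("utf-8")
    broken = False
except UnicodeDecodeError:
    broken = True
assert broken

# Reversing at the code-point level keeps each code point intact...
assert s[::-1] == "oll\u00e9h"
# ...yet still splits combining sequences (grapheme clusters):
g = "e\u0301"                  # "é" as base letter + combining accent
assert g[::-1] == "\u0301e"    # the accent now precedes its base
```

The last two lines are why decoding to code points fixes one class of bugs while leaving another intact.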
> From experience with C++ I knew (1) had a bad track record, and
> (2) "generically conservative, specialize for speed" was a
> successful pattern.
>
> What would you have chosen given that context?
Ideally, we would have the Unicode algorithms in the standard
library from day 1, and advocated their use throughout the
documentation.
>> I'm inclined to say that the correct approach is to state that
>> algorithms operate explicitly on a T.sizeof basis, and that if
>> the data contained in a particular range has some multi-element
>> encoding, then separate, specialized routines should be used,
>> as the T.sizeof behavior will not produce the desired result.
> That sounds quite like C++ plus ICU. It doesn't strike me as
> the golden standard for Unicode integration.
Why not? Because it sounds like D needs exactly that. Plus its
amazing slicing and range capabilities, of course.
>> So the problem to me is that we're stuck not fixing something
>> that's horribly broken just because it's broken in a way that
>> people presumably now expect.
> Clearly I'm being subjective here but again I'd find it
> difficult to get convinced we have something horribly broken
> from the evidence I gathered inside and outside Facebook.
Have you or anyone you personally know tried to process text in D
containing a writing system such as Sanskrit's?
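For context on why a script like Devanagari (used for Sanskrit) stresses code-point-level APIs, here is a small illustration (Python; full grapheme cluster boundaries follow UAX #29 and are not reproduced here):

```python
# "ksha", a common Devanagari conjunct: KA + VIRAMA + SSA
s = "\u0915\u094d\u0937"

# One user-perceived character, but three code points, so
# code-point-level "character" counts mislead here...
assert len(s) == 3
# ...and code-point slicing happily splits the cluster mid-way.
assert s[:1] == "\u0915"
```

Code-point iteration is therefore no more "one element per character" for such text than code-unit iteration is for accented Latin text.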
>> I'd personally like to see this fixed and I think the new
>> behavior is preferable overall, but I do share Andrei's concern
>> that such a big change might hurt the language anyway.
> I've said this once and I'm saying it again: the best way to
> convert this discussion into something useful is to devise
> ideas for useful non-breaking additions.
I disagree. As I've argued, I believe that currently most uses of
dchars in an application are incorrect, and ultimately a time
bomb for proper internationalization support. We should apply
the same procedure that we do with any language construct that
was deemed to have been a poor decision: put it through a
deprecation cycle and fix it.