On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
> Searching for characters in strings would be difficult to deem inappropriate.

The notion of "character" exists only in certain writing systems. Searching by "character" is thus a flawed practice, and I think it should not be encouraged, as it will only make writing truly international software more difficult. A more correct approach is searching for a certain substring. If non-exact matching is needed (normalization, case insensitivity, etc.), then the appropriate solution is to use the Unicode algorithms.

If you look at the situation from this point of view, single code points become merely an implementation detail.
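To make this concrete, here is a minimal sketch (the strings are made up; it uses std.algorithm.canFind and std.uni.normalize): an exact substring search plus normalization covers the matching problem without ever asking what a "character" is.

import std.algorithm : canFind;
import std.stdio : writeln;
import std.uni;   // normalize, NFC

void main()
{
    string haystack = "Bjo\u0308rn plays the piano"; // 'o' + combining diaeresis
    string needle   = "Bj\u00F6rn";                  // precomposed 'ö'

    // An exact substring search fails: the code point sequences differ.
    writeln(haystack.canFind(needle));                              // false

    // Normalizing both sides (here to NFC) makes the match succeed.
    writeln(haystack.normalize!NFC.canFind(needle.normalize!NFC));  // true
}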

> 1. All algorithms would by default operate on strings at char/wchar level (i.e. code unit). That would cause the usual issues and confusions I was aware of from C++. Certain algorithms would require specialization and/or the user using byDchar for correctness.

As previously discussed, "correctness" here is conditional. I would not use that word; it is another extreme.
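To show what "conditional" means here, a small sketch (my own example string) of the three ways the same text can be measured; neither code units nor code points is unconditionally the right answer:

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l";   // "noël" written with a combining diaeresis

    writeln(s.length);                 // 6 -- UTF-8 code units
    writeln(s.walkLength);             // 5 -- code points (the auto-decoded dchar view)
    writeln(s.byGrapheme.walkLength);  // 4 -- user-perceived characters
}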

> From experience with C++ I knew (1) had a bad track record, and (2) "generically conservative, specialize for speed" was a successful pattern.

> What would you have chosen given that context?

Ideally, we would have the Unicode algorithms in the standard library from day 1, and advocated their use throughout the documentation.
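For the record, this is roughly what I mean by "the Unicode algorithms", sketched with the comparison routines std.uni provides today (the example strings are mine):

import std.stdio : writeln;
import std.uni : icmp, sicmp;

void main()
{
    // Full Unicode case folding handles non-ASCII letters...
    writeln(icmp("Διόνυσος", "ΔΙΌΝΥΣΟΣ") == 0);  // true

    // ...including 1:M mappings such as German ß vs. "ss", which a
    // per-code-point comparison (sicmp) cannot get right.
    writeln(icmp("straße", "STRASSE") == 0);      // true
    writeln(sicmp("straße", "STRASSE") == 0);     // false
}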

>> I'm inclined to say that the correct approach is to state that
>> algorithms operate explicitly on a T.sizeof basis and that if the data
>> contained in a particular range has some multi-element encoding then
>> separate, specialized routines should be used, as the T.sizeof
>> behavior will not produce the desired result.

> That sounds quite like C++ plus ICU. It doesn't strike me as the golden standard for Unicode integration.

Why not? It sounds like exactly what D needs, combined with D's amazing slicing and range capabilities, of course.
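Concretely, the combination I have in mind looks something like the sketch below (the example string is arbitrary): indices are plain code-unit offsets, search results can be turned into slices of the original array, and nothing decodes behind your back.

import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    string s = "Pražské metro";

    // indexOf reports a code-unit (byte) offset into the UTF-8 array:
    // 10 rather than 8, because 'ž' and 'é' each take two code units.
    auto i = s.indexOf("metro");
    writeln(i);          // 10

    // That offset can be used directly for O(1) slicing.
    writeln(s[0 .. i]);  // "Pražské "
    writeln(s[i .. $]);  // "metro"
}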

>> So the problem to me is that we're stuck not fixing something that's
>> horribly broken just because it's broken in a way that people
>> presumably now expect.

> Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.

Have you or anyone you personally know tried to process text in D containing a writing system such as Sanskrit's?
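As a concrete, made-up example of what goes wrong: take the Devanagari word "नमस्ते". A reader sees three written syllables (न, म, स्ते), the dchar view sees six code points, and code-point arithmetic will happily cut a syllable in half:

import std.range : take, walkLength;
import std.stdio : writeln;

void main()
{
    string s = "नमस्ते";       // namaste: 6 code points, 18 UTF-8 code units

    writeln(s.walkLength);   // 6 -- not a count any reader of the script would give

    // "The first four characters" at the dchar level ends in a dangling
    // virama, splitting the cluster स्ते down the middle.
    writeln(s.take(4));      // "नमस्"
}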

>> I'd personally like to see this fixed and I think the new behavior is
>> preferable overall, but I do share Andrei's concern that such a big
>> change might hurt the language anyway.

> I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions.

I disagree. As I've argued, I believe that currently most uses of dchars in an application are incorrect, and ultimately a time bomb for proper internationalization support. We need to apply the same procedure that we do with any language construct that was deemed to have been a poor decision: put it through a deprecation cycle and fix it.
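A minimal sketch of the kind of silent bug I mean (the word is made up): a dchar-by-dchar reverse looks perfectly reasonable and passes every ASCII test, yet detaches combining marks from their base letters.

import std.conv : to;
import std.range : retro;
import std.stdio : writeln;

void main()
{
    string s = "ba\u0308r";   // "bär" written with a combining diaeresis

    // retro walks the auto-decoded dchar range backwards, so the
    // diaeresis ends up after 'r' instead of 'a': the result renders
    // as "r̈ab" rather than the intended "räb".
    writeln(s.retro.to!string);
}

Reversing at the grapheme level (for instance by collecting std.uni.byGrapheme into an array first) gets this right, which is exactly the kind of Unicode-aware operation that should be the advertised default.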
