On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer" <schvei...@yahoo.com> said:

> It's even easier than this:
>
> a) you want to do a proper string comparison not knowing what state the
> unicode strings are in, use the full-fledged decode-when-needed string
> type, and its associated str.find method.
> b) you know they are both the same normalized form and want to optimize,
> use std.algorithm.find(haystack.asArray, needle.asArray).

Well, treating the string as an array of dchar doesn't work in the general case. Even with both strings normalized the same way, your fiancé example can break. So I should never treat them as plain arrays unless I'm sure the string contains no combining marks.
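
To make that concrete, here's a minimal sketch (my own illustration, not something from the thread) of the failure at the code-point level: both strings are in the same normalization form (NFD here), yet std.algorithm reports a match that is wrong at the grapheme level.

import std.algorithm : canFind;
import std.stdio;

void main()
{
    // Both strings are NFD: the é in the haystack is 'e' followed by
    // the combining acute accent U+0301.
    auto haystack = "fiance\u0301"d; // "fiancé", decomposed
    auto needle   = "fiance"d;

    // A code-point-level search finds "fiance", but grapheme-wise the
    // last user-perceived character of the haystack is "é", not "e".
    writeln(canFind(haystack, needle)); // prints: true (a spurious match)
}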

I'm not opposed to having a new string type being developed, but I'm skeptical about its inclusion in the language. We already have three string types which can be assumed to contain valid UTF sequences. I think the first thing to do is not to develop a new string type, but to develop the normalization and grapheme-splitting algorithms, plus a substring search built on them, all working on the existing char[], wchar[] and dchar[] types. Then write a program with proper handling of Unicode using those and hand-optimize it. If that proves to be a pain (it might well be), write a new string type, rewrite the program using it, do some benchmarks, and then we'll know whether it's a good idea and be able to quantify the drawbacks.
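
For illustration only, roughly the shape such a substring search could take on top of the existing string types; this sketch assumes std.uni provides normalize and byGrapheme primitives (today's Phobos has them, but they didn't exist at the time of this post), and the name graphemeFind is made up for the example.

import std.algorithm : canFind;
import std.array : array;
import std.uni : byGrapheme, normalize, NFC;

// Grapheme-aware substring search built directly on char[]: bring both
// strings to the same normalization form, then compare grapheme by grapheme.
bool graphemeFind(string haystack, string needle)
{
    auto h = haystack.normalize!NFC.byGrapheme.array;
    auto n = needle.normalize!NFC.byGrapheme.array;
    return canFind(h, n);
}

unittest
{
    // The decomposed and precomposed spellings of "fiancé" match...
    assert(graphemeFind("fiance\u0301", "fiancé"));
    // ...but "fiance" does not: the final grapheme of the haystack is "é".
    assert(!graphemeFind("fiance\u0301", "fiance"));
}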

But right now, all this arguing for or against a new string type is just stacking one hypothesis against another; it won't lead anywhere.

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
