On Mon, 24 Oct 2011 19:58:59 -0400, Michel Fortin <michel.for...@michelf.com> wrote:

On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer" <schvei...@yahoo.com> said:

What if the source character is encoded differently than the search
string?  This is basic Unicode stuff.  See my example with fiancé.

The more I think about it, the more I think it should work like this: just as we assume they contain well-formed UTF sequences, char[], wchar[], and dchar[] should also be assumed to contain **normalized** Unicode strings. Which normalization form to use? No idea. Just pick a sensible one of the four (NFC, NFD, NFKC, NFKD).

Once we know all strings are normalized in the same way, we can compare two strings bitwise to check whether they're the same. We can check for a substring in the same manner, except that we need to add a simple check after the match to verify that it isn't surrounded by combining marks applying to its first or last character. It will also greatly simplify any proper Unicode string handling code we add in the future.
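
To make the post-match check concrete, here is a minimal sketch in Python rather than D, using the standard `unicodedata` module. The function name `find_normalized` is hypothetical, and the sketch assumes both strings are already in the same normalization form. It treats any code point in a Unicode "Mark" category as a combining mark, which is a simplification of full grapheme-cluster segmentation.

```python
import unicodedata


def is_mark(ch):
    # General categories Mn, Mc and Me are the Unicode "Mark" categories.
    return unicodedata.category(ch).startswith("M")


def find_normalized(haystack, needle):
    """Return the index of the first acceptable match of needle, or -1.

    Assumes haystack and needle share one normalization form, so a plain
    code point search is enough to locate candidates.
    """
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        # A mark right after the match would modify the last matched character.
        mark_after = end < len(haystack) and is_mark(haystack[end])
        # A needle that starts with a mark would attach to the character
        # just before the match, so it cannot match in mid-string.
        mark_first = start > 0 and len(needle) > 0 and is_mark(needle[0])
        if not mark_after and not mark_first:
            return start
        # Candidate rejected: keep searching further to the right.
        start = haystack.find(needle, start + 1)
    return -1
```

With a decomposed haystack such as "fiance\u0301 and fiance", searching for "fiance" skips the first candidate, because it is followed by U+0301 (combining acute accent), and reports the second occurrence instead.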

 - - -

That said, I fear that forcing a specific normalization might be problematic. You don't always want to have to normalize everything...

So perhaps we could simplify things a bit more: don't pick a standard normalization form. Just assume that both strings being used are in the same normalization form. Comparison will work, searching for a substring in the way described above will work, and other functions could document which normalization form they accept. Problem solved... somewhat.

It's even easier than this:

a) If you want to do a proper string comparison without knowing what state the Unicode strings are in, use the full-fledged decode-when-needed string type and its associated str.find method.

b) If you know they are both in the same normalized form and want to optimize, use std.algorithm.find(haystack.asArray, needle.asArray).
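
The two options can be sketched as follows, again in Python rather than D. Both function names are hypothetical: `find_any_form` stands in for the proposed string type's str.find, and `find_same_form` for a raw search over the underlying arrays. Normalizing both inputs up front is only one possible way to implement (a); the decode-when-needed type described above would presumably avoid normalizing the whole haystack.

```python
import unicodedata


def find_any_form(haystack, needle, form="NFC"):
    # Option (a): the inputs may be in different normalization forms,
    # so bring both to one form before searching.
    # The returned index refers to the normalized haystack.
    normalized_haystack = unicodedata.normalize(form, haystack)
    normalized_needle = unicodedata.normalize(form, needle)
    return normalized_haystack.find(normalized_needle)


def find_same_form(haystack, needle):
    # Option (b): the caller guarantees a shared normalization form,
    # so a plain code point search is enough.
    return haystack.find(needle)


decomposed = "my fiance\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "fianc\u00e9"      # é as the single code point U+00E9

index_a = find_any_form(decomposed, precomposed)
index_b = find_same_form(decomposed, precomposed)
```

Here `index_a` finds the word, while `index_b` misses it because the guarantee that option (b) relies on does not hold for these two inputs.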

