On Mon, 24 Oct 2011 19:58:59 -0400, Michel Fortin
<michel.for...@michelf.com> wrote:
On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer"
<schvei...@yahoo.com> said:
What if the source character is encoded differently than the search
string? This is basic unicode stuff. See my example with fiancé.
The more I think about it, the more I think it should work like this:
just like we assume they contain well-formed UTF sequences, char[],
wchar[], and dchar[] should also be assumed to contain **normalized**
unicode strings. Which normalization form to use? No idea; just pick a
sensible one of the four.
<http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>
Once we know all strings are normalized in the same way, we can compare
two strings bitwise to check whether they're the same. We can also check
for a substring in the same manner, except that after a match we need a
simple check to verify that the match isn't surrounded by combining
marks applying to its first or last character. This would also greatly
simplify any proper Unicode string-handling code we add in the future.
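To illustrate the idea (a Python sketch using the standard unicodedata module, since the D machinery under discussion is hypothetical): once both strings share a normalization form, equality is a plain bitwise comparison, and substring search only needs the boundary check for combining marks described above.

```python
import unicodedata

def find_normalized(haystack: str, needle: str) -> int:
    """Find needle in haystack, both assumed to be in the same
    normalization form. A bitwise match is rejected if it starts or
    ends in the middle of a grapheme, i.e. if the matched region
    begins with, or is followed by, a combining mark."""
    start = 0
    while True:
        i = haystack.find(needle, start)
        if i < 0:
            return -1
        end = i + len(needle)
        # Match must not begin with a combining mark attached to a
        # preceding base character...
        starts_ok = i == 0 or unicodedata.combining(haystack[i]) == 0
        # ...and must not be followed by a combining mark that would
        # modify the match's last character.
        ends_ok = end == len(haystack) or unicodedata.combining(haystack[end]) == 0
        if starts_ok and ends_ok:
            return i
        start = i + 1
```

For example, searching for "fiance" in "fiance" + U+0301 (combining acute) is rejected, because the match would split the final grapheme "é".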
- - -
That said, I fear that forcing a specific normalization might be
problematic. You don't always want to have to normalize everything...
<http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences>
So perhaps we could simplify things a bit more: don't pick a standard
normalization form. Just assume that both strings being used are in the
same normalization form. Comparison will work, searching for a substring
in the manner specified above will work, and other functions can
document which normalization form they accept. Problem solved... somewhat.
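A quick illustration of the "same form" assumption, again as a Python sketch with unicodedata (the concrete API here is not from the thread):

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "fianc\u00e9")  # é as one code point
nfd = unicodedata.normalize("NFD", "fianc\u00e9")  # e + combining acute

assert nfc != nfd                                  # mixed forms: bitwise compare fails
assert unicodedata.normalize("NFC", nfd) == nfc    # same form: bitwise compare works
```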
It's even easier than this:
a) If you want to do a proper string comparison without knowing what
normalization state the Unicode strings are in, use the full-fledged
decode-when-needed string type and its associated str.find method.
b) If you know they are both in the same normalization form and want to
optimize, use std.algorithm.find(haystack.asArray, needle.asArray).
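The string type, str.find, and asArray above are proposed rather than existing APIs; a rough Python analogue of the two tiers, using unicodedata for the normalization step, might look like this:

```python
import unicodedata

def proper_find(haystack: str, needle: str) -> int:
    """Tier (a): correct whatever forms the inputs are in -- normalize
    both sides first. Note the returned index is into the normalized
    haystack, which may differ from the original."""
    return unicodedata.normalize("NFC", haystack).find(
        unicodedata.normalize("NFC", needle))

def fast_find(haystack: str, needle: str) -> int:
    """Tier (b): the caller guarantees both strings share a
    normalization form, so a plain code-unit search suffices."""
    return haystack.find(needle)

# With mixed forms, only the normalizing tier finds the match:
# proper_find("fiance\u0301", "fianc\u00e9") -> 0
# fast_find("fiance\u0301", "fianc\u00e9")   -> -1
```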
-Steve