On Mon, 24 Oct 2011 19:58:59 -0400, Michel Fortin
<michel.for...@michelf.com> wrote:
On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer"
<schvei...@yahoo.com> said:
What if the source character is encoded differently than the search
string? This is basic unicode stuff. See my example with fiancé.
The more I think about it, the more I think it should work like this:
just like we assume they contain well-formed UTF sequences, char[],
wchar[], and dchar[] should also be assumed to contain **normalized**
unicode strings. Which normalization form to use? No idea; just pick a
sensible one of the four.
<http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>
Once we know all strings are normalized in the same way, we can compare
two strings bitwise to check whether they're the same. We can also check
for a substring in the same manner, except that after a match we need a
simple check to verify that the match isn't surrounded by combining
marks applying to its first or last character. This would also greatly
simplify any proper Unicode string-handling code we add in the future.
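To illustrate the idea (a Python sketch using the standard unicodedata module, since the D machinery under discussion is hypothetical): once both strings share a normalization form, equality is a plain bitwise comparison, and substring search only needs the boundary check for combining marks described above.

```python
import unicodedata

def find_normalized(haystack: str, needle: str) -> int:
    """Find needle in haystack, both assumed to be in the same
    normalization form. A bitwise match is rejected if it starts or
    ends in the middle of a grapheme, i.e. if the matched region
    begins with, or is followed by, a combining mark."""
    start = 0
    while True:
        i = haystack.find(needle, start)
        if i < 0:
            return -1
        end = i + len(needle)
        # Match must not begin with a combining mark attached to a
        # preceding base character...
        starts_ok = i == 0 or unicodedata.combining(haystack[i]) == 0
        # ...and must not be followed by a combining mark that would
        # modify the match's last character.
        ends_ok = end == len(haystack) or unicodedata.combining(haystack[end]) == 0
        if starts_ok and ends_ok:
            return i
        start = i + 1
```

For example, searching for "fiance" in "fiance" + U+0301 (combining acute) is rejected, because the match would split the final grapheme "é".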
- - -
That said, I fear that forcing a specific normalization might be
problematic. You don't always want to have to normalize everything...
<http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences>
So perhaps we could simplify things a bit more: don't pick a standard
normalization form. Just assume that both strings being used are in the
same normalization form. Comparison will work, searching for a substring
in the manner specified above will work, and other functions can
document which normalization form they accept. Problem solved... somewhat.
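A quick illustration of the "same form" assumption, again as a Python sketch with unicodedata (the concrete API here is not from the thread):

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "fianc\u00e9")  # é as one code point
nfd = unicodedata.normalize("NFD", "fianc\u00e9")  # e + combining acute

assert nfc != nfd                                  # mixed forms: bitwise compare fails
assert unicodedata.normalize("NFC", nfd) == nfc    # same form: bitwise compare works
```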
It's even easier than this:
a) If you want to do a proper string comparison without knowing what
normalization state the Unicode strings are in, use the full-fledged
decode-when-needed string type and its associated str.find method.
b) If you know they are both in the same normalization form and want to
optimize, use std.algorithm.find(haystack.asArray, needle.asArray).
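The string type, str.find, and asArray above are proposed rather than existing APIs; a rough Python analogue of the two tiers, using unicodedata for the normalization step, might look like this:

```python
import unicodedata

def proper_find(haystack: str, needle: str) -> int:
    """Tier (a): correct whatever forms the inputs are in -- normalize
    both sides first. Note the returned index is into the normalized
    haystack, which may differ from the original."""
    return unicodedata.normalize("NFC", haystack).find(
        unicodedata.normalize("NFC", needle))

def fast_find(haystack: str, needle: str) -> int:
    """Tier (b): the caller guarantees both strings share a
    normalization form, so a plain code-unit search suffices."""
    return haystack.find(needle)

# With mixed forms, only the normalizing tier finds the match:
# proper_find("fiance\u0301", "fianc\u00e9") -> 0
# fast_find("fiance\u0301", "fianc\u00e9")   -> -1
```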
-Steve