On Saturday, October 29, 2011 09:42:54 Andrei Alexandrescu wrote:
> On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
> > On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
> > <simen.kja...@gmail.com> wrote:
> >> On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
> >> <schvei...@yahoo.com> wrote:
> >>> Plus, a combining character (such as an umlaut or accent) is part of
> >>> a character, but may be a separate code point.
> >>
> >> If this is correct (and it is), then decoding to dchar is simply not
> >> enough. You seem to advocate decoding to graphemes, which is a whole
> >> different matter.
> >
> > I am advocating that. And it's a matter of perception. D can say "we
> > only support code-point decoding" and what that means to a user is, "we
> > don't support language as you know it." Sure it's a part of unicode, but
> > it takes that extra piece to make it actually usable to people who
> > require unicode.
> >
> > Even in English, fiancé has an accent. To say D supports unicode, but
> > then won't do a simple search on a file which contains a certain *valid*
> > encoding of that word is disingenuous to say the least.
>
> Why doesn't that simple search work?
>
> foreach (line; stdin.byLine()) {
>     if (line.canFind("fiancé")) {
>         writeln("There it is.");
>     }
> }
If the strings aren't normalized the same way, then it might not find fiancé. If they _are_ normalized the same way and fiancé is in there, except that the é is actually modified by another code point after it (e.g. a subscript 2 - not exactly likely in this case, but certainly possible), then that string would be found when it shouldn't be.

The bigger problem though, I think, is when you're searching for a string which is the same without the modifiers - which would be fiance in this case - since then, if the modifying code points come after, find will think that it found the string that you were looking for when it didn't. Once you're dealing with modifying code points, in the general case, you _must_ operate at the grapheme level to ensure that you find exactly what you're looking for and only what you're looking for.

If we assume that all strings are normalized the same way and pick the right normalization for it (and provide a function to normalize strings that way, of course), then we could probably make that work 100% of the time (assuming that there's a normalized form with all of the modifying code points coming _before_ the code point that they modify, and that no modifying code point can be a character on its own), but I'd have to study up on it more to be sure.

Regardless, while searching for fiancé has a decent chance of success (especially if programs generally favor using single code points instead of multiple code points wherever possible), it's still a risky proposition without at least doing unicode normalization, if not outright using a range of graphemes rather than code points.

- Jonathan M Davis
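To make those failure modes concrete, here is a small sketch. Python is used only because its standard unicodedata module makes normalization easy to demonstrate; the same behavior applies to any code-point-level substring search, including canFind over a range of dchar in D. The strings below are constructed for illustration, not taken from the thread.

```python
import unicodedata

word_nfc = "fianc\u00e9"                           # é as one code point (U+00E9)
word_nfd = unicodedata.normalize("NFD", word_nfc)  # 'e' + combining acute (U+0301)

# 1. Mismatched normalization: the search misses a string that a human
#    reader would consider present.
haystack = "my " + word_nfd + " arrives today"
print(word_nfc in haystack)   # False - NFC needle, NFD haystack

# 2. False positive: searching for the unaccented "fiance" matches the
#    first six code points of the NFD form, even though the final
#    grapheme is really é.
print("fiance" in word_nfd)   # True, but wrong at the grapheme level

# 3. False positive with an extra modifier: é followed by another
#    combining mark (combining dot below, U+0323) is a different
#    grapheme, yet a code-point search for "fiancé" still matches.
modified = word_nfc + "\u0323"
print(word_nfc in modified)   # True, but the final grapheme differs
```

Normalizing both sides to the same form fixes case 1, but cases 2 and 3 can only be ruled out by comparing grapheme clusters, which is the point being made above.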