On Sat, Mar 08, 2014 at 08:38:40PM +0000, Vladimir Panteleev wrote:
> On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
> >Searching for characters in strings would be difficult to deem
> >inappropriate.
>
> The notion of "character" exists only in certain writing systems. It
> is thus a flawed practice, and I think it should not be encouraged,
> as it will only make writing truly-international software more
> difficult. A more correct approach is searching for a certain
> substring. If non-exact matching is needed (normalization, case
> insensitivity etc.), then the appropriate solution is to use the
> Unicode algorithms.

+1. Most "character"-based Unicode string operations are actually
*substring* operations, because the notion of "character" is not
universal to every writing system, and doesn't map 1-to-1 to Unicode
code points anyway. I would argue that most instances of code that
perform character-based operations on strings are incorrect, in the
sense that they will fail to correctly process strings in certain
languages.

[...]
> >From experience with C++ I knew (1) had a bad track record, and
> >(2) "generically conservative, specialize for speed" was a
> >successful pattern.
> >
> >What would you have chosen given that context?
>
> Ideally, we would have the Unicode algorithms in the standard
> library from day 1, and advocated their use throughout the
> documentation.

+1. I came to D expecting this to be the case... and was a little let
down when I discovered the actual state of affairs in std.uni at the
time. Thankfully, things have improved since, and all those who worked
on that have my gratitude. But it's still not quite there yet.

[...]
> >>So the problem to me is that we're stuck not fixing something that's
> >>horribly broken just because it's broken in a way that people
> >>presumably now expect.
> >
> >Clearly I'm being subjective here but again I'd find it difficult to
> >get convinced we have something horribly broken from the evidence I
> >gathered inside and outside Facebook.
>
> Have you or anyone you personally know tried to process text in D
> containing a writing system such as Sanskrit's?
[...]

Or more to the point, do you know of any experience that you can share
about code that attempts to process these sorts of strings on a
per-character basis? My suspicion is that any code that operates on such
strings, if it has any claim to correctness at all, must be
substring-based, rather than character-based.


T

-- 
I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
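
To make the character-versus-substring point above concrete, here is a minimal sketch. It is written in Python purely as a neutral illustration and does not use or represent D's std.uni. The helper names `has_code_point` and `find_normalized` are hypothetical, and NFC normalization stands in for "the Unicode algorithms" mentioned in the thread, which cover far more than this.

```python
import unicodedata


def has_code_point(text: str, code_point: str) -> bool:
    """Naive 'character' search: compares one code point at a time."""
    return any(c == code_point for c in text)


def find_normalized(haystack: str, needle: str, form: str = "NFC") -> int:
    """Substring search after normalizing both sides to the same form.

    Returns an index into the *normalized* haystack, or -1 if not found.
    """
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return h.find(n)


# "noël" spelled two ways: 'e' + COMBINING DIAERESIS versus precomposed U+00EB.
decomposed = "noe\u0308l"
precomposed = "no\u00EBl"

# The two spellings look identical but have different code point counts.
assert len(decomposed) == 5
assert len(precomposed) == 4

# Per-code-point search reports a plain 'e' that a reader would not see,
# and misses the "ë" that a reader would see.
assert has_code_point(decomposed, "e")
assert not has_code_point(decomposed, "\u00EB")

# Normalized substring search finds "ë" in either spelling.
assert find_normalized(decomposed, "\u00EB") == 2
assert find_normalized(precomposed, "e\u0308") == 2

# Devanagari "कि" (ki): one unit to a reader, two code points in storage
# (KA, U+0915, followed by VOWEL SIGN I, U+093F).
ki = "\u0915\u093F"
assert len(ki) == 2
```

The per-code-point search is not wrong about the bytes; it is wrong about what a reader perceives. Treating both the text and the query as strings, and normalizing them the same way first, is what makes the match meaningful.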