On Fri, Mar 07, 2014 at 11:13:50PM +0000, Sarath Kodali wrote:
> On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
> >
> > +1
> > In Indian languages, a character consists of one or more UNICODE
> > code points. For example, in Sanskrit "ddhrya"
> > http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
> > consists of 7 UNICODE code points. So to search for this char I
> > have to use string search.
> >
> > - Sarath
>
> Oops, incomplete reply ...
>
> Since a single "alphabet" in Indian languages can contain multiple
> code-points, iterating over single code-points is like iterating
> over char[] for non English European languages. So decode is of no
> use other than decreasing the performance. A raw char[] comparison
> is much faster.
Yes. The more I think about it, the more auto-decoding sounds like a
wrong decision. The question, though, is whether it's worth the massive
code breakage needed to undo it. :-(

> And then there is this "unicode normalization" that makes it very
> difficult for string searches or comparisons.
[...]

I believe the convention is to always normalize strings before
performing operations on them, in order to prevent these sorts of
problems. I think many of the Unicode-prescribed algorithms have
normalization as a prerequisite, since otherwise there's no guarantee
that the algorithm will produce the correct results.
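For example, something along these lines (an untested sketch using
std.uni and std.range; I'm using a Latin combining-mark string here as
a stand-in for the Devanagari conjuncts mentioned above, which span
more code points but behave the same way):

    import std.range;
    import std.stdio;
    import std.uni;

    void main()
    {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one
        // user-perceived character, two code points, three UTF-8
        // code units.
        string s = "e\u0301";

        writeln(s.length);                // 3: char[] code units
        writeln(s.walkLength);            // 2: code points (auto-decoded)
        writeln(s.byGrapheme.walkLength); // 1: grapheme cluster

        // Normalize before comparing, so the precomposed and the
        // combining-mark spellings of the same character compare equal.
        string precomposed = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        assert(s != precomposed);
        assert(normalize!NFC(s) == normalize!NFC(precomposed));
    }

The same idea applies to searching: normalize both haystack and needle
up front, and then a plain substring search over char[] does the right
thing without decoding anything on every comparison.

T

-- 
"I'm not childish; I'm just in touch with the child within!" - RL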