On Fri, Mar 07, 2014 at 11:13:50PM +0000, Sarath Kodali wrote:
> On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
> >
> >+1
> >In Indian languages, a character consists of one or more Unicode
> >code points. For example, the Sanskrit "ddhrya"
> >http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
> >consists of 7 Unicode code points, so to search for this character
> >I have to use a substring search.
> >
> >- Sarath
> 
> Oops, incomplete reply ...
> 
> Since a single "alphabet" in Indian languages can span multiple
> code points, iterating over individual code points is like iterating
> over char[] for non-English European languages. So decode is of no
> use other than decreasing performance. A raw char[] comparison
> is much faster.

Yes. The more I think about it, the more auto-decoding sounds like the
wrong decision. The question, though, is whether it's worth the massive
code breakage needed to undo it. :-(
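
To illustrate (a minimal D sketch; it assumes std.uni.byGrapheme,
which groups code points into grapheme clusters, and uses a simple
2-code-point example rather than the 7-code-point ddhrya above):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // "é" spelled as 'e' + combining acute accent:
        // two code points, but one user-perceived character
        string s = "e\u0301";

        // auto-decoding iterates dchars, i.e. code points
        writeln(s.walkLength);            // prints 2

        // byGrapheme groups combining marks with their base char
        writeln(s.byGrapheme.walkLength); // prints 1
    }

Neither code units nor code points line up with what the user calls a
character here, which is why decoding to code points by default buys
so little.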


> And then there is this "Unicode normalization" that makes string
> searches and comparisons very difficult.
[...]

I believe the convention is to always normalize strings before
performing operations on them, precisely to prevent these sorts of
problems. I think many of the Unicode-prescribed algorithms have
normalization as a prerequisite, since otherwise there's no guarantee
that they will produce correct results.
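
For example (a sketch; it assumes std.uni.normalize and the NFC form
it exposes, which composes combining sequences into precomposed code
points where possible):

    import std.uni : normalize, NFC;

    void main()
    {
        string precomposed = "\u00E9";   // é as one code point
        string decomposed  = "e\u0301";  // 'e' + combining acute

        // a raw comparison sees different code-point sequences
        assert(precomposed != decomposed);

        // normalizing both sides first makes the comparison agree
        assert(normalize!NFC(precomposed) ==
               normalize!NFC(decomposed));
    }

The same applies to searching: normalize both haystack and needle to
the same form before comparing the underlying arrays.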


T

-- 
"I'm not childish; I'm just in touch with the child within!" - RL
