On Saturday, 8 March 2014 at 20:52:40 UTC, H. S. Teoh wrote:
Or more to the point, do you know of any experience that you can share about code that attempts to process these sorts of strings on a per character basis? My suspicion is that any code that operates on such strings, if they have any claim to correctness at all, must be substring-based, rather than character-based.

That's pretty much it. Unless you are working within the confines of certain languages (alphabets, scripts, etc.), many notions that are valid for English or other European languages lose their meaning in general. This includes the notion of "characters" - in the fully general case, you can only treat a string as a stream of code units (or code points, if you wish, but as has been discussed to death, that is rarely useful on its own).
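For example (a minimal, untested sketch in D using Phobos' std.range and std.uni; the combining-mark string is just an illustration of mine): the same string gives three different answers to "how long is it?" depending on whether you count code units, code points or graphemes, which is exactly why "number of characters" isn't a well-defined question in general.

import std.stdio : writeln;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "noël" with 'ë' written as 'e' + combining diaeresis (U+0308)
    string s = "noe\u0308l";

    writeln(s.length);                // 6 code units (UTF-8 bytes)
    writeln(s.walkLength);            // 5 code points (decoded dchars)
    writeln(s.byGrapheme.walkLength); // 4 graphemes (what a user would call characters)
}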

An application that has to handle user text (which may be in any language) pretty much has to treat string variables as "holy":
- no indexing
- no slicing
- no counting anything
- no toUpper/toLower (whether from std.ascii or std.uni)
etc. (a sketch of how these operations go wrong follows below)
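To make that concrete, here is a rough sketch (D with Phobos; the particular string and the Turkish case-mapping example are mine, not something from the thread): indexing hands you a raw code unit, slicing at an arbitrary index can cut a code point in half, counting gives you code units rather than anything a user would recognize, and even std.uni's case mapping is locale-blind.

import std.stdio : writeln;
import std.uni : toUpper;
import std.utf : validate, UTFException;

void main()
{
    string s = "église";          // 'é' takes two UTF-8 code units (0xC3 0xA9)

    // Indexing yields a raw code unit, not a character:
    writeln(cast(ubyte) s[0]);    // 195, the first half of 'é'

    // Slicing at an arbitrary index can split a code point, giving invalid UTF-8:
    auto broken = s[0 .. 1];
    try { validate(broken); }
    catch (UTFException e) { writeln("slice is not valid UTF-8"); }

    // Counting code units says 7; a reader sees 6 letters:
    writeln(s.length);            // 7

    // Even Unicode-aware case mapping is locale-blind: for Turkish text the
    // uppercase of "i" should be "İ" (U+0130), but we get plain "I":
    writeln(toUpper("i"));        // "I"
}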

All processing and transformations (line breaking, normalization, etc.) need to be done using the relevant Unicode algorithms.
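Normalization, for instance (a minimal sketch assuming Phobos' std.uni.normalize; the café string is just my example): two spellings that a user would consider the same word compare unequal until both sides are brought to the same normalization form.

import std.stdio : writeln;
import std.uni;   // normalize, NFC

void main()
{
    string precomposed = "caf\u00E9";   // 'é' as a single code point (U+00E9)
    string decomposed  = "cafe\u0301";  // 'e' followed by combining acute (U+0301)

    writeln(precomposed == decomposed);                               // false
    writeln(normalize!NFC(precomposed) == normalize!NFC(decomposed)); // true
}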

I've posted something earlier which I'd like to take back:

[a-z] makes sense in English, and [а-я] makes sense in Russian

[а-я] makes sense for Russian, but it doesn't for Ukrainian, in the same way that [a-z] is useless for Portuguese. There are probably only a few ranges in Unicode that encompass exactly one alphabet, given how much the letters of similar languages' alphabets overlap.
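As a quick illustration (a sketch in D; the letter list is simply the Ukrainian-specific lowercase letters as I remember them): these letters sit outside the U+0430..U+044F block that [а-я] covers, so a plain range check silently drops them, much like [a-z] drops Portuguese's accented letters.

import std.stdio : writefln;

void main()
{
    // Ukrainian-specific lowercase letters (U+0491, U+0454, U+0456, U+0457)
    dchar[] letters = ['ґ', 'є', 'і', 'ї'];

    foreach (c; letters)
        writefln("%s (U+%04X) matches [а-я]? %s",
                 c, cast(uint) c, c >= 'а' && c <= 'я');
    // prints "false" for all four
}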
