On Saturday, 8 March 2014 at 20:52:40 UTC, H. S. Teoh wrote:
Or more to the point, do you know of any experience that you can share about code that attempts to process these sorts of strings on a per-character basis? My suspicion is that any code that operates on such strings, if they have any claim to correctness at all, must be substring-based, rather than character-based.
That's pretty much it. Unless you are working in the confines of
certain languages (alphabets, scripts, etc.), many notions that
are valid for English or European languages lose meaning in
general. This includes the notion of "characters" - at full
abstraction, you can only treat a string as a stream of code
units (or code points, if you wish, but as has been discussed to
death this is rarely useful).
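To make the code-unit / code-point distinction concrete, here is a small illustration (in Python rather than D, since the behavior is the same in any language): one user-perceived character can be one or several code points, and each code point can be several code units, so neither count is "the number of characters".

```python
import unicodedata

s = "é"                                    # one code point: U+00E9 (NFC form)
d = unicodedata.normalize("NFD", s)        # decomposed: 'e' + U+0301 combining acute

print(len(s))                  # 1 code point
print(len(d))                  # 2 code points, same visible character
print(len(s.encode("utf-8")))  # 2 UTF-8 code units
print(len(d.encode("utf-8")))  # 3 UTF-8 code units
```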
An application that has to handle user text (text which may be in any language) pretty much has to treat string variables as "holy":
- no indexing
- no slicing
- no counting anything
- no toUpper/toLower (std.ascii or std.uni)
etc.
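Each of these operations fails on real-world text. A quick Python sketch of the pitfalls (the thread is about D's std.ascii/std.uni, but the same failures occur in any language):

```python
import unicodedata

word = unicodedata.normalize("NFD", "café")  # 'cafe' + combining acute accent

# Slicing by code points can split a base letter from its combining mark:
print(word[:4])          # "cafe" -- the accent has been cut off the 'e'

# Per-character case mapping is not 1:1: German ß uppercases to "SS",
# so the string's length changes from 6 to 7:
print("straße".upper())  # "STRASSE"

# Counting code points is not counting characters:
print(len(word))         # 5, though a reader sees 4 characters
```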
All processing and transformations (line breaking, normalization,
etc.) needs to be done using the relevant Unicode algorithms.
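For normalization specifically, the point is that two strings which render identically can differ at the code-point level, so any comparison must normalize first. A minimal sketch using Python's unicodedata module:

```python
import unicodedata

nfc = "café"                             # 'é' stored as one code point (U+00E9)
nfd = unicodedata.normalize("NFD", nfc)  # 'é' stored as 'e' + U+0301

# Naive comparison sees two different strings:
print(nfc == nfd)                                # False

# Comparing after normalizing both to the same form works:
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```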
I posted something earlier that I'd like to take back:
[a-z] makes sense in English, and [а-я] makes sense in Russian
[а-я] makes sense for Russian, but not for Ukrainian, in the same way that [a-z] is useless for Portuguese. There are probably only a few ranges in Unicode that encompass exactly one alphabet, because letters overlap so much across the alphabets of similar languages.
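This is easy to check with a regex engine (Python here; the letter samples are my own picks, not from the post): [а-я] covers U+0430 through U+044F, which misses Ukrainian letters such as і, ї, є, ґ, and for that matter even Russian ё (U+0451), while [a-z] misses Portuguese letters like ç and ã.

```python
import re

ru_range = re.compile(r"^[а-я]+$")

print(bool(ru_range.match("привет")))        # True: plain Russian fits the range
print(bool(ru_range.match("ґанок")))         # False: Ukrainian ґ (U+0491) is outside
print(bool(ru_range.match("ёлка")))          # False: even Russian ё (U+0451) is outside
print(bool(re.match(r"^[a-z]+$", "ação")))   # False: Portuguese ç, ã fall outside a-z
```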