Walter Bright wrote:
spir wrote:
In my view, there is a missing level of abstraction in common UString
processing libs and types. How to count the "â"s in a text? How to
find one? Above, indexOf fails because my editor uses a precomposed
code, while the source (here a literal) uses another form.
To be able to produce meaningful results, and to use simple routines
like index, find, count..., the way we used to with single-length
character sets, there should be a grouping phase on top of decoding;
we would then process arrays of "stacks" representing characters, not
of codes. To search, it's also necessary to have all characters in
normalised form, so that both "â" would match: another phase.
Unicode provides algorithms for those phases in constructing string
representations -- but everyone seems to ignore the issues... s[0..1]
would then return the first character, not the first code of the
"stack" representing the first character.
http://www.digitalmars.com/d/2.0/phobos/std_utf.html
If I'm not mistaken, those functions don't handle these "graphemes", i.e.
something that appears like one character on the screen, but consists of
multiple code *points*. Like spir's "â" that, in UTF-8, is encoded with the
following bytes: 0x61 (=='a'), 0xCC, 0x82. (Or \u0061\u0302 in UTF-32).
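To make the mismatch concrete, here is a minimal sketch. It assumes a present-day Phobos in which std.uni provides byGrapheme and normalize; neither is mentioned in this thread, and the variable names are only illustrative. It shows the three UTF-8 bytes of the decomposed form, why a plain indexOf misses the precomposed form, and how normalising first makes the two compare equal:

```d
import std.range : walkLength;
import std.string : indexOf;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // Two spellings of the same visible character "â".
    string decomposed  = "a\u0302"; // 'a' followed by COMBINING CIRCUMFLEX ACCENT
    string precomposed = "\u00E2";  // LATIN SMALL LETTER A WITH CIRCUMFLEX

    // The decomposed form is three UTF-8 code units: 0x61, 0xCC, 0x82.
    assert(decomposed.length == 3);
    assert(decomposed[0] == 0x61 && decomposed[1] == 0xCC && decomposed[2] == 0x82);

    // Two code points, but only one grapheme (one "character" on screen).
    assert(decomposed.walkLength == 2);
    assert(decomposed.byGrapheme.walkLength == 1);

    // A code-unit search does not find one form inside the other.
    assert(decomposed.indexOf(precomposed) == -1);

    // After normalising to NFC, both spellings are identical.
    assert(normalize!NFC(decomposed) == precomposed);
}
```

Normalising both the haystack and the needle to the same form before searching is one way to get the "both â match" behaviour spir asks for; whether NFC or NFD is the better choice depends on the application.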
Also, a function returning the physical position (i.e. the index into the array of chars or
wchars) of logical character number logPos may be useful, e.g. for fixed-width printing:
size_t getPhysPos(char[] str, size_t logPos)
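One possible sketch of such a function, assuming that std.uni.graphemeStride from a present-day Phobos is available (it is not part of the std.utf functions linked above) and that a "logical char" means a grapheme cluster. The const(char)[] parameter and the return value for out-of-range positions are my assumptions, not something the signature above specifies:

```d
import std.uni : graphemeStride;

/// Returns the code-unit index at which grapheme number `logPos` starts.
/// Assumption: if `logPos` is past the last grapheme, `str.length` is returned.
size_t getPhysPos(const(char)[] str, size_t logPos)
{
    size_t physPos = 0;
    while (logPos > 0 && physPos < str.length)
    {
        // graphemeStride gives the length, in code units, of the
        // grapheme cluster that starts at physPos.
        physPos += graphemeStride(str, physPos);
        --logPos;
    }
    return physPos;
}

void main()
{
    string s = "a\u0302x"; // "â" (decomposed, 3 bytes) followed by 'x'
    assert(getPhysPos(s, 0) == 0); // "â" starts at byte 0
    assert(getPhysPos(s, 1) == 3); // 'x' starts after the 3 bytes of "â"
    assert(getPhysPos(s, 2) == 4); // one past the end
}
```

The loop is linear in the length of the string, so for repeated lookups it may be worth caching the grapheme boundaries instead of rescanning each time.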
Cheers,
- Daniel