On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
This can lead to subtle bugs, cf. length of random and e_one.
You have to convert everything to dstring to get the
"expected" result. However, this is not always desirable.
There are three things that you need to be aware of when
handling Unicode: code units, code points and graphemes.
This is why I use a helper function that uses byCodePoint and
byGrapheme. At least for my use cases it returns the correct
length. However, I might think about an alternative version based
on the discussion here.
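The original helper is D code built on std.uni's byCodePoint/byGrapheme, which isn't shown here. As a rough illustration of the idea in Python (stdlib only, no grapheme library), here is a naive cluster counter: it treats every code point that is not a combining mark as the start of a new cluster. This is a simplification of full UAX #29 segmentation (no ZWJ sequences, no Hangul jamo handling), but it shows why a grapheme-aware length differs from a plain length:

```python
import unicodedata

def grapheme_length(s: str) -> int:
    # Count code points with combining class 0 as cluster starts.
    # Combining marks (acute accents, arrows above, ...) attach to
    # the preceding base character and don't start a new cluster.
    return sum(1 for c in s if unicodedata.combining(c) == 0)

decomposed = "e\u0301"       # 'e' + COMBINING ACUTE ACCENT
print(len(decomposed))        # 2 code points
print(grapheme_length(decomposed))  # 1 grapheme cluster
```

A real implementation should follow the UAX #29 boundary rules; this sketch only covers the base-plus-combining-marks case discussed in this thread.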
In general, the length in one of them tells you nothing about the
length in the others. The exception is UTF-32, which is a 1:1
mapping between code units and code points.
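To make the unit/point distinction concrete (in Python here, since the point is language-independent: `len` on a `str` counts code points, and encoding yields code units):

```python
s = "\U0001F600"  # one code point outside the BMP (an emoji)

assert len(s) == 1                            # 1 code point
assert len(s.encode("utf-8")) == 4            # 4 UTF-8 code units
assert len(s.encode("utf-16-le")) // 2 == 2   # 2 UTF-16 code units (surrogate pair)
assert len(s.encode("utf-32-le")) // 4 == 1   # UTF-32: always 1:1 with code points
```

In D terms, the same string would have different `.length` as `string` (UTF-8), `wstring` (UTF-16) and `dstring` (UTF-32); only the last matches the code-point count.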
In this thread, we were discussing the relationship between
code points and graphemes. Your examples, however, apply to the
relationship between code units and code points.
To measure the columns needed to print a string, you'll need
the number of graphemes. (d|)?string.length gives you the
number of code units.
If you normalize a string (in the sequence of
characters/code points sense, not object.string) to NFC, it will
first decompose every precomposed character in the string (like
é, a single code point) into base character plus combining
marks, establish a defined order among the combining characters,
and then recompose a selected few back into precomposed form
(like é). This way é always ends up as a single code point in
NFC. There are dozens of other combinations, though, where
you'll still have an n:1 mapping between code points and
graphemes left after normalization.
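Both cases can be shown with Python's stdlib `unicodedata.normalize` (chosen here just because it is easy to run; D's std.uni offers equivalent normalization):

```python
import unicodedata

# é: NFC recomposes 'e' + COMBINING ACUTE into the single code point U+00E9
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
assert len(unicodedata.normalize("NFC", "e\u0301")) == 1

# 'q' + COMBINING ACUTE has no precomposed form in Unicode,
# so NFC leaves it as 2 code points forming 1 grapheme
assert len(unicodedata.normalize("NFC", "q\u0301")) == 2
```

So normalization shrinks some clusters to one code point, but the n:1 code point to grapheme mapping does not go away in general.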
Example given already in this thread: putting an arrow over a
Latin letter is typical in math and is always more than one
code point.
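For instance (again a quick Python check; U+20D7 is COMBINING RIGHT ARROW ABOVE, the usual vector notation):

```python
import unicodedata

vec = "x\u20d7"  # 'x' + COMBINING RIGHT ARROW ABOVE: one grapheme
assert len(vec) == 2                                   # two code points...
assert len(unicodedata.normalize("NFC", vec)) == 2     # ...even after NFC
```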