16-Oct-2013 23:42, qznc wrote:
On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:
On 2013-10-16 14:33, qznc wrote:

It is either [U+00E4] as one code point or [a,U+0308] for two code
points. The second is "combining diaeresis" [0]. Not required, but
possible. Those combining characters [1] provide a nearly infinite
number of combinations. You can go crazy with it:
http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

[0] http://www.fileformat.info/info/unicode/char/0308/index.htm
[1] http://en.wikipedia.org/wiki/Combining_character
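
A minimal sketch of the two encodings described above. It is written in Python with the standard `unicodedata` module purely for illustration (an assumption; the thread itself shows no code), and the variable names are made up for the example:

```python
import unicodedata

precomposed = "\u00e4"   # U+00E4 as a single code point
decomposed = "a\u0308"   # 'a' followed by U+0308 COMBINING DIAERESIS

# The two strings render alike but differ in length when counted in code points.
code_point_counts = (len(precomposed), len(decomposed))

# A plain comparison works on code points, so it sees two different strings.
equal_raw = precomposed == decomposed

# U+0308 has a non-zero canonical combining class, which marks it as a combining mark.
is_combining = unicodedata.combining("\u0308") != 0

print(code_point_counts, equal_raw, is_combining)
```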

Aha, now I see.

One interesting point: with "ba\u00E4r" vs "baa\u0308r", you can run
a replace that turns 'a' into 'o'. Then you'll get:
"boär" vs "boör"

Which is the correct behavior? There is no correct answer.
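
The replace experiment can be reproduced with a short sketch. It uses Python's `str.replace`, which operates on code points; this is an illustrative stand-in, not the D code the poster may have had in mind:

```python
import unicodedata

composed_word = "ba\u00e4r"     # ä stored as U+00E4
decomposed_word = "baa\u0308r"  # ä stored as 'a' + U+0308

# A code-point-level replace cannot see the 'a' hidden inside U+00E4,
# but it does see the base 'a' that precedes the combining diaeresis.
replaced_composed = composed_word.replace("a", "o")
replaced_decomposed = decomposed_word.replace("a", "o")

# After the replace, the diaeresis in the second string sits on an 'o'.
recomposed = unicodedata.normalize("NFC", replaced_decomposed)

print(replaced_composed, recomposed)
```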

So while a grapheme should never be separated from its "letter" (e.g.,
sorting "oäa" should *not* generate "aaö"; what it *should* generate
is up for debate), you can't entirely consider that a letter+grapheme
is a single entity.
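
The sorting problem can be sketched as follows, again in Python for illustration. The `clusters` helper is hypothetical and only attaches combining marks to the preceding character; that is a simplification, not full grapheme segmentation, and the cluster ordering used here is one arbitrary choice among several:

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch   # attach the mark to the previous base character
        else:
            out.append(ch)  # start a new cluster
    return out

word = "oa\u0308a"  # "oäa" with ä in decomposed form

# Sorting individual code points moves U+0308 behind the 'o'.
by_code_point = "".join(sorted(word))

# Sorting whole clusters keeps the diaeresis on its 'a'.
by_cluster = "".join(sorted(clusters(word)))

print(ascii(by_code_point), ascii(by_cluster))
```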

Long story short: unicode is f***ing complicated.

And I think D does a *damn* fine job of supporting it. In particular,
it does an awesome job of *teaching* the coder *what* unicode is.
Virtually everyone here has solid knowledge of unicode (I feel). They
understand it, can explain it, and can work with it.

On the other hand, I don't know many C++ coders that understand unicode.

I agree with your point. Nevertheless, your understanding of grapheme is
off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the
same grapheme as "a\u0308".

s/the same/canonically equivalent/ :)
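
Canonical equivalence can be checked by normalizing both strings to the same form before comparing. A minimal sketch in Python (the helper name `canonically_equivalent` is made up for this example; D's `std.uni` offers a comparable `normalize` function):

```python
import unicodedata

precomposed = "\u00e4"  # U+00E4
decomposed = "a\u0308"  # 'a' + U+0308

# NFC composes, NFD decomposes; either form gives both strings one spelling.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

def canonically_equivalent(a, b):
    """Compare two strings after bringing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(same_after_nfc, same_after_nfd, canonically_equivalent(precomposed, decomposed))
```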


http://en.wikipedia.org/wiki/Grapheme


--
Dmitry Olshansky
