On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
Unicode graphemes are not always the same as graphemes in
natural (written) languages. If <é> is composed in Unicode, it
is still one grapheme in a written language, not two distinct
characters. However, in natural languages two characters can
be one grapheme, as in English <sh>, which represents the sound
in `shower, shop, fish`. In German the same sound is
represented by three characters <sch> as in `Schaf` ("sheep").
A bit nit-picky, but we should make clear that we are talking
about "Unicode graphemes" that map to single characters on the
written page. But is that at all possible across all languages?
To avoid confusion and misunderstandings we should agree on
the terminology first.
No, this is well-established terminology; you are confusing
several things here:
- A grapheme is a "character" as written on the page
- A phoneme is a spoken "character"
- A codepoint is the fundamental "unit" of Unicode
Graphemes are built from one or more codepoints.
Phonemes are a different topic and not really covered by the
Unicode standard AFAIK, except for the IPA notation, but those
are again graphemes that represent phonemes.
I am pretty sure that a single grapheme in Unicode does not
correspond to your notion of "character". I am pretty sure that
what you think of as a "character" is officially called a
"Grapheme Cluster", not a "Grapheme".
See here: http://www.unicode.org/glossary/#grapheme_cluster
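
To make the codepoint vs. cluster distinction concrete, here is a rough sketch in Python (my choice of language, nothing from the thread). It compares the composed and decomposed forms of <é> and groups codepoints into approximate clusters. Note that `approx_clusters` is a hypothetical helper that only attaches combining marks to the preceding base character; it is not the full segmentation algorithm from the Unicode standard, which handles many more cases.

```python
import unicodedata

composed = "\u00e9"     # <é> as a single codepoint
decomposed = "e\u0301"  # <e> followed by COMBINING ACUTE ACCENT


def approx_clusters(text):
    """Group each combining mark with the base character before it.

    Simplified assumption: a codepoint with a non-zero canonical
    combining class continues the current cluster. Real grapheme
    cluster rules cover more than this.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# Same character on the page, different number of codepoints.
print(len(composed), len(decomposed))

# NFC normalization turns the decomposed form into the composed one.
print(unicodedata.normalize("NFC", decomposed) == composed)

# Both forms come out as one cluster under this approximation.
print(len(approx_clusters(composed)), len(approx_clusters(decomposed)))
```

The <sch> in `Schaf` still comes out as three clusters here, which is the point: a linguistic grapheme spanning several letters is not something Unicode segmentation tries to model.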