On 01/14/2011 08:20 PM, Nick Sabalausky wrote:
"spir"<denis.s...@gmail.com> wrote in message
news:mailman.619.1295012086.4748.digitalmar...@puremagic.com...
If anyone finds a pointer to such an explanation, bravo, and thank you.
(You will certainly not find it in Unicode literature, for instance.)
Nick's explanation below is good and concise. (Just 2 notes added.)
Yea, most Unicode explanations seem to talk all about "code-units vs
code-points" and then they'll just have a brief note like "There's also
other things like digraphs and combining codes." And that'll be all they
mention.
You're right about the Unicode literature. It's the usual standards-body
documentation, same as W3C: "Instead of only some people understanding how
this works, let's encode the documentation in legalese (and have twenty
only-slightly-different versions) to make sure that nobody understands how
it works."
If anyone is interested, ICU's documentation is far more readable (and
intended for programmers). ICU is *the* reference library for dealing
with Unicode (an IBM open-source product, with C/C++/Java interfaces),
used by many other products in the background.
ICU: http://site.icu-project.org/
user guide: http://userguide.icu-project.org/
section about text segmentation:
http://userguide.icu-project.org/boundaryanalysis
Note that, just like Unicode, they consider forming graphemes (grouping
codes into character representations) just a particular case of text
segmentation, which they call "boundary analysis" (though they have the
nice idea of saying "character" instead of "grapheme").
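To make "boundary analysis" concrete, here is a toy sketch in Python (my own code, not ICU's algorithm): it approximates character boundaries by attaching combining marks to the preceding base code point. Real boundary analysis handles much more (Hangul jamo, which have combining class 0, are deliberately out of scope here); the function name is mine.

```python
import unicodedata

def characters(text):
    """Crude approximation of character boundary analysis:
    a new "character" starts at every code point that is not a
    combining mark; combining marks (nonzero combining class)
    are attached to the preceding base code point.
    NOTE: does not handle Hangul jamo, which combine despite
    having combining class 0."""
    clusters = []
    for cp in text:
        if clusters and unicodedata.combining(cp) != 0:
            clusters[-1] += cp  # attach combining code to its base
        else:
            clusters.append(cp)
    return clusters

# "e" + COMBINING CIRCUMFLEX ACCENT + "u": 3 code points,
# but only 2 user-perceived characters.
print(characters("e\u0302u"))
```

This is exactly the grouping step the thread calls "forming graphemes"; ICU's BreakIterator additionally returns the boundary *indexes* rather than the substrings.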
The only mention I found in ICU's doc of the issue we have discussed
here at length is (at http://userguide.icu-project.org/strings):
"Handling Lengths, Indexes, and Offsets in Strings
The length of a string and all indexes and offsets related to the string
are always counted in terms of UChar code units, not in terms of UChar32
code points. (This is the same as in common C library functions that use
char * strings with multi-byte encodings.)
Often, a user thinks of a "character" as a complete unit in a language,
like an 'Ä', while it may be represented with multiple Unicode code
points including a base character and combining marks. (See the Unicode
standard for details.) This often requires users to index and pass
strings (UnicodeString or UChar *) with multiple code units or code
points. It cannot be done with single-integer character types. Indexing
of such "characters" is done with the BreakIterator class (in C: ubrk_
functions).
Even with such "higher-level" indexing functions, the actual index
values will be expressed in terms of UChar code units. When more than
one code unit is used at a time, the index value changes by more than
one at a time. [...]"
(ICU's UChar are like D wchar.)
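The code-unit vs. code-point distinction the ICU doc describes is easy to demonstrate: Python's `str` indexes by code point, while ICU's UChar (and D's wchar) count UTF-16 code units, so the two counts diverge for any character outside the Basic Multilingual Plane.

```python
# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
s = "a\U0001D11Eb"

code_points = len(s)                           # Python counts code points
utf16_units = len(s.encode("utf-16-le")) // 2  # the G clef needs a surrogate pair

print(code_points, utf16_units)  # 3 code points, 4 UTF-16 code units
```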
You can also say there are 2 kinds of characters: simple like "u" &
composite like "ü" (possibly with several combining marks). The former
are coded with a single (base) code, the latter with one base code
(rarely more) and an arbitrary number of combining codes.
Couple questions about the "more than one base codes":
- Do you know an example offhand?
No. I know of this only from its being mentioned in documentation.
Unless we count (see below) L jamo as base codes.
- Does that mean like a ligature where the base codes form a single glyph,
or does it mean that the combining code either spans or operates over
multiple glyphs? Or can it go either way?
IIRC, examples like "ij" in Dutch are only considered "compatibility
equivalent" to the corresponding ligatures, just like e.g. "ss" for "ß"
in German. Meaning they should not be considered equal by default; that
would be an additional feature, language- and app-dependent. Unlike
base "e" + combining "^", which really == "ê".
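Both kinds of equivalence can be checked with Python's stdlib `unicodedata`: canonical equivalence composes under NFC, while the "ij" ligature only decomposes under the compatibility normalizations. (One caveat I'm fairly sure of: "ss" for "ß" is handled by case folding in Unicode, not by compatibility decomposition.)

```python
import unicodedata

# Canonical equivalence: base "e" + combining "^" really is "ê";
# NFC composes the pair into the single precomposed code point.
assert unicodedata.normalize("NFC", "e\u0302") == "\u00EA"  # "ê"

# Compatibility equivalence: the "ij" ligature U+0133 survives
# canonical decomposition (NFD) unchanged, and only splits into
# "ij" under compatibility decomposition (NFKD).
assert unicodedata.normalize("NFD", "\u0133") == "\u0133"
assert unicodedata.normalize("NFKD", "\u0133") == "ij"

# "ss" for "ß" is similar in spirit, but Unicode handles it via
# case folding rather than any normalization form.
assert "ß".casefold() == "ss"
print("all equivalence checks passed")
```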
For a majority of _common_ characters made of 2 or 3 codes
(Western-language letters, Korean Hangul syllables, ...), precomposed
codes have been added to the set. Thus, they can be coded with a single
code, like simple characters.
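This, too, can be shown with stdlib `unicodedata`: NFD splits a precomposed code into its base + combining (or jamo) sequence, and NFC reassembles it.

```python
import unicodedata

# One precomposed code point vs. its decomposed sequence, for a
# Western letter and a Hangul syllable.
for ch in ("\u00FC", "\uD55C"):      # "ü", Hangul syllable HAN "한"
    decomposed = unicodedata.normalize("NFD", ch)
    print(ch, len(ch), len(decomposed))
    # "ü"  -> 1 precomposed code, 2 decomposed (u + combining diaeresis)
    # "한" -> 1 precomposed code, 3 decomposed (L + V + T jamo)
    assert unicodedata.normalize("NFC", decomposed) == ch  # round-trips
```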
Out of curiosity, how do decomposed Hangul characters work? (Or do you
know?) Not actually knowing any Korean, my understanding is that they're a
set of 1 to 4 phonetic glyphs that are then combined into one glyph. So,
is it like a series of base codes that automatically combine, or are
there combining characters involved?
I know nothing about Korean language except what I studied about its
scripting system for Unicode algorithms (but one can also code said
algorithm blindly). See http://en.wikipedia.org/wiki/Hangul and about
Hangul in Unicode
http://en.wikipedia.org/wiki/Korean_language_and_computers. What I
understand (beware, these are just wild deductions) is that there are 3
kinds of "jamo" scripting marks (noted L, V, T) that can combine into
syllabic "graphemes", respectively in first, medial, and last place.
These marks indeed roughly correspond to consonant or vowel phonemes.
In Unicode, in addition to such jamo, which are simple marks (like base
letters and diacritics in Latin-based languages), there are precomposed
codes for LV and LVT combinations (as for "ä" or "û"). We could thus
think that Hangul syllables are limited to 3 jamo.
But: according to Unicode's official grapheme cluster break algorithm
(read: how to group codepoints into characters)
(http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), a code
for an L jamo can also be followed by _and_ should be combined with a
following L, V, LV, or LVT code. Similarly, LV or V combines with a
following V or T, and LVT or T with a following T. (Seems logical.) So,
I do not know how complicated a Hangul syllable can be in practice or
in theory.
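Those Hangul-specific rules (GB6-GB8 in UAX #29) are small enough to sketch in Python. The code-point ranges and the no-break table below come from the standard; the function names and the representation are my own, and this deliberately ignores all the non-Hangul rules of the full algorithm.

```python
def jamo_class(cp):
    """Classify one code point for Hangul grapheme clustering
    (ranges from UAX #29 / the Unicode code charts)."""
    n = ord(cp)
    if 0x1100 <= n <= 0x115F:
        return "L"                       # leading consonant jamo
    if 0x1160 <= n <= 0x11A7:
        return "V"                       # vowel jamo
    if 0x11A8 <= n <= 0x11FF:
        return "T"                       # trailing consonant jamo
    if 0xAC00 <= n <= 0xD7A3:            # precomposed syllable block
        return "LV" if (n - 0xAC00) % 28 == 0 else "LVT"
    return "other"

# UAX #29 rules GB6-GB8: pairs that must NOT be broken apart.
NO_BREAK = {
    "L":   {"L", "V", "LV", "LVT"},  # GB6:  L x (L | V | LV | LVT)
    "LV":  {"V", "T"},               # GB7:  (LV | V) x (V | T)
    "V":   {"V", "T"},
    "LVT": {"T"},                    # GB8:  (LVT | T) x T
    "T":   {"T"},
}

def is_break(a, b):
    """True if a grapheme boundary falls between code points a and b
    (Hangul rules only -- everything else breaks)."""
    return jamo_class(b) not in NO_BREAK.get(jamo_class(a), set())

# L + V + T jamo (decomposed "한"): no break anywhere, so the three
# code points form one syllable "grapheme".
seq = "\u1112\u1161\u11AB"
print([is_break(a, b) for a, b in zip(seq, seq[1:])])  # [False, False]
```

Note that GB6 allows L to be followed by another L, which is exactly why syllables beyond the simple L / LV / LVT schemes can form a single grapheme cluster.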
If whole syllables following schemes other than L / LV / LVT can occur
in practice, then this is another example of real-language whole
characters that cannot be coded with a single codepoint.
Denis
_________________
vita es estrany
spir.wikidot.com