On 01/14/2011 08:20 PM, Nick Sabalausky wrote:
"spir"<denis.s...@gmail.com>  wrote in message
news:mailman.619.1295012086.4748.digitalmar...@puremagic.com...

If anyone finds a pointer to such an explanation, bravo, and thank you.
(You will certainly not find it in Unicode literature, for instance.)
Nick's explanation below is good and concise. (Just 2 notes added.)

Yea, most Unicode explanations seem to talk all about "code-units vs
code-points" and then they'll just have a brief note like "There's also
other things like digraphs and combining codes." And that'll be all they
mention.

You're right about the Unicode literature. It's the usual standards-body
documentation, same as W3C: "Instead of only some people understanding how
this works, let's encode the documentation in legalese (and have twenty
only-slightly-different versions) to make sure that nobody understands how
it works."

If anyone is interested, ICU's documentation is far more readable (and intended for programmers). ICU is *the* reference library for dealing with Unicode (an IBM open-source project with C/C++/Java interfaces), used behind the scenes by many other products.
ICU: http://site.icu-project.org/
user guide: http://userguide.icu-project.org/
section about text segmentation: http://userguide.icu-project.org/boundaryanalysis

Note that, just like Unicode, they consider forming graphemes (grouping codes into character representations) simply a particular case of text segmentation, which they call "boundary analysis" (but they have the nice idea of using "character" instead of "grapheme").
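For illustration, here is a minimal sketch of such boundary analysis in Java, using the JDK's java.text.BreakIterator (which mirrors ICU's BreakIterator design; I use the JDK class here just to keep the example self-contained):

    import java.text.BreakIterator;

    public class CharacterBoundaries {
        public static void main(String[] args) {
            // "e" + U+0302 COMBINING CIRCUMFLEX ACCENT, then "tre":
            // five code points, but four user-perceived characters ("etre" with a circumflex).
            String text = "e\u0302tre";
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(text);
            int count = 0;
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
                System.out.println("character: " + text.substring(start, end));
                start = end;
                count++;
            }
            System.out.println(count);  // 4, not 5: "e\u0302" is one "character"
        }
    }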

The only mention I found in ICU's docs of the issue we have discussed at length here is (at http://userguide.icu-project.org/strings):
"Handling Lengths, Indexes, and Offsets in Strings

The length of a string and all indexes and offsets related to the string are always counted in terms of UChar code units, not in terms of UChar32 code points. (This is the same as in common C library functions that use char * strings with multi-byte encodings.)

Often, a user thinks of a "character" as a complete unit in a language, like an 'Ä', while it may be represented with multiple Unicode code points including a base character and combining marks. (See the Unicode standard for details.) This often requires users to index and pass strings (UnicodeString or UChar *) with multiple code units or code points. It cannot be done with single-integer character types. Indexing of such "characters" is done with the BreakIterator class (in C: ubrk_ functions).

Even with such "higher-level" indexing functions, the actual index values will be expressed in terms of UChar code units. When more than one code unit is used at a time, the index value changes by more than one at a time. [...]

(ICU's UChar is like D's wchar.)
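To make the code-unit vs code-point distinction concrete, a small Java illustration (Java strings are UTF-16, like ICU's UChar strings and D's wstring; U+1D11E is just an arbitrary character from outside the BMP):

    public class UnitsVsPoints {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF: one code point,
            // encoded as a surrogate pair = two UTF-16 code units.
            String s = "\uD834\uDD1E";
            System.out.println(s.length());                      // 2 (code units)
            System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
        }
    }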

You can also say there are 2 kinds of characters: simple like "u" &
composite like "ü" (possibly carrying several combining marks). The former
are coded with a single (base) code, the latter with one base code (rarely
more) and an arbitrary number of combining codes.

Couple questions about the "more than one base code" case:

- Do you know an example offhand?

No. I know this only from it being mentioned in documentation. Unless we consider (see below) L jamo as base codes.

- Does that mean like a ligature where the base codes form a single glyph,
or does it mean that the combining code either spans or operates over
multiple glyphs? Or can it go either way?

IIRC, examples like "ij" in Dutch are only considered "compatibility equivalent" to the corresponding ligatures, just like e.g. "ss" for "ß" in German. Meaning they should not be considered equal by default; treating them as equal would be an additional feature, and language- and application-dependent. Unlike base "e" + combining "^", which really is canonically equal to "ê".
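A quick way to see the difference between canonical and compatibility equivalence is java.text.Normalizer; a sketch (I use the "ij" ligature U+0133 as a stand-in for the Dutch digraph case):

    import java.text.Normalizer;

    public class Equivalences {
        public static void main(String[] args) {
            // Canonical equivalence: precombined "ê" vs "e" + combining "^".
            String composed = "\u00EA";
            String decomposed = "e\u0302";
            System.out.println(
                Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                          .equals(composed));  // true

            // Compatibility equivalence: the "ij" ligature vs plain "ij".
            String ligature = "\u0133";
            System.out.println(
                Normalizer.normalize(ligature, Normalizer.Form.NFC)
                          .equals("ij"));      // false: not canonically equal
            System.out.println(
                Normalizer.normalize(ligature, Normalizer.Form.NFKC)
                          .equals("ij"));      // true: only compatibility-equal
        }
    }

(Note that "ß" is not even compatibility-decomposed to "ss" by NFKC; matching those is a case-folding matter, which supports the point that such equivalences are opt-in and application-dependent.)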

For a majority of _common_ characters made of 2 or 3 codes (Western
language letters, Korean Hangul syllables, ...), precombined codes have
been added to the set. Thus, they can be coded with a single code, like
simple characters.


Out of curiosity, how do decomposed Hangul characters work? (Or do you
know?) Not actually knowing any Korean, my understanding is that they're a
set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is
it like a series of base codes that automatically combine, or are there
combining characters involved?

I know nothing about the Korean language except what I studied about its scripting system for Unicode algorithms (but one can also code said algorithms blindly). See http://en.wikipedia.org/wiki/Hangul and, about Hangul in Unicode, http://en.wikipedia.org/wiki/Korean_language_and_computers.

What I understand (beware, it's just wild deduction) is that there are 3 kinds of "jamo" scripting marks (noted L, V, T) that can combine into syllabic "graphemes", respectively in initial, medial, and final position. These marks indeed somehow correspond to vocalic or consonantic phonemes. In Unicode, in addition to such jamo, which are simple marks (like base letters and diacritics in Latin-based languages), there are precombined codes for LV and LVT combinations (like for "ä" or "û"). We could thus think that Hangul syllables are limited to 3 jamo.

But according to Unicode's official grapheme cluster boundary algorithm (read: how to group codepoints into characters, http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), a code for an L jamo can also be followed by _and_ should be combined with a following L, V, LV, or LVT code. Similarly, LV or V combines with a following V or T, and LVT or T with a following T. (Seems logical.) So I do not know how complicated a Hangul syllable can be in practice or in theory. If whole syllables following other schemes than L / LV / LVT occur in practice, then this is another example of real-language whole characters that cannot be coded by a single codepoint.
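For what it's worth, a small Java check of that behaviour (same JDK classes as above; the jamo sequence spells the syllable "gak", whose precombined form is U+AC01):

    import java.text.BreakIterator;
    import java.text.Normalizer;

    public class HangulJamo {
        public static void main(String[] args) {
            // L + V + T jamo: U+1100 (kiyeok) + U+1161 (a) + U+11A8 (final kiyeok).
            String jamo = "\u1100\u1161\u11A8";

            // NFC recombines the three jamo into one precombined LVT syllable.
            String composed = Normalizer.normalize(jamo, Normalizer.Form.NFC);
            System.out.println(composed.equals("\uAC01"));  // true
            System.out.println(composed.length());          // 1

            // Even left decomposed, the sequence is a single grapheme cluster:
            // the first boundary comes only after all three jamo.
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(jamo);
            it.first();
            System.out.println(it.next());  // 3
        }
    }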


Denis
_________________
vita es estrany
spir.wikidot.com
