On 01/14/2011 08:20 PM, Nick Sabalausky wrote:
"spir"<denis.s...@gmail.com> wrote in message
news:mailman.619.1295012086.4748.digitalmar...@puremagic.com...
If anyone finds a pointer to such an explanation, bravo, and thank you.
(You will certainly not find it in Unicode literature, for instance.)
Nick's explanation below is good and concise. (Just 2 notes added.)
Yea, most Unicode explanations seem to talk all about "code-units vs
code-points" and then they'll just have a brief note like "There's also
other things like digraphs and combining codes." And that'll be all they
mention.
You're right about the Unicode literature. It's the usual standards-body
documentation, same as W3C: "Instead of only some people understanding how
this works, let's encode the documentation in legalese (and have twenty
only-slightly-different versions) to make sure that nobody understands how
it works."
If anyone is interested, ICU's documentation is far more readable (and
intended for programmers). ICU is *the* reference library for dealing
with Unicode (an IBM open-source product, with C/C++/Java interfaces),
used by many other products in the background.
ICU: http://site.icu-project.org/
user guide: http://userguide.icu-project.org/
section about text segmentation:
http://userguide.icu-project.org/boundaryanalysis
Note that, just like Unicode, they consider forming graphemes (grouping
codes into character representations) just a particular case of text
segmentation, which they call "boundary analysis" (though they have the
nice idea of saying "character" instead of "grapheme").
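To make "boundary analysis" concrete, here is a toy sketch in Python (my own code, not ICU's algorithm): it approximates character boundaries by attaching combining marks to the preceding base code point. Real boundary analysis handles much more (Hangul jamo, which have combining class 0, are deliberately out of scope here); the function name is mine.

```python
import unicodedata

def characters(text):
    """Crude approximation of character boundary analysis:
    a new "character" starts at every code point that is not a
    combining mark; combining marks (nonzero combining class)
    are attached to the preceding base code point.
    NOTE: does not handle Hangul jamo, which combine despite
    having combining class 0."""
    clusters = []
    for cp in text:
        if clusters and unicodedata.combining(cp) != 0:
            clusters[-1] += cp  # attach combining code to its base
        else:
            clusters.append(cp)
    return clusters

# "e" + COMBINING CIRCUMFLEX ACCENT + "u": 3 code points,
# but only 2 user-perceived characters.
print(characters("e\u0302u"))
```

This is exactly the grouping step the thread calls "forming graphemes"; ICU's BreakIterator additionally returns the boundary *indexes* rather than the substrings.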
The only mention I found in ICU's doc of the issue we have discussed
here at length is (at http://userguide.icu-project.org/strings):
"Handling Lengths, Indexes, and Offsets in Strings
The length of a string and all indexes and offsets related to the string
are always counted in terms of UChar code units, not in terms of UChar32
code points. (This is the same as in common C library functions that use
char * strings with multi-byte encodings.)
Often, a user thinks of a "character" as a complete unit in a language,
like an 'Ä', while it may be represented with multiple Unicode code
points including a base character and combining marks. (See the Unicode
standard for details.) This often requires users to index and pass
strings (UnicodeString or UChar *) with multiple code units or code
points. It cannot be done with single-integer character types. Indexing
of such "characters" is done with the BreakIterator class (in C: ubrk_
functions).
Even with such "higher-level" indexing functions, the actual index
values will be expressed in terms of UChar code units. When more than
one code unit is used at a time, the index value changes by more than
one at a time. [...]"
(ICU's UChar are like D wchar.)
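The code-unit vs. code-point distinction the ICU doc describes is easy to demonstrate: Python's `str` indexes by code point, while ICU's UChar (and D's wchar) count UTF-16 code units, so the two counts diverge for any character outside the Basic Multilingual Plane.

```python
# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
s = "a\U0001D11Eb"

code_points = len(s)                           # Python counts code points
utf16_units = len(s.encode("utf-16-le")) // 2  # the G clef needs a surrogate pair

print(code_points, utf16_units)  # 3 code points, 4 UTF-16 code units
```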
You can also say there are 2 kinds of characters: simple like "u" &
composite like "ü" (possibly with several combining marks). The former
are coded with a single (base) code, the latter with one base code
(rarely more) and an arbitrary number of combining codes.
Couple questions about the "more than one base codes":
- Do you know an example offhand?
No. I know of this only from its being mentioned in documentation.
Unless we count (see below) L jamo as base codes.
- Does that mean like a ligature where the base codes form a single glyph,
or does it mean that the combining code either spans or operates over
multiple glyphs? Or can it go either way?
IIRC, examples like "ij" in Dutch are only considered "compatibility
equivalent" to the corresponding ligatures, just like e.g. "ss" for "ß"
in German. Meaning they should not be considered equal by default; that
would be an additional feature, language- and app-dependent. Unlike
base "e" + combining "^", which really == "ê".
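Both kinds of equivalence can be checked with Python's stdlib `unicodedata`: canonical equivalence composes under NFC, while the "ij" ligature only decomposes under the compatibility normalizations. (One caveat I'm fairly sure of: "ss" for "ß" is handled by case folding in Unicode, not by compatibility decomposition.)

```python
import unicodedata

# Canonical equivalence: base "e" + combining "^" really is "ê";
# NFC composes the pair into the single precomposed code point.
assert unicodedata.normalize("NFC", "e\u0302") == "\u00EA"  # "ê"

# Compatibility equivalence: the "ij" ligature U+0133 survives
# canonical decomposition (NFD) unchanged, and only splits into
# "ij" under compatibility decomposition (NFKD).
assert unicodedata.normalize("NFD", "\u0133") == "\u0133"
assert unicodedata.normalize("NFKD", "\u0133") == "ij"

# "ss" for "ß" is similar in spirit, but Unicode handles it via
# case folding rather than any normalization form.
assert "ß".casefold() == "ss"
print("all equivalence checks passed")
```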
For a majority of _common_ characters made of 2 or 3 codes
(Western-language letters, Korean Hangul syllables, ...), precomposed
codes have been added to the set. Thus, they can be coded with a single
code, like simple characters.
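This, too, can be shown with stdlib `unicodedata`: NFD splits a precomposed code into its base + combining (or jamo) sequence, and NFC reassembles it.

```python
import unicodedata

# One precomposed code point vs. its decomposed sequence, for a
# Western letter and a Hangul syllable.
for ch in ("\u00FC", "\uD55C"):      # "ü", Hangul syllable HAN "한"
    decomposed = unicodedata.normalize("NFD", ch)
    print(ch, len(ch), len(decomposed))
    # "ü"  -> 1 precomposed code, 2 decomposed (u + combining diaeresis)
    # "한" -> 1 precomposed code, 3 decomposed (L + V + T jamo)
    assert unicodedata.normalize("NFC", decomposed) == ch  # round-trips
```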
Out of curiosity, how do decomposed Hangul characters work? (Or do you
know?) Not actually knowing any Korean, my understanding is that they're a
set of 1 to 4 phonetic glyphs that are then combined into one glyph. So,
is it like a series of base codes that automatically combine, or are
there combining characters involved?
I know nothing about Korean language except what I studied about its
scripting system for Unicode algorithms (but one can also code said
algorithm blindly). See http://en.wikipedia.org/wiki/Hangul and about
Hangul in Unicode
http://en.wikipedia.org/wiki/Korean_language_and_computers. What I
understand (beware, these are just wild deductions) is that there are 3
kinds of "jamo" scripting marks (noted L, V, T) that can combine into
syllabic "graphemes", respectively in first, medial, and last place.
These marks indeed roughly correspond to consonant or vowel phonemes.
In Unicode, in addition to such jamo, which are simple marks (like base
letters and diacritics in Latin-based languages), there are precomposed
codes for LV and LVT combinations (as for "ä" or "û"). We could thus
think that Hangul syllables are limited to 3 jamo.
But: according to Unicode's official grapheme cluster break algorithm
(read: how to group codepoints into characters)
(http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), a code
for an L jamo can also be followed by _and_ should be combined with a
following L, V, LV, or LVT code. Similarly, LV or V combines with a
following V or T, and LVT or T with a following T. (Seems logical.) So,
I do not know how complicated a Hangul syllable can be in practice or
in theory.
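Those Hangul-specific rules (GB6-GB8 in UAX #29) are small enough to sketch in Python. The code-point ranges and the no-break table below come from the standard; the function names and the representation are my own, and this deliberately ignores all the non-Hangul rules of the full algorithm.

```python
def jamo_class(cp):
    """Classify one code point for Hangul grapheme clustering
    (ranges from UAX #29 / the Unicode code charts)."""
    n = ord(cp)
    if 0x1100 <= n <= 0x115F:
        return "L"                       # leading consonant jamo
    if 0x1160 <= n <= 0x11A7:
        return "V"                       # vowel jamo
    if 0x11A8 <= n <= 0x11FF:
        return "T"                       # trailing consonant jamo
    if 0xAC00 <= n <= 0xD7A3:            # precomposed syllable block
        return "LV" if (n - 0xAC00) % 28 == 0 else "LVT"
    return "other"

# UAX #29 rules GB6-GB8: pairs that must NOT be broken apart.
NO_BREAK = {
    "L":   {"L", "V", "LV", "LVT"},  # GB6:  L x (L | V | LV | LVT)
    "LV":  {"V", "T"},               # GB7:  (LV | V) x (V | T)
    "V":   {"V", "T"},
    "LVT": {"T"},                    # GB8:  (LVT | T) x T
    "T":   {"T"},
}

def is_break(a, b):
    """True if a grapheme boundary falls between code points a and b
    (Hangul rules only -- everything else breaks)."""
    return jamo_class(b) not in NO_BREAK.get(jamo_class(a), set())

# L + V + T jamo (decomposed "한"): no break anywhere, so the three
# code points form one syllable "grapheme".
seq = "\u1112\u1161\u11AB"
print([is_break(a, b) for a, b in zip(seq, seq[1:])])  # [False, False]
```

Note that GB6 allows L to be followed by another L, which is exactly why syllables beyond the simple L / LV / LVT schemes can form a single grapheme cluster.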
If whole syllables following schemes other than L / LV / LVT can occur
in practice, then this is another example of real-language whole
characters that cannot be coded with a single codepoint.
Denis
_________________
vita es estrany
spir.wikidot.com