> <http://www.unicode.org/reports/tr15/> says:
> > int SIndex = last - SBase; ...
The arithmetic decomposition of the Hangul Syllable characters can be described as follows:

Each Hangul precomposed syllable character of Hangul_Syllable_Type LV has a canonical decomposition into an L and a V Hangul jamo:

    LV: s
        L in 1100–1112: LBase + ((s - SBase) div NCount)
        V in 1161–1175: VBase + (((s - SBase) mod NCount) div TCountP1)

Each Hangul precomposed syllable character of Hangul_Syllable_Type LVT has a canonical decomposition into an LV Hangul syllable character and a T Hangul jamo:

    LVT: s
        LV: SBase + (((s - SBase) div TCountP1) * TCountP1)
        T in 11A8–11C2: TBaseM1 + ((s - SBase) mod TCountP1)

(TBaseM1 is TBase-1 = 11A7, and TCountP1 is TCount+1 = 28; UAX 15 itself uses the names TBase and TCount for these two values.)

This makes them decompose just like other canonical decompositions: into (one or) two other characters, never more than two. The arithmetic description is then just a shorthand for a long list of 11,172 canonical decompositions. They could in principle be handled in normalisation code just like any other canonical decomposition/composition, given that expanded table. Code based on the arithmetic expressions is just more efficient in achieving the same thing.

The composition can likewise be described arithmetically. Note the use of the (relatively) new Hangul_Syllable_Type property.

Some pseudo-code (for those who like code) based on this for composing Hangul Syllable characters (I will spare you the pseudocode for decomposing, this reply is getting too long already):

    // SBase, LBase, VBase, NCount, TCountP1 and TBaseM1 are the constants
    // used in the formulas above. hangulSyllableType() is assumed to return
    // the Hangul_Syllable_Type property value of its argument.
    public static String composeHangul(String source) {
        int len = source.length();
        if (len == 0) return "";
        StringBuffer result = new StringBuffer();

        // Hangul is in the BMP, so we need not worry about higher planes.
        char prev = source.charAt(0); // get first char

        for (int i = 1; i < len; i++) {
            char curr = source.charAt(i);

            if ('\u1100' <= prev && prev <= '\u1112' &&   // "modern" L
                '\u1161' <= curr && curr <= '\u1175')     // "modern" V
            {
                // make a syllable of the form LV
                prev = (char)(SBase + ((prev - LBase) * NCount)
                                    + ((curr - VBase) * TCountP1));
            }
            else if (hangulSyllableType(prev) == HangulSyllableType.LV &&
                     '\u11A8' <= curr && curr <= '\u11C2') // "modern" T
            {
                // make a syllable of the form LVT
                prev += curr - TBaseM1;
            }
            else
            {
                // no arithmetic composition possible, move on
                result.append(prev);
                prev = curr;
            }
        }
        result.append(prev); // don't lose last char in string
        return result.toString();
    }

Note that, while NOT part of Unicode decompositions, many of the Hangul Jamo characters decompose into two or three other Hangul Jamo letters. But that is well beyond UAX 15, unfortunately.

/kent k
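
For the decomposing direction that was left out above, here is a minimal sketch rather than a definitive implementation. It assumes the constant values from UAX 15 (SBase = AC00, LBase = 1100, VBase = 1161, 19 leading consonants, 21 vowels) together with the TBaseM1/TCountP1 naming used in this message; the class name HangulDecompose and the method names lvPart and decomposeHangul are made up for illustration. It applies the LVT step and then the LV step, so each syllable comes out as L V or L V T jamo.

```java
final class HangulDecompose {
    // Constant values as given in UAX 15; names follow this message.
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161;
    static final int TBaseM1 = 0x11A7;             // one below the first T jamo (11A8)
    static final int LCount = 19, VCount = 21;
    static final int TCountP1 = 28;                // 27 T jamo plus "no T"
    static final int NCount = VCount * TCountP1;   // 588 syllables per L
    static final int SCount = LCount * NCount;     // 11172 syllables in total

    /** For a precomposed syllable, returns its LV part (s itself if s is LV). */
    static char lvPart(char s) {
        int sIndex = s - SBase;
        return (char) (SBase + (sIndex / TCountP1) * TCountP1);
    }

    /** Replaces every precomposed Hangul syllable by its L V (T) jamo. */
    static String decomposeHangul(String source) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            char s = source.charAt(i);
            int sIndex = s - SBase;
            if (sIndex < 0 || sIndex >= SCount) {
                result.append(s);                  // not a Hangul syllable
                continue;
            }
            int tIndex = sIndex % TCountP1;        // 0 means type LV, no T
            int lvIndex = lvPart(s) - SBase;       // LVT -> LV step
            result.append((char) (LBase + lvIndex / NCount));
            result.append((char) (VBase + (lvIndex % NCount) / TCountP1));
            if (tIndex != 0) {
                result.append((char) (TBaseM1 + tIndex));
            }
        }
        return result.toString();
    }
}
```

For example, decomposeHangul("\uD55C") gives "\u1112\u1161\u11AB", and lvPart('\uAC1D') gives '\uAC1C'. Characters outside AC00–D7A3 are passed through unchanged.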