I don't know whether this has been asked or reported before, but is the
example code for Hangul composition in UAX #15 correct?

The code is:
    public static String composeHangul(String source) {
        int len = source.length();
        if (len == 0) return "";
        StringBuffer result = new StringBuffer();
        char last = source.charAt(0);            // copy first char
        result.append(last);

        for (int i = 1; i < len; ++i) {
            char ch = source.charAt(i);

            // 1. check to see if two current characters are L and V

            int LIndex = last - LBase;
            if (0 <= LIndex && LIndex < LCount) {
                int VIndex = ch - VBase;
                if (0 <= VIndex && VIndex < VCount) {

                    // make syllable of form LV

                    last = (char)(SBase + (LIndex * VCount + VIndex) * TCount);
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }

            // 2. check to see if two current characters are LV and T

            int SIndex = last - SBase;
            if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
                int TIndex = ch - TBase;
                if (0 <= TIndex && TIndex <= TCount) {

                    // make syllable of form LVT

                    last += TIndex;
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }

            // if neither case was true, just add the character

            last = ch;
            result.append(ch);
        }
        return result.toString();
    }

Suppose I feed it 0xAC00 0x11C3. 0xAC00 is an LV syllable, not an L, so
the check in step 1 fails and the code falls through to step 2:

SIndex = 0xAC00 - 0xAC00 = 0
TIndex = 0x11C3 - 0x11A7 = 28

This makes "(0 <= TIndex && TIndex <= TCount)" true, so the output is
0xAC00 + 28 = 0xAC1C, which is not an LVT but an LV syllable! The
0x11C3 itself is discarded.
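
To double-check the arithmetic, here is a minimal sketch that isolates step 2. The class name TIndexBoundsDemo and the helper step2 are made up for illustration; SBase and TBase are the values used in the calculation above, and TCount = 28 and SCount = 11172 are the constants as I understand them from UAX 15.

```java
// Sketch only: TIndexBoundsDemo and step2 are hypothetical names, not part of UAX 15.
final class TIndexBoundsDemo {
    static final int SBase = 0xAC00, TBase = 0x11A7;
    static final int TCount = 28, SCount = 11172;

    // Mirrors step 2 of the quoted code for a previous char `last` and a current char `ch`.
    // Returns the composed code unit, or -1 if step 2 does not apply.
    static int step2(char last, char ch, boolean inclusiveUpperBound) {
        int SIndex = last - SBase;
        if (SIndex < 0 || SIndex >= SCount || SIndex % TCount != 0) {
            return -1; // `last` is not an LV syllable
        }
        int TIndex = ch - TBase;
        boolean inRange = inclusiveUpperBound
                ? (0 <= TIndex && TIndex <= TCount)  // test as quoted above
                : (0 <= TIndex && TIndex < TCount);  // test with a strict upper bound
        return inRange ? last + TIndex : -1;
    }

    public static void main(String[] args) {
        int quoted = step2('\uAC00', '\u11C3', true);
        int strict = step2('\uAC00', '\u11C3', false);
        System.out.printf("TIndex = %d%n", 0x11C3 - TBase);
        System.out.printf("quoted test gives U+%04X%n", quoted);
        // A remainder of 0 means the result is again an LV syllable, not an LVT.
        System.out.println("(result - SBase) % TCount = " + ((quoted - SBase) % TCount));
        System.out.println("strict test composes: " + (strict != -1));
    }
}
```

With the quoted test the result lands on a multiple of TCount, i.e. on the next LV syllable, which is what made me suspicious.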

I think the TIndex <= TCount should be TIndex < TCount. IMO the
example would be clearer if it used the Hangul_Syllable_Type
property.
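
For what it's worth, this is roughly how I would expect the function to look with the stricter test. It is only a sketch under my own assumptions, not the official text: the class name HangulComposeSketch is made up, the constants are the ones I take from UAX 15, and I also changed the lower bound to 0 < TIndex, because TIndex 0 corresponds to 0x11A7 itself, which as far as I can tell is not a trailing consonant either.

```java
// Sketch only: HangulComposeSketch is a hypothetical name; constants as I read them in UAX 15.
final class HangulComposeSketch {
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    static final int LCount = 19, VCount = 21, TCount = 28;
    static final int SCount = LCount * VCount * TCount; // 11172

    static String composeHangul(String source) {
        if (source.isEmpty()) return "";
        StringBuilder result = new StringBuilder();
        char last = source.charAt(0); // copy first char
        result.append(last);

        for (int i = 1; i < source.length(); ++i) {
            char ch = source.charAt(i);

            // 1. L followed by V: replace the pair with an LV syllable
            int LIndex = last - LBase;
            int VIndex = ch - VBase;
            if (0 <= LIndex && LIndex < LCount && 0 <= VIndex && VIndex < VCount) {
                last = (char) (SBase + (LIndex * VCount + VIndex) * TCount);
                result.setCharAt(result.length() - 1, last);
                continue; // discard ch
            }

            // 2. LV followed by T: strict bounds on both sides (my assumption),
            //    so TIndex must be in 1..TCount-1
            int SIndex = last - SBase;
            int TIndex = ch - TBase;
            if (0 <= SIndex && SIndex < SCount && SIndex % TCount == 0
                    && 0 < TIndex && TIndex < TCount) {
                last += TIndex;
                result.setCharAt(result.length() - 1, last);
                continue; // discard ch
            }

            // neither case applied: keep the character as it is
            last = ch;
            result.append(ch);
        }
        return result.toString();
    }
}
```

With this version 0xAC00 0x11C3 comes back unchanged, while 0x1100 0x1161 0x11A8 still composes to 0xAC01.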


A somewhat related question. I know next to nothing about Hangul
[de]composition, so forgive me for asking silly questions. In the
UnicodeData.txt file there are many more jamos than the 19 L, 21 V,
and 28 T that the code counts. Are the other jamos not used to compose
syllables, or does the syllable block represent an incomplete set of
compatibility characters? Which is it?

Theo



