[issue26917] Inconsistency in unicodedata.normalize()?

STINNER Victor Tue, 03 May 2016 02:26:11 -0700

STINNER Victor added the comment:

Extract from unicodedata_UCD_normalize_impl():


    if (strcmp(form, "NFC") == 0) {
        if (is_normalized(self, input, 1, 0)) {
            Py_INCREF(input);
            return input;
        }
        return nfc_nfkc(self, input, 0);
    }

is_normalized() is true for "\uafb8\u11a7" but false for "\U0002f8a1" (and also 
false for "\uafb8\u11a7\U0002f8a1").
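
These three cases can also be checked from Python rather than from C. The sketch below assumes Python 3.8 or later for the public unicodedata.is_normalized(), and falls back to comparing against normalize() on older versions. The public function is not the same thing as the static is_normalized() helper in the extract above, so its answers only approximate what the C code sees. The name looks_normalized is made up for this example.

```python
import unicodedata

def looks_normalized(form, text):
    """Hypothetical helper: report whether text is already in the given form."""
    check = getattr(unicodedata, "is_normalized", None)
    if check is not None:
        # Public API, available since Python 3.8.
        return check(form, text)
    # Fallback for older versions: compare against the normalized string.
    return unicodedata.normalize(form, text) == text

samples = ["\uafb8\u11a7", "\U0002f8a1", "\uafb8\u11a7\U0002f8a1"]
results = [looks_normalized("NFC", s) for s in samples]
for s, ok in zip(samples, results):
    print(ascii(s), ok)
```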

unicodedata.normalize("NFC", "\uafb8\u11a7") returns the string unchanged 
because is_normalized() is true.

unicodedata.normalize("NFD", "\uafb8\u11a7") returns "\u1101\u116e\u11a7": 
U+afb8 is decomposed to {U+1101, U+116e}.
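
A short script to reproduce the two calls above and print the code points involved. The codepoints() helper is invented for this example; only unicodedata.normalize() is real API.

```python
import unicodedata

def codepoints(text):
    """Hypothetical helper: render a string as a list of U+XXXX labels."""
    return ["U+%04X" % ord(ch) for ch in text]

original = "\uafb8\u11a7"
nfc = unicodedata.normalize("NFC", original)
nfd = unicodedata.normalize("NFD", original)

print("input:", codepoints(original))
print("NFC:  ", codepoints(nfc))   # unchanged: same code points as the input
print("NFD:  ", codepoints(nfd))   # U+AFB8 is split into U+1101, U+116E
```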

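unicodedata.normalize("NFC", unicodedata.normalize("NFD", "\uafb8\u11a7")) 
returns "\uafb8": this is the result of the Hangul composition step, where 
{U+1101, U+116e, U+11a7} is composed to {U+afb8}. So the NFC result depends on 
whether the input was decomposed first, which is the inconsistency.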
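
The composition arithmetic can be sketched with the Hangul syllable constants from the Unicode Standard. U+11A7 is exactly TBase, so its trailing index is 0, and the standard only treats indices 1 to 27 as trailing consonants. The sketch below shows that a composition step which also accepts index 0 would absorb U+11A7 without changing the resulting syllable. Whether nfc_nfkc() does this is an assumption, not something the extract above confirms. compose_hangul() and its accept_index_zero switch are invented for illustration.

```python
# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_hangul(l, v, t=None, accept_index_zero=False):
    """Hypothetical helper: compose jamo code points into one syllable.

    Returns (syllable, consumed_t). accept_index_zero is an illustrative
    switch, not an option of any real library.
    """
    l_index, v_index = l - L_BASE, v - V_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("not a leading consonant + vowel pair")
    syllable = S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
    if t is not None:
        t_index = t - T_BASE
        lowest = 0 if accept_index_zero else 1
        if lowest <= t_index < T_COUNT:
            return syllable + t_index, True
    return syllable, False

strict = compose_hangul(0x1101, 0x116E, 0x11A7)
lenient = compose_hangul(0x1101, 0x116E, 0x11A7, accept_index_zero=True)
print(hex(strict[0]), strict[1])    # 0xafb8 False: U+11A7 is left in the string
print(hex(lenient[0]), lenient[1])  # 0xafb8 True: U+11A7 is silently consumed
```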

It may be an issue in the "quickcheck" property of the Python Unicode database. 
Format of this field:

    /* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No,
       as described in http://unicode.org/reports/tr15/#Annex8. */
    quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
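
To make the bit layout concrete, here is a Python transcription of that expression, plus a small decoder for the two bits. The field value used at the end is made up for illustration and is not taken from the real database. quickcheck_value() is a hypothetical helper.

```python
def quickcheck_mask(nfc, k):
    """Python transcription of the C expression quoted above."""
    return 3 << ((4 if nfc else 0) + (2 if k else 0))

def quickcheck_value(field, nfc, k):
    """Hypothetical helper: extract the two quickcheck bits for one form."""
    shift = (4 if nfc else 0) + (2 if k else 0)
    return ("Yes", "Maybe", "No")[(field >> shift) & 3]

masks = {
    "NFD": quickcheck_mask(nfc=False, k=False),
    "NFKD": quickcheck_mask(nfc=False, k=True),
    "NFC": quickcheck_mask(nfc=True, k=False),
    "NFKC": quickcheck_mask(nfc=True, k=True),
}
for form, mask in masks.items():
    print(form, format(mask, "#010b"))

# Made-up field value: the NFC bits are set to 2 (No), everything else is 0.
example_field = 0b00100000
print(quickcheck_value(example_field, nfc=True, k=False))
print(quickcheck_value(example_field, nfc=False, k=False))
```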

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26917>
_______________________________________
