STINNER Victor added the comment:
Extract of unicodedata_UCD_normalize_impl():
    if (strcmp(form, "NFC") == 0) {
        if (is_normalized(self, input, 1, 0)) {
            Py_INCREF(input);
            return input;
        }
        return nfc_nfkc(self, input, 0);
    }
is_normalized() is true for "\uafb8\u11a7" but false for "\U0002f8a1" (and also
false for "\uafb8\u11a7\U0002f8a1").
unicodedata.normalize("NFC", "\uafb8\u11a7") returns the string unchanged
because is_normalized() is true.
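
The observations above can be checked from Python with a small sketch. Note the assumptions: unicodedata.is_normalized() is the public function added in Python 3.8, not the static C helper is_normalized() quoted in the extract, so it only approximates what the C code sees. The helper name nfc_report is hypothetical.

```python
import unicodedata

# The three strings discussed in this comment.
samples = ["\uafb8\u11a7", "\U0002f8a1", "\uafb8\u11a7\U0002f8a1"]

def nfc_report(text):
    """Return (reported_as_nfc, unchanged_by_nfc) for one string."""
    # Public API (Python 3.8+); stands in for the internal C helper.
    reported = unicodedata.is_normalized("NFC", text)
    # True when normalize() hands back an equal string.
    unchanged = unicodedata.normalize("NFC", text) == text
    return reported, unchanged

for s in samples:
    print(ascii(s), nfc_report(s))
```

Only the first string is reported as already normalized, which matches the early-return branch in the extract.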
unicodedata.normalize("NFD", "\uafb8\u11a7") returns "\u1101\u116e\u11a7":
U+afb8 is decomposed to {U+1101, U+116e}.
unicodedata.normalize("NFC", unicodedata.normalize("NFD", "\uafb8\u11a7"))
returns "\uafb8". This is the result of the Hangul composition step:
{U+1101, U+116e, U+11a7} is composed to {U+afb8}, so the trailing U+11a7 is
lost in the round trip.
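
To make the code points above easier to follow, here is a minimal sketch of the Hangul syllable arithmetic as described in the Unicode Standard (constants SBase, LBase, VBase, TBase and the counts come from there). The function names are hypothetical and this is not the code in unicodedata.c. It shows that U+afb8 has no trailing consonant, and that U+11a7 is equal to TBase, which the standard's algorithm does not count as a trailing consonant.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant

def decompose_syllable(cp):
    """Split a precomposed Hangul syllable into its jamo code points."""
    s_index = cp - S_BASE
    if not 0 <= s_index < L_COUNT * N_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    # t_index 0 means "no trailing consonant" (an LV syllable).
    return (lead, vowel) if t_index == 0 else (lead, vowel, T_BASE + t_index)

def is_trailing_consonant(cp):
    """True only for U+11A8..U+11C2; T_BASE itself (U+11A7) is excluded."""
    return T_BASE < cp < T_BASE + T_COUNT

jamo = decompose_syllable(0xAFB8)
print([hex(c) for c in jamo])            # ['0x1101', '0x116e']
print(is_trailing_consonant(0x11A7))     # False
# Cross-check the arithmetic against the library's NFD result.
print("".join(map(chr, jamo)) == unicodedata.normalize("NFD", "\uafb8"))
```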
It may be an issue in the "quickcheck" property of the Python Unicode database.
Format of this field:
    /* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No,
       as described in http://unicode.org/reports/tr15/#Annex8. */
    quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
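
As a rough illustration of that field layout, the mask expression can be mirrored in Python. The shift computation is taken directly from the C line above; the decoding helper and the sample field value 0x20 are assumptions made for illustration, based only on the comment (0=Yes, 1=Maybe, 2=No), since the extract does not show how the C code applies the mask.

```python
QC_NAMES = {0: "Yes", 1: "Maybe", 2: "No"}  # per the C comment above

def quickcheck_shift(nfc, k):
    """Bit offset of the two quickcheck bits for one normalization form."""
    return (4 if nfc else 0) + (2 if k else 0)

def quickcheck_mask(nfc, k):
    """Python mirror of: 3 << ((nfc ? 4 : 0) + (k ? 2 : 0))."""
    return 3 << quickcheck_shift(nfc, k)

def decode_quickcheck(field, nfc, k):
    """Hypothetical helper: extract and name the two bits for one form."""
    value = (field & quickcheck_mask(nfc, k)) >> quickcheck_shift(nfc, k)
    return QC_NAMES[value]  # raises KeyError for the unused value 3

# Masks for NFD, NFKD, NFC, NFKC in that order.
for nfc, k in [(False, False), (False, True), (True, False), (True, True)]:
    print(nfc, k, hex(quickcheck_mask(nfc, k)))

# Made-up field value: NFC bits set to 2 ("No"), all other bits 0 ("Yes").
print(decode_quickcheck(0x20, nfc=True, k=False))   # No
print(decode_quickcheck(0x20, nfc=False, k=False))  # Yes
```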
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26917>
_______________________________________