On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:

> On 12/2/13 3:38 PM, Ethan Furman wrote:
>> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>>
>>> Out of the nine tests, Python 3.3 passes six, with three tests being
>>> failures or dubious. If you believe that the native string type should
>>> operate on code-points, then you'll think that Python does the right
>>> thing.
>>
>> I think Python is doing it correctly. If I want to operate on
>> "clusters" I'll normalize the string first.
>>
>> Thanks for this excellent post.
>>
>> --
>> ~Ethan~
>
> This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point? Characters can accept many
> accents, for example. In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.

That is correct. If Unicode had a distinct code point for every possible
combination of a base character plus an arbitrary number of diacritics or
accents, the 0x110000 code points available (U+0000 through U+10FFFF)
wouldn't be anywhere near enough.

I see over 300 diacritics used just in the first 5000 code points. Let's
pretend that's only 100, and that you can use up to a maximum of 5 at a
time. Counting unordered sets of zero to five distinct accents, that gives
79375496 combinations per base character, far more than the total number
of Unicode code points.

If anyone wishes to check my logic:

    # count distinct combining chars
    import unicodedata
    s = ''.join(chr(i) for i in range(33, 5000))
    s = unicodedata.normalize('NFD', s)
    t = [c for c in s if unicodedata.combining(c)]
    len(set(t))

    # calculate the number of combinations
    def comb(r, n):
        """Combinations nCr"""
        p = 1
        for i in range(r+1, n+1):
            p *= i
        for i in range(1, n-r+1):
            p //= i  # exact at every step; / would give a float in Python 3
        return p

    sum(comb(i, 100) for i in range(6))

I'm not suggesting that all of those accents are necessarily in use in the
real world, but there are languages which construct arbitrary combinations
of accents. (Or so I have been led to believe.)

-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list
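
To illustrate the point that normalizing does not always get you down to one code point, here is a minimal sketch using only the standard unicodedata module. The helper name nfc_length is made up for this example, and the two sample strings are my own choice: 'e' plus COMBINING ACUTE ACCENT has a precomposed form (U+00E9), while 'x' plus the same accent has none, so NFC can only shrink the first.

```python
import unicodedata

def nfc_length(s):
    """Return the number of code points left after NFC normalization.

    Hypothetical helper, named here for illustration only.
    """
    return len(unicodedata.normalize('NFC', s))

# 'e' + COMBINING ACUTE ACCENT: a precomposed form exists (U+00E9).
composable = 'e\u0301'

# 'x' + COMBINING ACUTE ACCENT: no precomposed form exists, so NFC
# has to leave it as two code points.
not_composable = 'x\u0301'

print(nfc_length(composable))      # 1
print(nfc_length(not_composable))  # 2
```

In the second case len() still reports 2 after normalizing, so code that wants to treat the accented letter as one unit cannot rely on normalization plus the ordinary string methods.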