[issue30717] str.center() is not unicode aware

Steven D'Aprano Tue, 20 Jun 2017 17:07:41 -0700

Steven D'Aprano added the comment:

I don't think "grapheme" is the right term here. Graphemes are language
dependent; for instance, "ǆ" may be considered a grapheme in Croatian.


https://en.wikipedia.org/wiki/D%C5%BE
http://www.unicode.org/glossary/#grapheme

I believe you are referring to combining characters:

http://www.unicode.org/faq/char_combmark.html

It is unfortunate that Python's string methods are naive about combining 
characters, and just count code points, but I'm not sure what the alternative 
is. For example the human reader may be surprised that these give two different 
results:

py> len("naïve")
5
py> len("naïve")
6

I'm not sure if the effect will survive copying and pasting, but the first
string uses:

U+00EF LATIN SMALL LETTER I WITH DIAERESIS

while the second uses:

U+0069 LATIN SMALL LETTER I + U+0308 COMBINING DIAERESIS
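
Since the two spellings may not survive copying and pasting, here is a small sketch that builds both strings from explicit escapes, so the difference cannot be lost in transit. The variable names are only illustrative. It also shows that unicodedata.normalize() can convert one form into the other:

```python
import unicodedata

# Precomposed: U+00EF LATIN SMALL LETTER I WITH DIAERESIS
composed = "na\u00efve"
# Decomposed: U+0069 LATIN SMALL LETTER I + U+0308 COMBINING DIAERESIS
decomposed = "nai\u0308ve"

print(len(composed))           # 5
print(len(decomposed))         # 6
print(composed == decomposed)  # False

# NFC composes base + mark where a precomposed form exists; NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

Normalizing to NFC hides the problem for this particular string, but not every base + mark sequence has a precomposed form, so it is not a general fix.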

And check out this surprising result:

py> "xïoz"[::-1]
'zöix'
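
To see why the diaeresis moves, it may help to list the code points of the reversed string. This is only a sketch, and it assumes the original string was spelled with the decomposed form (i + U+0308):

```python
import unicodedata

s = "xi\u0308oz"   # x, i, COMBINING DIAERESIS, o, z
r = s[::-1]        # reverses code points, not visible characters

# After reversal the combining mark follows "o" instead of "i".
for ch in r:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The slice reverses the sequence of code points, so the combining mark ends up after the "o" and attaches to it when rendered.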


It seems to me that it would be great if Python were fully aware of combining
characters, it's not so great if it is naïve, but it would be simply terrible if
only a few methods were aware and the rest naïve.

I don't have a good solution to this, but perhaps an iterator over (base 
character + combining marks) would be a good first step. Something like this?

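import unicodedata

def chars(string):
    """Yield lists of [base character, combining mark, ...]."""
    # Only category 'Mn' (nonspacing mark) is treated as combining here;
    # 'Mc' and 'Me' marks are not handled.
    accum = []
    for c in string:
        cat = unicodedata.category(c)
        if cat == 'Mn':
            # Attach the mark to the group currently being built.
            accum.append(c)
        else:
            # A new base character: emit the previous group, start a new one.
            if accum:
                yield accum
                accum = []
            accum.append(c)
    # Emit the final group, if any.
    if accum:
        yield accum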
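
As a rough illustration of what could be built on such an iterator, here is a self-contained sketch. The names clusters() and cluster_center() are hypothetical, not an existing or proposed API. clusters() is a variant of the generator above that yields strings rather than lists, and it keeps the same 'Mn'-only assumption. cluster_center() does not try to reproduce str.center()'s exact rule for placing an odd amount of padding:

```python
import unicodedata

def clusters(string):
    """Yield each base character plus its trailing 'Mn' marks as one str."""
    accum = ""
    for c in string:
        if unicodedata.category(c) == "Mn":
            accum += c          # attach the mark to the current cluster
        else:
            if accum:
                yield accum     # emit the previous cluster
            accum = c           # start a new cluster at this base character
    if accum:
        yield accum

def cluster_center(string, width, fillchar=" "):
    """Center string in width, counting clusters instead of code points."""
    n = sum(1 for _ in clusters(string))
    pad = max(width - n, 0)
    left = pad // 2             # any odd leftover goes on the right here
    return fillchar * left + string + fillchar * (pad - left)

decomposed = "nai\u0308ve"
print(list(clusters(decomposed)))
print(cluster_center(decomposed, 9, "*"))

# Reversing clusters keeps the diaeresis attached to the "i".
print("".join(reversed(list(clusters("xi\u0308oz")))))
```

With this approach the decomposed "naïve" counts as five units, so it gets the same amount of padding as the precomposed spelling.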

----------
nosy: +steven.daprano

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue30717>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
