Guillaume Sanchez added the comment:

Thanks for all those interesting cases you brought here! I didn't think of that 
at all!

I'm using the word "grapheme" as per the definition given in UAX #29 (TR29),
which is *not* language/locale dependent [1].

This annex is very specific and precise about where to break "grapheme
clusters", aka "where does a character start and end". Sadly, it's a bit more
complex than just accumulating based on the Combining property. This annex
gives a set of rules to implement, based on the Grapheme_Cluster_Break
property. Those rules may naively be implemented by comparing adjacent pairs of
code points, but that is wrong, because some rules depend on more context than
the two neighboring code points. They can be implemented correctly and
efficiently as an automaton. My code [2] passes all tests from
GraphemeBreakTest.txt (provided by Unicode).
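
To illustrate why pairwise comparison is not enough, here is a minimal Python
sketch. It is not the code from [2] and it covers only a small subset of the
UAX #29 rules (CR LF, combining marks / ZWJ, and regional indicator pairs). The
names simple_graphemes, _is_extend, _is_regional_indicator and _joins are
hypothetical, and the Extend check is an approximation based on general
categories rather than the real Grapheme_Cluster_Break property. The point is
the ri_run counter: flags are pairs of regional indicators, so the decision
depends on how many of them came before, not just on the previous code point.

```python
import unicodedata


def _is_extend(ch):
    # Approximation (assumption): treat combining marks and ZWJ as
    # "attach to the previous cluster". The real property tables differ.
    return unicodedata.category(ch) in ("Mn", "Me", "Mc") or ch == "\u200d"


def _is_regional_indicator(ch):
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def _joins(prev, ch, ri_run):
    """Return True if there is no break between prev and ch.

    ri_run is the number of consecutive regional indicators ending at prev.
    """
    if prev == "\r" and ch == "\n":
        return True                  # keep CR LF together
    if prev in "\r\n" or ch in "\r\n":
        return False                 # otherwise break around line endings
    if _is_extend(ch):
        return True                  # marks and ZWJ extend the cluster
    if _is_regional_indicator(prev) and _is_regional_indicator(ch):
        return ri_run % 2 == 1       # only pair up an unpaired indicator
    return False


def simple_graphemes(text):
    """Yield approximate grapheme clusters of text (subset of UAX #29)."""
    cluster = ""
    ri_run = 0                       # the one piece of automaton state
    for ch in text:
        if cluster and _joins(cluster[-1], ch, ri_run):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
        ri_run = ri_run + 1 if _is_regional_indicator(ch) else 0
    if cluster:
        yield cluster
```

With this sketch, list(simple_graphemes(s)) for a string of four regional
indicators gives two clusters of two code points each, whereas a purely
pairwise "never break between two regional indicators" rule would glue all four
together.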

We can definitely do a generator like you propose, or rather do it in the C
layer to gain more efficiency and coherence, since the other string / Unicode
operations are in the C layer (upper, lower, casefold, etc.).

Let me know what you guys think, what (and if) I should contribute :)

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
[2] https://github.com/Vermeille/batriz/blob/master/src/str/grapheme_iterator.h#L31

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30717>
_______________________________________