On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro wrote:
> On Saturday, October 21, 2017 at 5:11:13 PM UTC+13, Rustom Mody wrote:
> > Is there a recommended library for manipulating grapheme clusters?
>
> Is this <http://anoopkunchukuttan.github.io/indic_nlp_library/> any good?

Thanks, looks promising. Dunno how much it lives up to the claims.
For now the one-liner from regex's findall has sufficed:
    findall(r'\X', «text»)
[Thanks MRAB for the library]

> Bear in mind that the logical representation of the text is as code points,
> graphemes would have more to do with rendering.

Heh! Speak of Euro/Anglo-centrism!

In a sane world graphemes would be called letters,
and Unicode codepoints would be called something else: letterlets??
To be fair to the Unicode Consortium, they strive hard to call them codepoints.
But in an Anglo-centric world, the conflation of codepoint with letter is inevitable, I guess.

To hear how a non-Roman-centric view of the world would sound:
A 'w' is a poorly double-struck 'u'
A 't' is a crossed 'l'
Reasonable?

The lead of https://en.wikipedia.org/wiki/%C3%9C has

| Ü, or ü, is a character…classified as a separate letter in several extended Latin alphabets
| (including Azeri, Estonian, Hungarian and Turkish), but as the letter U with an
| umlaut/diaeresis in others such as Catalan, French, Galician, German, Occitan and Spanish.
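
For anyone who wants to see the codepoint-vs-letter gap concretely, here is a rough stdlib-only sketch. It is not the regex one-liner mentioned above: `approx_clusters` is a hypothetical helper that merely glues combining marks onto the preceding base character, which is only an approximation of what `\X` does and ignores most of the real segmentation rules (Hangul, emoji sequences, Indic conjuncts and so on).

```python
import unicodedata


def approx_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper, a crude approximation only: it does NOT
    implement the full Unicode text segmentation rules.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# 'u' followed by COMBINING DIAERESIS (U+0308), then "ber"
decomposed = "u\u0308ber"
# NFC folds the pair into the single code point U+00FC
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))                   # 5 code points
print(len(composed))                     # 4 code points
print(len(approx_clusters(decomposed)))  # 4 clusters
print(len(approx_clusters(composed)))    # 4 clusters
```

Same four "letters" to a reader either way, but the codepoint count depends on which normalization form the text happens to be in. With the third-party regex module installed, findall(r'\X', «text») is the more faithful way to get the clusters.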