On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro wrote:
> On Saturday, October 21, 2017 at 5:11:13 PM UTC+13, Rustom Mody wrote:
> > Is there a recommended library for manipulating grapheme clusters?
> 
> Is this <http://anoopkunchukuttan.github.io/indic_nlp_library/> any good?

Thanks, looks promising.
Dunno how much it lives up to its claims.
For now the one-liner using findall from the regex module has sufficed:
findall(r'\X', «text»)

[Thanks MRAB for the library]
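
For anyone following along, here is a minimal sketch of that one-liner wrapped in a function. The name graphemes() is made up for illustration. regex is MRAB's third-party module (pip install regex). The fallback used when it is not installed is only a rough approximation (a base code point plus any mark code points that follow it), not the full segmentation rules that \X implements.

```python
import unicodedata

try:
    import regex  # third-party module, not the stdlib "re"
except ImportError:
    regex = None


def graphemes(text):
    """Split text into grapheme clusters (hypothetical helper)."""
    if regex is not None:
        # \X matches one extended grapheme cluster.
        return regex.findall(r'\X', text)
    # Assumption: crude fallback that glues mark code points
    # (general category M*) onto the preceding code point.
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith('M'):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# 'u' followed by COMBINING DIAERESIS: two code points, one cluster.
sample = "u\u0308"
print(len(sample), len(graphemes(sample)))

# Devanagari KA + VOWEL SIGN I: two code points, one cluster.
sample = "\u0915\u093f"
print(len(sample), len(graphemes(sample)))
```

Both paths agree on these simple inputs. They can differ on harder ones (conjuncts, emoji sequences), which is the reason to prefer the regex module when it is available.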
 
> Bear in mind that the logical representation of the text is as code points, 
> graphemes would have more to do with rendering.

Heh! Speak of Euro/Anglo-centrism!

In a sane world graphemes would be called letters,
and Unicode code points would be called something else (letterlets?).
To be fair to the Unicode Consortium, they strive hard to call them code points.
But in an Anglo-centric world, the conflation of code point with letter is
inevitable, I guess.
To hear how a non-Roman-centric view of the world would sound:
A 'w' is a poorly double-struck 'u'
A 't' is a crossed 'l'
Reasonable?

The lead of https://en.wikipedia.org/wiki/%C3%9C has

| Ü, or ü, is a character…classified as a separate letter in several
| extended Latin alphabets (including Azeri, Estonian, Hungarian and
| Turkish), but as the letter U with an umlaut/diaeresis in others such
| as Catalan, French, Galician, German, Occitan and Spanish.
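
The same ambiguity shows up at the code point level: ü can be stored as one code point or as two, and both render as a single letter. A small stdlib-only sketch (the variable names are just for illustration):

```python
import unicodedata

composed = "\u00fc"  # ü as a single precomposed code point
# NFD splits it into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))
print([unicodedata.name(c) for c in decomposed])

# NFC recombines the pair into the precomposed form.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

So len() counts code points, not letters, and what counts as "one letter" depends on which normalization form the text is in as well as on the language.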
-- 
https://mail.python.org/mailman/listinfo/python-list
