Peter Bengtsson <[EMAIL PROTECTED]> writes:

> In UTF8, \u0141 is a capital L with a little dash through it as can be
> seen in this image:
> http://static.peterbe.com/lukasz.png
>
> I tried this:
> >>> import unicodedata
> >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> ''
>
> I was hoping it would convert it to 'L' because that's what it
> visually looks like. And I've seen it becoming a normal ascii L before
> in other programs such as Thunderbird.
>
> I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> none of them helped.
>
> What am I doing wrong?

I had the same problem, and a little research showed that it comes from
the Unicode standard itself. I don't know why, but characters with a
stroke have no canonical decomposition. I looked into this file:
http://unicode.org/Public/UNIDATA/UnicodeData.txt
and compared two sets of entries:

1.
<UnicodeData.txt>
0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH \
;;0141;;0141
0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH \
;;;0142;
</UnicodeData.txt>

2.
<UnicodeData.txt>
0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK \
;;0104;;0104
</UnicodeData.txt>

In the second case the 6th field holds the canonical decomposition
(0061 0328, i.e. "a" followed by a combining ogonek), but in the first
case that field is empty, so no normalization form can reduce the
letter to a plain "L". I don't know what the justification is, but
there probably is one. ;)

Regards,
Rob
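
P.S. You can check that decomposition field from Python with
unicodedata.decomposition(), without reading UnicodeData.txt by hand.
The snippet below is only a rough sketch of one possible workaround:
the FALLBACK table and the to_ascii() helper are names I made up for
illustration, and the table covers just the two L-with-stroke letters,
so you would have to extend it yourself for other letters.

```python
import unicodedata

# Hypothetical hand-written table for letters whose decomposition
# field in UnicodeData.txt is empty; extend it as needed.
FALLBACK = {
    u'\u0141': u'L',  # LATIN CAPITAL LETTER L WITH STROKE
    u'\u0142': u'l',  # LATIN SMALL LETTER L WITH STROKE
}

def to_ascii(text):
    """Hypothetical helper: apply FALLBACK, then drop combining marks."""
    mapped = u''.join(FALLBACK.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize('NFKD', mapped)
    # Anything still outside ASCII after decomposition is discarded.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# decomposition() returns the 6th field of the UnicodeData.txt entry.
print(repr(unicodedata.decomposition(u'\u0141')))  # empty string
print(repr(unicodedata.decomposition(u'\u0105')))  # '0061 0328'
print(to_ascii(u'\u0141ukasz'))
```

Characters that are missing from the table and have no decomposition
are still dropped silently, the same way as in the original attempt.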