Duncan Booth wrote:

> There's a nice little codec from Skip Montanaro for removing accents from
> latin-1 encoded strings. It also has an error handler so you can convert
> from unicode to ascii and strip all the accents as you do so:
>
> http://orca.mojam.com/~skip/python/latscii.py
>
> >>> import latscii
> >>> import htmlentitydefs
> >>> print u'\u00c9'.encode('ascii','replacelatscii')
> E
> >>>
>
> So Bussiere could replace a large chunk of his code with:
>
>     ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
>     ligneA = ligneA.upper()
>
> INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
> his files are actually in some different encoding.
>
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought, it blows up on consecutive
> accented characters.
>
> :(

You made me look into it -- and I found that reusing the decoding map as the
encoding map lets you write

>>> u"Élève ééé".encode("latscii")
'Eleve eee'

without relying on the faulty error handler. I tried to fix the handler, too:

>>> u"Élève ééé".encode("ascii", "replacelatscii")
'Eleve eee'
>>> g = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> (u"möglich ähnlich üblich ááá" + g*3).encode("ascii", "replacelatscii")
'moglich ahnlich ublich aaa???'

No real testing was performed.

Peter

--- latscii_old.py      2006-03-24 11:45:22.580588520 +0100
+++ latscii.py  2006-03-24 11:48:13.191651696 +0100
@@ -141,7 +141,7 @@
 
 ### Encoding Map
 
-encoding_map = codecs.make_identity_dict(range(256))
+encoding_map = decoding_map
 
 
 ### From Martin Blais
@@ -166,9 +166,9 @@
 ## ustr.encode('ascii', 'replacelatscii')
 ##
 def latscii_error( uerr ):
-    key = ord(uerr.object[uerr.start:uerr.end])
+    key = ord(uerr.object[uerr.start])
     try:
-        return unichr(decoding_map[key]), uerr.end
+        return unichr(decoding_map[key]), uerr.start + 1
     except KeyError:
         handler = codecs.lookup_error('replace')
         return handler(uerr)
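
For anyone who wants to try the handler idea without the latscii module, here
is a minimal standalone sketch. The ascii encoder reports a whole run of
consecutive unencodable characters in a single error, so the old handler's
ord() was handed a multi-character slice. The sketch looks only at the first
character of the run and resumes right after it. Everything named here is an
assumption made for illustration: SKETCH_MAP is a tiny hypothetical stand-in
for latscii's decoding_map, and the handler is registered under the made-up
name 'replacelatscii_sketch'. Unlike the patch, it emits a single '?' itself
for unmapped characters instead of delegating to the 'replace' handler, so
each character in a run is considered on its own.

```python
import codecs

# Hypothetical stand-in for latscii's decoding_map (only a few entries).
SKETCH_MAP = {
    u"\xc9": u"E",  # E acute
    u"\xe8": u"e",  # e grave
    u"\xe9": u"e",  # e acute
    u"\xf6": u"o",  # o umlaut
}


def latscii_sketch_error(uerr):
    # This handler only makes sense for encoding errors.
    if not isinstance(uerr, UnicodeEncodeError):
        raise uerr
    # Look at the first offending character only, not the whole run.
    char = uerr.object[uerr.start]
    # Unmapped characters become '?'; either way, resume one character on.
    return SKETCH_MAP.get(char, u"?"), uerr.start + 1


# Made-up handler name, chosen so it cannot clash with latscii's own.
codecs.register_error("replacelatscii_sketch", latscii_sketch_error)


def strip_accents(text):
    """Encode unicode text to ASCII, replacing mapped accented characters."""
    return text.encode("ascii", "replacelatscii_sketch")
```

With this in place, strip_accents(u"\xc9l\xe8ve \xe9\xe9\xe9") gets through the
run of three accented characters one at a time instead of raising.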