2012/9/14 Tim Chase <python.l...@tim.thechases.com>:
> On 09/13/12 16:44, Vlastimil Brom wrote:
>> >>> import unicodedata
>> >>> unicodedata.normalize("NFD", u"serviço móvil").encode("ascii", "ignore").decode("ascii")
>> u'servico movil'
>
> Works well for all the test-cases I threw at it.  Thanks!
>
> -tkc
>
Hi,
I am glad it works, but I agree with the other comments that it would
be preferable to keep the original accented text throughout the whole
processing, if at all possible.

The above works by decomposing the accented characters into "basic"
characters and the bare accents (combining diacritics) using
normalize(), and then just stripping anything outside ASCII in
encode("...", "ignore").

This works for "combinable" accents, and most of the Portuguese
characters outside of ASCII appear to fall into this category, but
there are others as well. E.g. according to
http://tlt.its.psu.edu/suggestions/international/bylanguage/portuguese.html
there are at least ºª«»€, which would be lost completely in such a
conversion.

ª (dec.: 170) (hex.: 0xaa) # FEMININE ORDINAL INDICATOR
º (dec.: 186) (hex.: 0xba) # MASCULINE ORDINAL INDICATOR

You can preprocess such cases as appropriate before doing the
conversion, e.g. just:

>>> u"ºª«»€".replace(u"º", u"o").replace(u"ª", u"a").replace(u"«", u'"').replace(u"»", u'"').replace(u"€", u"EUR")
u'oa""EUR'
>>>

or using a more elegant function and replacement lists (possibly
handling other cases as well).

regards,
   vbr
--
http://mail.python.org/mailman/listinfo/python-list
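A minimal sketch of the function-plus-replacement-list approach
mentioned above. The names REPLACEMENTS and to_ascii are hypothetical
(not from any library), and the table only covers the five characters
listed in this message; extend it as your data requires.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hypothetical replacement table: characters that NFD does not decompose
# into an ASCII base letter plus combining marks, so they would otherwise
# be dropped by encode("ascii", "ignore").
REPLACEMENTS = {
    u"\u00ba": u"o",    # MASCULINE ORDINAL INDICATOR
    u"\u00aa": u"a",    # FEMININE ORDINAL INDICATOR
    u"\u00ab": u'"',    # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    u"\u00bb": u'"',    # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    u"\u20ac": u"EUR",  # EURO SIGN
}


def to_ascii(text, replacements=REPLACEMENTS):
    """Return an ASCII-only approximation of a unicode string."""
    # 1. Substitute the special cases before normalizing.
    for src, dst in replacements.items():
        text = text.replace(src, dst)
    # 2. Split accented letters into base letter + combining diacritic.
    decomposed = unicodedata.normalize("NFD", text)
    # 3. Drop everything that is still outside ASCII (the bare accents
    #    and any character not covered by the table).
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

With this, to_ascii(u"serviço móvil") should give 'servico movil', and
the ordinal indicators, angle quotes and euro sign are substituted
instead of being silently dropped. Anything not in the table and not
decomposable is still lost, so the result is lossy by design.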