Hugo Arts wrote:

> 2011/4/3 "Andrés Chandía" <and...@chandia.net>:
>>
>> I continue working with RegExp, but I have reached a point for which I
>> can't find documentation. Maybe there is no possible way to do it;
>> anyway, I throw the question:
>>
>> This is my code:
>>
>> contents = re.sub(r'Á', "A", contents)
>> contents = re.sub(r'á', "a", contents)
>> contents = re.sub(r'É', "E", contents)
>> contents = re.sub(r'é', "e", contents)
>> contents = re.sub(r'Í', "I", contents)
>> contents = re.sub(r'í', "i", contents)
>> contents = re.sub(r'Ó', "O", contents)
>> contents = re.sub(r'ó', "o", contents)
>> contents = re.sub(r'Ú', "U", contents)
>> contents = re.sub(r'ú', "u", contents)
>>
>> It is clear that I need to convert any accented vowel into the same
>> unaccented vowel. The question is: is there a way to say that whenever
>> you find an accented character, it has to change into the unaccented
>> character? But not every character: it must be only these vowels,
>> accented this way, because in the language I am working with there are
>> letters like ü and ñ that should remain the same.
>>
>
> Okay, first thing, forget about regexes for this problem. They're too
> complicated and not suited to it.
>
> Encoding issues make this a somewhat complicated problem. In Unicode,
> there are two ways to encode most accented characters. For example, the
> character "Ć" can be encoded both by U+0106, "LATIN CAPITAL LETTER C
> WITH ACUTE", and by a combination of U+0043 and U+0301, being simply 'C'
> and the 'COMBINING ACUTE ACCENT', respectively. You must remove both
> forms to be sure every accented character is gone from your string.
>
> Using unicode.translate, you can craft a translation table to
> translate the accented characters to their non-accented counterparts.
> The combining characters can simply be removed by mapping them to
> None.
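For the record, the translation-table idea quoted above could look roughly like the sketch below. It is written for Python 3, where str.translate does the job that unicode.translate does in Python 2. The names ACCENTED, PLAIN, TABLE and strip_acute are made up for this illustration.

```python
# Explicit table: only the acute-accented vowels get an entry, so every other
# character (including ü and ñ) passes through str.translate unchanged.
ACCENTED = "ÁÉÍÓÚáéíóú"
PLAIN = "AEIOUaeiou"

# str.translate expects a mapping from code point to code point, str, or None.
TABLE = {ord(a): ord(p) for a, p in zip(ACCENTED, PLAIN)}

# Mapping U+0301 COMBINING ACUTE ACCENT to None deletes it, which covers the
# decomposed form (plain vowel followed by the combining accent).
TABLE[0x0301] = None


def strip_acute(text):
    """Return text with acute accents removed from vowels (hypothetical helper)."""
    return text.translate(TABLE)


print(strip_acute("Árbol, café, pingüino, año"))
```

One caveat with this sketch: deleting U+0301 strips the acute accent from any decomposed letter, not only from vowels, so a decomposed "ń" would also lose its accent. A decomposed ü or ñ is unaffected, because those use U+0308 and U+0303.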
If you go that road you might be interested in Fredrik Lundh's article at

http://effbot.org/zone/unicode-convert.htm

The class presented there is a bit tricky, but for your purpose it might be
sufficient to subclass it:

>>> KEEP_CHARS = set(ord(c) for c in u"üñ")
>>> class Map(unaccented_map):
...     def __missing__(self, key):
...         if key in KEEP_CHARS:
...             self[key] = key
...             return key
...         return unaccented_map.__missing__(self, key)
...
>>> print u"äöü".translate(Map())
aoü
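If you don't have the class from the article at hand, the same lazy-lookup trick can be approximated with the standard unicodedata module. This is only a rough Python 3 sketch and not Lundh's unaccented_map: LazyUnaccentMap and unaccent are names invented here, and the sketch simply drops every combining mark from a character's canonical decomposition.

```python
import unicodedata

# Characters that must survive untouched (precomposed forms).
KEEP_CHARS = {ord(c) for c in "üñ"}


class LazyUnaccentMap(dict):
    """Translation table filled on demand (hypothetical stand-in class)."""

    def __missing__(self, key):
        if key in KEEP_CHARS:
            result = key
        else:
            # Canonically decompose the character, then drop combining marks.
            decomposed = unicodedata.normalize("NFD", chr(key))
            stripped = "".join(
                c for c in decomposed if not unicodedata.combining(c)
            )
            # An empty result means the key itself was a combining mark:
            # map it to None so that str.translate deletes it.
            result = stripped or None
        self[key] = result  # cache so each code point is computed only once
        return result


def unaccent(text):
    """Strip accents except on KEEP_CHARS (hypothetical helper)."""
    # Compose first, so a decomposed "u" + U+0308 is recognised as "ü".
    return unicodedata.normalize("NFC", text).translate(LazyUnaccentMap())


print(unaccent("äöü"))  # aoü
```

Characters without a canonical decomposition, such as "ø" or "ß", pass through unchanged here, so they would still need explicit table entries if you want them replaced.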