Luis P. Mendes wrote:
> Errors occur when I assign the result of ''.join(cp for cp in de_str if
> not unicodedata.category(cp).startswith('M')) to a variable. The same
> happens with de_str. When I print the strings everything is ok.
>
> Here's a short example of data:
> 115448,DAÇÃO
> 117788,DA 1º DE MO Nº 2
>
> I used the following script to convert the data:
> # -*- coding: iso8859-15 -*-
>
> class Latin1ToAscii:
>
>     def abreFicheiro(self):
>         import csv
>         self.reader = csv.reader(open(self.input_file, "rb"))
>
>     def converter(self):
>         import unicodedata
>         self.lista_csv = []
>         for row in self.reader:
>             s = unicode(row[1],"latin-1")
>             de_str = unicodedata.normalize("NFD", s)
>             nome = ''.join(cp for cp in de_str if not \
>                 unicodedata.category(cp).startswith('M'))
>
>             linha_ascii = row[0] + "," + nome # *
>             print linha_ascii.encode("ascii")
>             self.lista_csv.append(linha_ascii)
>
>
>     def __init__(self):
>         self.input_file = 'nome_latin1.csv'
>         self.output_file = 'nome_ascii.csv'
>
> if __name__ == "__main__":
>     f = Latin1ToAscii()
>     f.abreFicheiro()
>     f.converter()
>
>
> And I got the following result:
> $ python latin1_to_ascii.py
> 115448,DACAO
> Traceback (most recent call last):
>   File "latin1_to_ascii.py", line 44, in ?
>     f.converter()
>   File "latin1_to_ascii.py", line 22, in converter
>     print linha_ascii.encode("ascii")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
> position 11: ordinal not in range(128)
>
>
> The script converted the ÇÃ from the first line, but not the º from the
> second one. Still in *, I also don't get a list as [115448,DAÇÃO] but a
> [u'115448,DAÇÃO'] element, which doesn't suit my needs.
>
> Would you mind telling me what should I change?

Calling this process "latin1 to ascii" was a misnomer; sorry that I used
that phrase. It should be called "latin1 to search key". There is no
requirement that the key be ASCII, so change the corresponding lines in
your code:

    linha_key = row[0] + "," + nome
    print linha_key
    self.lista_csv.append(linha_key.encode("latin-1"))

Encoding back to latin-1 before appending also gives you a plain byte
string in the list instead of the u'...' element you didn't want.

With regard to º, Richie already gave you food for thought. If you want
"1 DE MO" to match "1º DE MO", remove that symbol from the key (note that
translate() on a unicode string takes code points, not characters, as
dictionary keys):

    linha_key = linha_key.translate({ord(u"º"): None})

If you don't want such fuzzy matching, keep it.
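
For what it's worth, the key-building steps can be collected into one small helper. This is only a sketch, not a drop-in replacement for your class: the name search_key and its drop parameter are made up for illustration, and the code assumes it is handed text (unicode) strings rather than raw latin-1 bytes. It is written so that it should run on Python 3, and on Python 2.7 as long as you pass it unicode objects.

```python
import unicodedata

def search_key(text, drop=u"\u00ba"):
    """Build a fuzzy search key from a text (unicode) string.

    Hypothetical helper: `drop` lists characters to delete outright;
    the default is the masculine ordinal indicator (U+00BA).
    """
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode categories Mn, Mc, Me).
    stripped = u"".join(cp for cp in decomposed
                        if not unicodedata.category(cp).startswith("M"))
    # translate() on text strings wants integer code points as keys.
    table = dict((ord(ch), None) for ch in drop)
    return stripped.translate(table)

# The two sample rows from the question, written with escapes so the
# snippet does not depend on the source file's encoding.
print(search_key(u"DA\u00c7\u00c3O"))
print(search_key(u"DA 1\u00ba DE MO N\u00ba 2"))
```

Passing drop=u"" keeps the º in the key, which is the "no fuzzy matching" variant.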