Luis P. Mendes wrote:
> Errors occur when I assign the result of ''.join(cp for cp in de_str if
> not unicodedata.category(cp).startswith('M')) to a variable.  The same
> happens with de_str.  When I print the strings everything is ok.
> Here's a short example of data:
> 115448,DAÇÃO
> 117788,DA 1º DE MO Nº 2
> I used the following script to convert the data:
> # -*- coding: iso8859-15 -*-
> class Latin1ToAscii:
>       def abreFicheiro(self):
>               import csv
>               self.reader = csv.reader(open(self.input_file, "rb"))
>       def converter(self):
>               import unicodedata
>               self.lista_csv = []
>               for row in self.reader:
>                       s = unicode(row[1],"latin-1")
>                       de_str = unicodedata.normalize("NFD", s)
>                       nome = ''.join(cp for cp in de_str if not \
>                       unicodedata.category(cp).startswith('M'))
>                       linha_ascii = row[0] + "," + nome  # *
>                       print linha_ascii.encode("ascii")
>                       self.lista_csv.append(linha_ascii)
>       def __init__(self):
>               self.input_file = 'nome_latin1.csv'
>               self.output_file = 'nome_ascii.csv'
> if __name__ == "__main__":
>       f = Latin1ToAscii()
>       f.abreFicheiro()
>       f.converter()
> And I got the following result:
> $ python
> 115448,DACAO
> Traceback (most recent call last):
>   File "", line 44, in ?
>     f.converter()
>   File "", line 22, in converter
>     print linha_ascii.encode("ascii")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
> position 11: ordinal not in range(128)
> The script converted the ÇÃ from the first line, but not the º from the
> second one.  Still in *, I also don't get a list as [115448,DAÇÃO] but a
> [u'115448,DAÇÃO'] element, which doesn't suit my needs.
> Would you mind telling me what should I change?

Calling this process "latin1 to ascii" was a misnomer; sorry that I
used this phrase. It should be called "latin1 to search key". There is
no requirement that the key be ASCII, so change the corresponding
lines in your code:

linha_key = row[0] + "," + nome
print linha_key

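With regard to º, Richie already gave you food for thought: if you
want "1 DE MO" to match "1º DE MO", remove that symbol from the key
(linha_key = linha_key.translate({ord(u"º"): None})); if you don't want
such fuzzy matching, keep it. Note that translate() on a unicode string
looks characters up by their ordinal, so the dictionary key has to be
ord(u"º") rather than the character itself.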
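
To put the pieces together, here is a rough sketch of the key-building
step as a standalone function. It is written in Python 3 syntax (where
str is already Unicode), whereas the script above is Python 2. The name
search_key and the drop parameter are made up for illustration, so
adapt them as needed.

```python
import unicodedata

# Hypothetical helper, not part of the original script.
def search_key(text, drop=u"\u00ba"):
    """Return text without combining marks and without the chars in drop."""
    # NFD splits e.g. "Ç" into "C" plus a combining cedilla.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (Unicode categories Mn, Mc, Me).
    stripped = u"".join(cp for cp in decomposed
                        if not unicodedata.category(cp).startswith("M"))
    # translate() looks characters up by code point, hence ord().
    return stripped.translate({ord(ch): None for ch in drop})

print(search_key(u"DAÇÃO"))
print(search_key(u"DA 1º DE MO Nº 2"))
```

Pass drop=u"" if you would rather keep the º in the key and skip the
fuzzy matching.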

