Chris wrote:
> hi,
> to convert excel files via csv to xml or whatever I frequently use the
> csv module which is really nice for quick scripts. problem are of course
> non ascii characters like german umlauts, EURO currency symbol etc.

The umlauted characters should not be a problem; they're all in the first
256 Unicode code points (i.e. Latin-1). What makes you say they are a
problem "of course"?

> the current csv module cannot handle unicode the docs say, is there any
> workaround or is unicode support planned for the near future? in most
> cases support for characters in iso-8859-1(5) would be ok for my
> purposes but of course full unicode support would be great...
>

Here's a perambulation through some of the alternatives:

Alternative A: if you save the file from Excel as "Unicode text", you can
pretty much DIY:

>>> buff = file('csvtest.txt', 'rb').read()
>>> lines = buff.decode('utf16').split(u'\r\n')
>>> lines
[u'M\xfcller\t"\u20ac1234,56"', u'M\xf6ller\t"\u20ac9876,54"', u'Kawasaki\t\xa53456.78', u'']
>>> for line in lines:
...     print line.split(u'\t')
...
[u'M\xfcller', u'"\u20ac1234,56"']
[u'M\xf6ller', u'"\u20ac9876,54"']
[u'Kawasaki', u'\xa53456.78']
[u'']
>>>

All you have to do is handle:
(1) Excel's unnecessary quoting of the comma in the money amounts [see
    first two lines above; what it quotes is probably locale-dependent]
(2) the doubling of any embedded quotes [no example given]
(3) the empty "line" introduced by split().

Problem (3) is easy:

if lines and not lines[-1]:
    del lines[-1]

Hmmm ... by the time you finish this (and generalise it) you will have
done the Unicode extension to the csv module ...

Alternative B: you can do ODBC access to Excel spreadsheets; hmmm ... yuk
... no better than CSV, i.e. you get the data in your current code page,
not in Unicode:

[('M\xfcller', '\x801234,56'), ('M\xf6ller', '\x809876,54'), ('Kawasaki', '\xa53456.78')]

Alternative C: why not save your file as local-code-page .csv, use the csv
module, and DIY decode:

>>> import csv
>>> rdr = csv.reader(file('csvtest.csv', 'rb'))
>>> for row in rdr:
...     print row
...     urow = [x.decode('cp1252') for x in row]
...     print urow
...
['Name', 'Amount']
[u'Name', u'Amount']
['M\xfcller', '\x801234,56']
[u'M\xfcller', u'\u20ac1234,56']
['M\xf6ller', '\x809876,54']
[u'M\xf6ller', u'\u20ac9876,54']
['Kawasaki', '\xa53456.78']
[u'Kawasaki', u'\xa53456.78']
>>>

Looks good to me, including the euro sign.

HTH,
John

--
http://mail.python.org/mailman/listinfo/python-list
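
Footnote to Alternative A: the three clean-up steps listed there (strip Excel's quoting, undouble embedded quotes, drop the empty trailing "line") could be sketched roughly as below. This is only a minimal sketch, not a replacement for the csv module. The names split_excel_line and parse_unicode_text are made up for illustration, and it assumes the whole "Unicode text" export has already been read into a byte string, as in the session above. It does not attempt to cope with line breaks embedded inside quoted cells.

```python
def split_excel_line(line, delimiter=u'\t', quotechar=u'"'):
    """Split one decoded line into fields, honouring Excel-style quoting.

    Hypothetical helper: quotes around a field are stripped and a doubled
    quote inside a quoted field becomes a single literal quote.
    """
    fields = []
    field = []
    in_quotes = False
    i = 0
    n = len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == quotechar:
                if i + 1 < n and line[i + 1] == quotechar:
                    field.append(quotechar)  # doubled quote -> literal quote
                    i += 1                   # skip the second quote
                else:
                    in_quotes = False        # closing quote
            else:
                field.append(ch)
        elif ch == quotechar:
            in_quotes = True                 # opening quote
        elif ch == delimiter:
            fields.append(u''.join(field))   # field boundary
            field = []
        else:
            field.append(ch)
        i += 1
    fields.append(u''.join(field))           # last field on the line
    return fields


def parse_unicode_text(buff, encoding='utf-16'):
    """Decode a byte string saved as "Unicode text" and return rows of fields."""
    lines = buff.decode(encoding).split(u'\r\n')
    # Step (3): drop the empty "line" that split() leaves after the final CRLF.
    if lines and not lines[-1]:
        del lines[-1]
    # Steps (1) and (2): strip Excel's quoting and undouble embedded quotes.
    return [split_excel_line(line) for line in lines]
```

With the file from the session above, rows = parse_unicode_text(buff) should give one list of unicode strings per spreadsheet row, with the quotes around the euro amounts removed.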