Chris wrote:
> hi,
> thanks for all replies, I try if I can at least get the work done.
>
> I guess my problem mainly was the rather mindflexing (at least for me)
> coding/decoding of strings...
>
> But I guess it would be really helpful to put the UnicodeReader/Writer
> in the docs

UNFORTUNATELY the solution of saving the Excel .XLS to a .CSV doesn't work if you have Unicode characters that are not in your Windows code-page. Nor would it work in a CJK environment if the file was saved in an MBCS encoding (e.g. Big5).

A work-around appears possible, with some more effort:

I have extended the previous sample XLS; there is now a last line with IVANOV in Cyrillic letters [pardon my spelling etc etc if necessary]. My code-page is cp1252, which sure don't grok Russki :-)

I've saved it as CSV [no complaint from Excel] and as "Unicode text".

>>> buffc = file('csvtest2.csv', 'rb').read()
>>> buffc
'Name,Amount\r\nM\xfcller,"\x801234,56"\r\nM\xf6ller,"\x809876,54"\r\nKawasaki,\xa53456.78\r\n??????,"?5678,90"\r\n'

In the CSV, the Cyrillic characters have been silently replaced by question marks.

>>> buffu16 = file('csvtest2.txt', 'rb').read()
>>> buffu16
'\xff\xfeN\x00a\x00m\x00e\x00\t\x00A\x00m\x00o\x00u\x00n\x00t\x00\r\x00\n\x00 [snip] \x18\x04\x12\x04\x10\x04\x1d\x04\x1e\x04\x12\x04\t\x00"\x00 \x045\x006\x007\x008\x00,\x009\x000\x00"\x00\r\x00\n\x00'
>>> buffu = buffu16.decode('utf16')
>>> buffu
u'Name\tAmount\r\nM\xfcller\t"\u20ac1234,56"\r\nM\xf6ller\t"\u20ac9876,54"\r\nKawasaki\t\xa53456.78\r\n\u0418\u0412\u0410\u041d\u041e\u0412\t"\u04205678,90"\r\n'

Aside: this has removed the BOM. I understood (possibly incorrectly) from a recent thread that Python codecs left the BOM in there, but hey I'm not complaining :-)

As expected, this looks OK. The extra step required in the work-around is to convert the utf16 file to utf8 and feed that to the csv reader.

Why utf8?
(1) Every Unicode character can be represented, not just the ones that are in your code-page.
(2) ASCII characters can't appear as part of the representation of any other character -- i.e. the ones that are significant to csv (tab, comma, quote, \r, \n) can't cause errors by showing up as part of another character, e.g. a CJK character.

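
To make point (2) concrete, here is a minimal sketch. It is written for Python 3 rather than the Python 2 of the session in this post, and the helper name special_bytes_in and the sample character U+4E2C are my own illustrative choices, not something from the original files. It uses UTF-16 as the contrasting encoding, since that is the one Excel's "Unicode text" produces.

```python
# Bytes that the csv module treats as structure: tab, comma, quote, CR, LF.
CSV_SPECIAL = set(b'\t,"\r\n')

def special_bytes_in(text, encoding):
    """Return the csv-significant byte values found in text's encoded form."""
    return sorted(set(text.encode(encoding)) & CSV_SPECIAL)

# Hypothetical sample: a CJK ideograph whose UTF-16 code unit is 0x4E2C.
ch = '\u4e2c'

# In UTF-16-LE the low byte 0x2C is identical to an ASCII comma.
utf16_hits = special_bytes_in(ch, 'utf-16-le')

# In UTF-8 every byte of a non-ASCII character is >= 0x80, so nothing matches.
utf8_hits = special_bytes_in(ch, 'utf-8')

print(utf16_hits, utf8_hits)
```

A byte-oriented csv reader fed the UTF-16 bytes directly could therefore see a delimiter in the middle of a character, which is the kind of error the utf8 conversion avoids.
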
>>> buffu8 = buffu.encode('utf8')
>>> buffu8
'Name\tAmount\r\nM\xc3\xbcller\t"\xe2\x82\xac1234,56"\r\nM\xc3\xb6ller\t"\xe2\x82\xac9876,54"\r\nKawasaki\t\xc2\xa53456.78\r\n\xd0\x98\xd0\x92\xd0\x90\xd0\x9d\xd0\x9e\xd0\x92\t"\xd0\xa05678,90"\r\n'
>>> x = file('csvtest2.u8', 'wb')
>>> x.write(buffu8)
>>> x.close()
>>> import csv
>>> rdr = csv.reader(file('csvtest2.u8', 'rb'), delimiter='\t')
>>> for row in rdr:
...     print row
...     print [x.decode('utf8') for x in row]
...
['Name', 'Amount']
[u'Name', u'Amount']
['M\xc3\xbcller', '\xe2\x82\xac1234,56']
[u'M\xfcller', u'\u20ac1234,56']
['M\xc3\xb6ller', '\xe2\x82\xac9876,54']
[u'M\xf6ller', u'\u20ac9876,54']
['Kawasaki', '\xc2\xa53456.78']
[u'Kawasaki', u'\xa53456.78']
['\xd0\x98\xd0\x92\xd0\x90\xd0\x9d\xd0\x9e\xd0\x92', '\xd0\xa05678,90']
[u'\u0418\u0412\u0410\u041d\u041e\u0412', u'\u04205678,90']
>>>

Howzat?

Cheers,
John
-- 
http://mail.python.org/mailman/listinfo/python-list