Ksenia Marasanova wrote: > > > There is some amount of data in a database (PG) that must be inserted > > > into Excel sheet and emailed. Nothing special, everything works. > > > Except that non-ascii characters are not displayed properly. > > > The data is stored as XML into a text field. > > > > This sentence doesn't make much sense. Explain. > > Sorry, I meant: I use field of the type 'text' in a Postgres table to > store my data. The data is a XML string. > > > Instead of "print data", do "print repr(data)" and show us what you > > get. What *you* see on the screen is not much use for diagnosis; it's > > the values of the bytes in the file that matter. > > Thanks for this valuable tip. I take letter "Ã" as an example. > > "print repr(data)" shows this: > u'\xe9' > > > > Open the spreadsheet with Microsoft Excel, copy-and-paste some data to > > a Notepad window, save the Notepad file as Unicode type named (say) > > "junk.u16" then at the Python interactive prompt do this: > > > > file("junk.u16", "rb").read().decode("utf16") > > > > and show us what you get. > > (I am on a Mac so I used Textedit to create a UTF-16 encoded file, right?) > The result from Python is: > > u'\u0439'
So Unicode U+00E9 has become U+0439? Magic! I suspect that there is a conversion step or two in there that you haven't mentioned. Are you talking about the spreadsheet after it is created by your script on the machine that created it [which is what I asked], or are you talking about the spreadsheet on the recipient's machine, or are you talking about the spreadsheet after the recipient has e-mailed it back to you, hopefully untouched? > > In Excel sheet it is shown as: Ð > > (Russian again?!) This is probably indicative that the Latin-1 e-acute (0xE9) is being converted to Unicode U+0439 by something that thinks it is actually in an 8-bit Cyrillic encoding (0xE9 is "Cyrillic small letter short I" in some 8-bit encodings) but the u-umlaut becoming GHE example doesn't fit this story. Please do a test where you put several different accented Latin letters in the one field in your database. No, put ALL the non-ASCII characters that you expect to be transmitted unchanged into test fields -- this will make you think about what language(s)/locale(s) your database is designed for and what language(s)/locale(s) your e-mail targets use. Having this test data will be useful in the future for verifying that your system works. Repeat all the above steps. Tell us what you see in Excel on your machine and on the recipient's machine. > > > > > > and Python email package to email it... and the resulting sheet > > > is not good: > > > > E-mailed how? To whom? [I.e. what country / what cultural background / > > on what machine / what operating system / viewed using what software] > > Emailed with Python, please see the code at the end of the message. > The receiving system is OS X with languages priority: Dutch, English, > German, Russian and Hebrew. Viewer: MS Office 2004. Sending system is ...? > > > > You are saying (in effect) U+0413 (Cyrillic upper case letter GHE) is > > displayed instead of U+00FC (Latin small letter U with diaeresis). > > > > OK, we'd already guessed your background from your name :-) > > :-) > > > > However, what you see isn't necessarily what you've got. How do you > > know it's not U+0393 (Greek capital letter GAMMA) or something else > > that looks the same? Could even be from a line-drawing set (top left > > corner of a box). What you need to do is find out the ordinal of the > > character being displayed. > > > > This type of problem arises when a character is written in one encoding > > and viewed using another. I've had a quick look through various > > likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251, > > cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of > > encodings that would reproduce anything like your "umlauted-u becomes > > gamma-or-similar" problem. Please supply more than 1 example. > > Thank you very much for your help and explanation! The "Ã" letter is > what I could find till now, see examples above.. > > > > > The following code fragment is used for creating Excel sheet and sending email: > # ################## > # Create Excel sheet > # > f = StringIO() > workbook = xl.Writer(f) > worksheet = workbook.add_worksheet() > bold = workbook.add_format(bold=1) > border = workbook.add_format(border=1) > > worksheet.write_row('A1', ['First Name', 'Last Name'], border) > i = 1 > for row in result: > print repr(row['firstname']) > datarow = [row.get('firstname'), row.get('surname')] You print the repr() of row['firstname'] but pass row.get('firstname') to the XLS writer -- do they have the same value? For believability, print what you pass!! > i += 1 > worksheet.write_row('A%s' % i, datarow) Check the documentation for the XLS writer module to see if it is doing an implicit conversion here. > workbook.close() > #################### > # Create email message > > # Create the container (outer) email message. > msg = MIMEMultipart() > # Attach Excel sheet > xls = MIMEBase('application', 'vnd.ms-excel') > xls.set_payload(f.getvalue()) > Encoders.encode_base64(xls) > xls.add_header('Content-Disposition', 'attachment', filename='some > file name %s-%s-%s.xls' % (today.day, today.month, today.year)) > msg.attach(xls) > > msg['Subject'] = subject > msg['From'] = fromaddr > msg['To'] = toaddr > # Guarantees the message ends in a newline > msg.epilogue = '' > > # Send message > s = smtplib.SMTP(smtp_host) > s.sendmail(fromaddr, to_list, msg.as_string()) > > -- > Ksenia -- http://mail.python.org/mailman/listinfo/python-list