Neil Hodgson wrote:
> JB:
>
> > as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
> > to the database using sqlalchemy they appear as – and other
> > characters.
>
> The encoding is UTF-8. Normally the best way to handle encodings is
> to convert to Unicode strings (unicode(s, "UTF-8")) as soon as possible
> and perform most processing in Unicode.

Good advice to work in Unicode (and in Python 3.X str is unicode), but I'd guess the encoding he's getting is "Windows-1252". The default character set of HTTP is ISO-8859-1, but Microsoft likes to use Windows-1252 in its place.

What to do about it? First, try specifying utf-8 in the form containing the textarea, as in

    <form action="process.cgi" accept-charset="utf-8">

Note that specifying ISO-8859-1 will not work, in that Microsoft will still use Windows-1252. I've heard they've gotten better at supporting utf-8, but I haven't tested.

When a request comes in, check for a Content-Type header that names the character set, which should be:

    Content-Type: application/x-www-form-urlencoded; charset=utf-8

Then you can decode to a unicode object as Neil Hodgson explained.

In case you still have to deal with Windows-1252, Python knows how to translate Windows-1252 to the matching Unicode characters. In current Python 2.x:

    ustring = unicode(raw_string, 'Windows-1252')

In Python 3.X, what comes from a socket is bytes, and str means unicode:

    ustring = str(raw_bytes, 'Windows-1252')

Of course this all assumes that JB's database likes Unicode. If it chokes, then alternatives include encoding back to utf-8 and storing as binary, or translating characters to some best-fit in the set the database supports.

-- 
--Bryan Olson
-- 
http://mail.python.org/mailman/listinfo/python-list
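
P.S. The header check and the Windows-1252 fallback described above could be sketched roughly like this in Python 3. The function names charset_from_content_type and decode_form_bytes are made up for illustration. Trying utf-8 before Windows-1252 when no charset is declared is my own assumption, as is the final latin-1 step for the few bytes Python's Windows-1252 codec leaves undefined. Treat it as a starting point, not a tested recipe for any particular web framework.

```python
def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None.

    Hypothetical helper: a deliberately simple parser, not a full
    implementation of the HTTP header grammar.
    """
    # Everything after the first ';' is a list of name=value parameters.
    for part in header_value.split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"')
    return None


def decode_form_bytes(raw_bytes, content_type=None):
    """Decode submitted form bytes to str (hypothetical helper).

    Order of attempts (an assumption, not a standard): the declared
    charset if any, then utf-8, then Windows-1252.
    """
    declared = charset_from_content_type(content_type or "")
    candidates = [declared] if declared else []
    candidates += ["utf-8", "windows-1252"]
    for name in candidates:
        try:
            return raw_bytes.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Python's Windows-1252 codec rejects a few undefined bytes such as
    # 0x81; latin-1 maps every byte to a code point, so it never fails.
    return raw_bytes.decode("latin-1")
```

A call such as decode_form_bytes(body, environ.get("CONTENT_TYPE")) would be the typical use in a CGI or WSGI setting, where body and environ stand for whatever your server hands you. Strict utf-8 decoding almost always rejects Windows-1252 text that contains non-ASCII characters, which is why the guess is usually safe, though it is still a guess.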