thebjorn wrote:

> I've got a database (ms sqlserver) that's (way) out of my control,
> where someone has stored utf-8 encoded Unicode data in regular varchar
> fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
> as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/
>
> Then I read the data out using adodbapi (which returns all strings as
> Unicode) and I get u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'. I couldn't
> find any way to get back to the original short of:
>
> def unfk(s):
>     return eval(repr(s)[1:]).decode('utf-8')
>
> i.e. chopping off the u in the repr of a unicode string, and relying on
> eval to interpret the \xHH sequences.
>
> Is there a less hack'ish way to do this?
first, check if you can get your database adapter to understand that
the database contains UTF-8 and not ISO-8859-1. if that's not possible,
you can roundtrip via ISO-8859-1 yourself:

    >>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
    >>> u
    u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
    >>> u.encode("iso-8859-1")
    'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
    >>> u.encode("iso-8859-1").decode("utf-8")
    u'Bl\xe5b\xe6rsyltet\xf8y'
    >>> print u.encode("iso-8859-1").decode("utf-8")
    Blåbærsyltetøy

</F>
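The roundtrip works because ISO-8859-1 maps code points 0-255 to the identical byte values, so encoding recovers the original UTF-8 bytes exactly. It could be wrapped in a small helper along these lines. This is only a sketch: the name `fix_mojibake` is made up for illustration and is not from the thread, and it assumes that a string which cannot be roundtripped was not mis-decoded in the first place and should be returned unchanged.

```python
# -*- coding: utf-8 -*-

def fix_mojibake(s):
    """Undo a UTF-8-read-as-ISO-8859-1 mix-up on a unicode string.

    Hypothetical helper (not part of adodbapi or the original post).
    """
    try:
        # ISO-8859-1 maps code points 0..255 to the same byte values,
        # so this recovers the raw UTF-8 bytes stored in the database.
        raw = s.encode("iso-8859-1")
        # Decode those bytes as what they really are.
        return raw.decode("utf-8")
    except UnicodeError:
        # Either a code point above U+00FF (cannot be ISO-8859-1) or
        # the bytes are not valid UTF-8: assume the string was fine.
        return s
```

With this, `fix_mojibake(u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y')` gives `u'Bl\xe5b\xe6rsyltet\xf8y'`, the same result as the interactive session above. The fallback is a judgment call: a string that happens to contain only characters forming valid UTF-8 byte sequences would still be converted, so this is best applied only to columns known to be affected.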