David Pratt wrote:
> I want to prepare strings for db storage that come from normal Windows
> machine (cp1252) so my understanding is to unicode and encode to utf-8
> and to store properly.

That also depends on the database. The database must accept
UTF-8-encoded strings, and must not modify them in any form or way.
Some databases fail here, and work better if you pass Unicode objects
to them directly.

> Since data will be used on the web I would not
> have to change my encoding when extracting from the database. This first
> example I believe simulates this with the 3/4 symbol. Here I want to
> store '\xc2\xbe' in my database.
>
>>>> tq = u'\xbe'

You can verify that this is really 3/4:

py> import unicodedata
py> unicodedata.name(u"\xbe")
'VULGAR FRACTION THREE QUARTERS'

>>>> tq_utf = tq.encode('utf8')
>>>> tq, tq_utf
> (u'\xbe', '\xc2\xbe')

So it should be clear now that '\xc2\xbe' is the UTF-8 encoding
of that character.

> To unicode with a variable, my understanding is that I can unicode and
> encode at the same time

Not sure what you mean by "same time" (I'm not even sure what "I can
unicode" means - unicode is not a verb, it's a noun).

>>>> tq = '\xbe'
>>>> tq_utf = unicode(tq, 'utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 0:
> unexpected code byte
>
> This is not working for me. Can someone explain why? Many thanks.

Of course not: the single byte '\xbe' is not valid UTF-8. The UTF-8
encoding of the character, as we have seen earlier, is '\xc2\xbe'.
So you should write

py> unicode('\xc2\xbe', 'utf-8')
u'\xbe'

You mentioned windows-1252 at some point. If you are given windows-1252
bytes, you can do

py> unicode('\xbe', 'windows-1252')
u'\xbe'

If you are looking for "at the same time", perhaps this is also
interesting:

py> unicode('\xbe', 'windows-1252').encode('utf-8')
'\xc2\xbe'

Regards,
Martin
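
Note for readers on Python 3: the transcripts above are Python 2, where unicode() is a builtin. A minimal sketch of the same windows-1252 to UTF-8 round trip, assuming Python 3 (where byte strings and text are separate types), might look like this. The variable names are illustrative only and are not from the original thread.

```python
import unicodedata

# Assumption: the input arrives as raw windows-1252 bytes, as in the question.
cp1252_bytes = b"\xbe"

# Step 1: decode the bytes to text using the encoding they were written in.
text = cp1252_bytes.decode("windows-1252")

# Step 2: encode the text as UTF-8 for storage.
utf8_bytes = text.encode("utf-8")

print(unicodedata.name(text))   # VULGAR FRACTION THREE QUARTERS
print(ascii(text))              # '\xbe'
print(utf8_bytes)               # b'\xc2\xbe'

# Decoding the raw cp1252 byte as UTF-8 fails, as in the traceback above:
# 0xbe on its own is not a complete UTF-8 sequence.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)            # True
```

The two steps can also be chained as cp1252_bytes.decode("windows-1252").encode("utf-8"), which mirrors the one-line Python 2 form shown above.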