Yves Dorfsman wrote:
> Can you put UTF-8 characters in a dbhash in Python 2.5?
> It fails when I try:
>
> #!/bin/env python
> # -*- coding: utf-8 -*-
>
> import dbhash
>
> db = dbhash.open('dbfile.db', 'w')
> db[u'smiley'] = u'☺'
> db.close()
>
> Do I need to change the BSD DB library, or is there no way to make it work
> with Python 2.5?

Please write the following program and meditate in front of it for at least 30 minutes:

while True:
    print "utf-8 is not unicode"

Once this seemingly minor detail has sunk in, you are ready to work with the variant below, which will work:

#!/bin/env python
# -*- coding: utf-8 -*-

import dbhash

# 'w' opens an existing database read-write; use 'c' to create it if it is missing
db = dbhash.open('dbfile.db', 'w')
db[u'smiley'.encode('utf-8')] = u'☺'.encode('utf-8')
db.close()

What is the difference? The dbhash module can only work with *bytestrings*. Bytestrings are just that: a sequence of 8-bit values. u""-literals are *unicode objects*. These are an abstract sequence of characters, smileys or others.

Now the real world of databases, network connections and hard drives doesn't know about unicode. They only know bytes. So before you can write to them, you need to "encode" the unicode data to a byte-stream representation. There are quite a few of these, e.g. latin1, or the aforementioned UTF-8, which has the property that it can represent *all* unicode characters, at the cost of potentially needing more than one byte per character.

Which is why the code above has those encode calls on the unicode objects, for the key as well as the value.

But beware! Once you have encoded the data, there is no way to *know* its encoding from the bytes alone. So when reading the data back, you will get *bytestrings*, and you need to "decode" them with the proper encoding. In this case, again utf-8. Which brings us to the second part of the program:

db = dbhash.open('dbfile.db')
smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')
print smiley.encode('utf-8')

The last encode is there to print the smiley on a terminal, one of those pesky bytestream-eaters that don't know about unicode. (It assumes your terminal expects UTF-8.)

Diez
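
To see the encode/decode round trip without a database involved, here is a minimal sketch. It is my own illustration, not part of dbhash: a plain dict stands in for the database, and the helper names `to_bytes` and `from_bytes` are made up for this example. It avoids dbhash so that it should also run on Python 3.

```python
# -*- coding: utf-8 -*-
# Illustration only: a dict stands in for the byte-oriented database.

def to_bytes(text):
    # Hypothetical helper: unicode object -> UTF-8 bytestring.
    return text.encode('utf-8')

def from_bytes(raw):
    # Hypothetical helper: UTF-8 bytestring -> unicode object.
    return raw.decode('utf-8')

smiley = u'\u263a'  # the same character as u'☺' in the thread

# One abstract character becomes three concrete bytes in UTF-8.
utf8_bytes = to_bytes(smiley)
n_chars = len(smiley)
n_bytes = len(utf8_bytes)

# latin1 has no code for this character, so encoding to it fails.
try:
    smiley.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Store and fetch using bytestrings only, as dbhash requires.
fake_db = {}
fake_db[to_bytes(u'smiley')] = utf8_bytes
right = from_bytes(fake_db[to_bytes(u'smiley')])

# The bytes do not record their encoding: decoding with the wrong
# codec raises no error here, it just yields different characters.
wrong = utf8_bytes.decode('latin-1')
```

Here `right` compares equal to the original smiley, while `wrong` is a three-character string of unrelated latin1 characters. That silent mismatch is the reason you have to remember which encoding you wrote with.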