On Nov 29, 8:27 am, Jeff H <[EMAIL PROTECTED]> wrote:
> On Nov 28, 2:03 pm, Terry Reedy <[EMAIL PROTECTED]> wrote:
>
> > Jeff H wrote:
> > > hashlib.md5 does not appear to like unicode,
> > > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > > position 1650: ordinal not in range(128)
>
> > It is the (default) ascii encoder that does not like non-ascii chars.
> > I suspect that is you encode to bytes first with an encoder that does
> > work (latin-???), md5 will be happy.
>
> > Reports like this should include Python version.
>
> > > After googling, I've found BDFL and others on Py3K talking about the
> > > problems of hashing non-bytes (i.e. buffers)
> > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html
>
> > > So what is the canonical way to hash unicode?
> > > * convert unicode to local
> > > * hash in current local
> > > ???
> > > but what if local has ordinals outside of 128?
>
> > > Is this just a problem for md5 hashes that I would not encounter using
> > > a different method? i.e. Should I just use the built-in hash function?
>
> Python v2.52 -- however, this is not really a bug report because your
> analysis is correct. I am converting cp1252 strings to unicode before
> I persist them in a database. I am looking for advice/direction/
> wisdom on how to sling these strings<g>
>
> -Jeff

Actually, what surprises me is that hashlib cares about the encoding at all. An md5 hash can be produced for an .iso file, which means it can handle bytes, so why does it care what it is being fed, as long as it is bytes? I would have assumed that it would take whatever was fed to it, view it as a byte array, and hash it.

You can read a binary file and hash it:

    print md5.new(open('foo.iso', 'rb').read()).hexdigest()

What do I need to do to tell hashlib not to try to decode, and just treat the data as binary?
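
For what it is worth, hashlib is not decoding anything here. A unicode object has no single byte representation, so Python 2 tries to produce one with the default ascii codec, and that is the step that fails. Below is a minimal sketch of the workaround Terry suggested: encode to bytes yourself, then hash. The helper name md5_hex and the utf-8 default are my own assumptions and not part of hashlib. Use whichever codec you settle on, but use it consistently, because the same text hashes differently under different encodings.

```python
import hashlib

def md5_hex(text, encoding="utf-8"):
    # Hypothetical helper (not part of hashlib): hash bytes as-is, and
    # encode text explicitly instead of relying on the implicit ascii codec.
    if isinstance(text, bytes):
        data = text
    else:
        data = text.encode(encoding)
    return hashlib.md5(data).hexdigest()

# u'\xa6' is the character from the traceback quoted above.
sample = u"caf\xe9 \xa6"

utf8_digest = md5_hex(sample)               # hash of the utf-8 bytes
cp1252_digest = md5_hex(sample, "cp1252")   # hash of the cp1252 bytes

# The two digests differ, so the encoding has to be fixed before
# any hashes are stored or compared.
print(utf8_digest)
print(cp1252_digest)
```

If the strings arrive as cp1252 bytes in the first place, hashing those original bytes before converting to unicode avoids the question entirely, provided every producer of the data uses the same code page.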