Re: unicode and hashlib
On Nov 29, 12:23 pm, Scott David Daniels <[EMAIL PROTECTED]> wrote:
> Scott David Daniels wrote:
> > ...
> > If you now, and for all time, decide that the only source you will
> > take is cp1252, perhaps you should decode to cp1252 before hashing.
>
> Of course my dyslexia sticks out here as I get encode and decode
> exactly backwards -- Marc 'BlackJack' Rintsch has it right.
>
> Characters (a concept) are "encoded" to a byte format (a
> representation). Bytes (a precise representation) are "decoded" to
> characters (a format with semantics).
>
> --Scott David Daniels
> [EMAIL PROTECTED]

Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize
that hashlib was trying to encode (not decode) my unicode object as
'ascii' (my default encoding), and since the data contained ordinals
outside range(128) -- shhh'boom. So once I have character strings
transformed internally to unicode objects, I should encode them in
'utf-8' before handing them to anything that would otherwise guess at
the proper encoding (i.e. hashlib):

>>> a='André'
>>> b=unicode(a,'cp1252')
>>> b
u'Andr\xc3\xa9'
>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'b4e5418a36bc4badfc47deb657a2b50c'

Scott then points out that utf-8 is probably superior (for use within
the code I control) to utf-16 and utf-32, which both have two variants,
and which variant you get can depend on the installed software and/or
processor. utf-8, unlike -16/-32, stays reliable and reproducible
irrespective of software or hardware.

decode vs encode:
You decode from one character set to a unicode object.
You encode from a unicode object to a specified character set.

Please correct me if you see something wrong, and thank you for your
advice and direction.

u'unicordial-ly yours. ;)'
Jeff
--
http://mail.python.org/mailman/listinfo/python-list
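The decode-then-encode round trip Jeff settles on can be sketched in
modern Python 3, where str is already unicode; the cp1252 bytes for
'André' here are a hypothetical stand-in for his real input:

```python
import hashlib

# Hypothetical sample input: 'André' as it would arrive from a cp1252
# source (0xE9 is 'é' in cp1252).
raw = b"Andr\xe9"

text = raw.decode("cp1252")   # bytes -> characters (decode)
data = text.encode("utf-8")   # characters -> bytes (encode)

print(text)                   # André
# The digest is stable because the encoding is pinned to utf-8.
print(hashlib.md5(data).hexdigest())
```

Note this is the flow Scott describes: decode at the boundary where the
bytes come in, encode explicitly to utf-8 just before hashing, so no
default codec is ever consulted.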
Re: unicode and hashlib
On Nov 29, 8:27 am, Jeff H <[EMAIL PROTECTED]> wrote:
> On Nov 28, 2:03 pm, Terry Reedy <[EMAIL PROTECTED]> wrote:
> > Jeff H wrote:
> > > hashlib.md5 does not appear to like unicode,
> > > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > > position 1650: ordinal not in range(128)
> >
> > It is the (default) ascii encoder that does not like non-ascii chars.
> > I suspect that if you encode to bytes first with an encoder that does
> > work (latin-???), md5 will be happy.
> >
> > Reports like this should include Python version.
> >
> > > After googling, I've found BDFL and others on Py3K talking about the
> > > problems of hashing non-bytes (i.e. buffers)
> > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html
> >
> > > So what is the canonical way to hash unicode?
> > > * convert unicode to locale
> > > * hash in current locale
> > > ???
> > > but what if the locale has ordinals outside of 128?
> >
> > > Is this just a problem for md5 hashes that I would not encounter
> > > using a different method? i.e. Should I just use the built-in hash
> > > function?
>
> Python v2.5.2 -- however, this is not really a bug report, because
> your analysis is correct. I am converting cp1252 strings to unicode
> before I persist them in a database. I am looking for advice/
> direction/wisdom on how to sling these strings
>
> -Jeff

Actually, what surprises me is that hashlib cares about the encoding at
all. An md5 hash can be produced for an .iso file, which means it can
handle bytes, so why does it care what it is being fed, as long as
there are bytes? I would have assumed that it would view whatever was
fed to it as a byte array and hash that. You can read a binary file and
hash it:

print md5.new(file('foo.iso').read()).hexdigest()

What do I need to do to tell hashlib not to try and decode, and just
treat the data as binary?
--
http://mail.python.org/mailman/listinfo/python-list
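Jeff's intuition is right: hashlib never decodes anything, it only
consumes bytes. The failure came from the implicit to-bytes conversion
Python 2 attempted on a unicode object. A small Python 3 sketch (using
a stand-in blob rather than a real .iso) shows both sides:

```python
import hashlib

# Arbitrary binary data, e.g. what file('foo.iso').read() would return;
# a 256-byte stand-in blob rather than a real .iso.
blob = bytes(range(256))

# hashlib is perfectly happy with raw bytes -- no encoding involved.
print(hashlib.md5(blob).hexdigest())

# Only text needs converting first. Python 3's hashlib refuses str
# outright instead of guessing an encoding the way Python 2 did:
try:
    hashlib.md5("Andr\xe9")
except TypeError as exc:
    print("str is rejected:", exc)

# The caller chooses the encoding explicitly:
print(hashlib.md5("Andr\xe9".encode("utf-8")).hexdigest())
```

So there is nothing to "tell" hashlib; the decode attempt happened
before the bytes ever reached it.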
Re: unicode and hashlib
On Nov 28, 2:03 pm, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Jeff H wrote:
> > hashlib.md5 does not appear to like unicode,
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > position 1650: ordinal not in range(128)
>
> It is the (default) ascii encoder that does not like non-ascii chars.
> I suspect that if you encode to bytes first with an encoder that does
> work (latin-???), md5 will be happy.
>
> Reports like this should include Python version.
>
> > After googling, I've found BDFL and others on Py3K talking about the
> > problems of hashing non-bytes (i.e. buffers)
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html
>
> > So what is the canonical way to hash unicode?
> > * convert unicode to locale
> > * hash in current locale
> > ???
> > but what if the locale has ordinals outside of 128?
>
> > Is this just a problem for md5 hashes that I would not encounter
> > using a different method? i.e. Should I just use the built-in hash
> > function?

Python v2.5.2 -- however, this is not really a bug report, because your
analysis is correct. I am converting cp1252 strings to unicode before I
persist them in a database. I am looking for advice/direction/wisdom on
how to sling these strings.

-Jeff
--
http://mail.python.org/mailman/listinfo/python-list
Re: unicode and hashlib
On Nov 28, 1:24 pm, Scott David Daniels <[EMAIL PROTECTED]> wrote:
> Jeff H wrote:
> > hashlib.md5 does not appear to like unicode,
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > position 1650: ordinal not in range(128)
>
> > After googling, I've found BDFL and others on Py3K talking about the
> > problems of hashing non-bytes (i.e. buffers) ...
>
> Unicode is characters, not a character encoding.
> You could hash on a utf-8 encoding of the Unicode.
>
> > So what is the canonical way to hash unicode?
> > * convert unicode to locale
> > * hash in current locale
> > ???
>
> There is no _the_ way to hash Unicode, any more than there is _the_
> way to hash vectors. You need to convert the abstract entity to
> something concrete with a well-defined representation in bytes, and
> hash that.
>
> > Is this just a problem for md5 hashes that I would not encounter
> > using a different method? i.e. Should I just use the built-in hash
> > function?
>
> No, it is a definitional problem. Perhaps you could explain how you
> want to use the hash. If the internal hash is acceptable (e.g. for
> grouping in dictionaries within a single run), use that. If you intend
> to store and compare on the same system, say that. If you want cross-
> platform execution of your code to produce the same hashes, say that.
> A hash is a means to an end, and it is hard to give advice without
> knowing the goal.

I am checking for changes to large text objects stored in a database
against outside sources. So the hash needs to be reproducible/stable.

> --Scott David Daniels
> [EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list
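For Jeff's stated goal -- comparing database text against outside
sources -- the advice above boils down to hashing a pinned encoding.
A minimal Python 3 sketch, with a hypothetical helper name and sample
text (not from the thread):

```python
import hashlib

def text_fingerprint(text):
    """Reproducible, cross-platform fingerprint of a unicode string.

    Pinning the encoding to utf-8 makes the digest independent of
    locale, platform, and Python's default codec.
    """
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Stand-in for a large text object stored in the database.
stored = "Andr\xe9 and a broken bar \xa6, plus plenty more text..."
incoming = stored  # what the outside source sent this time

if text_fingerprint(incoming) == text_fingerprint(stored):
    print("unchanged -- skip the update")
else:
    print("changed -- refresh the database row")
```

The choice of md5 here only matters for change detection, not security;
any digest in hashlib would do as long as the encoding stays fixed.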
unicode and hashlib
hashlib.md5 does not appear to like unicode:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
position 1650: ordinal not in range(128)

After googling, I've found BDFL and others on Py3K talking about the
problems of hashing non-bytes (i.e. buffers):
http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html

So what is the canonical way to hash unicode?
* convert unicode to locale
* hash in current locale
???
but what if the locale has ordinals outside of 128?

Is this just a problem for md5 hashes that I would not encounter using
a different method? i.e. Should I just use the built-in hash function?
--
http://mail.python.org/mailman/listinfo/python-list
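The traceback in the question is not produced by md5 itself but by the
implicit 'ascii' encode Python 2 applied to the unicode object. That
step can be reproduced directly (here in Python 3, with u'\xa6' -- the
broken-bar character from the error message):

```python
import hashlib

text = "\xa6 some text with a non-ascii character"

# The same conversion Python 2's hashlib attempted implicitly: encoding
# with the 'ascii' codec, which cannot represent u'\xa6'.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("the failure from the question:", exc)

# An explicit encoding removes the guesswork, and md5 hashes the bytes:
print(hashlib.md5(text.encode("utf-8")).hexdigest())
```

This also answers the locale question: any locale-dependent encoding
makes the digest machine-dependent, while an explicit utf-8 encode
does not.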