On Nov 29, 12:23 pm, Scott David Daniels <[EMAIL PROTECTED]> wrote:
> Scott David Daniels wrote:
> > ...
> >
> > > If you now, and for all time, decide that the only source you will take
> > > is cp1252, perhaps you should decode to cp1252 before hashing.
>
> Of course my dyslexia sticks out here as I get encode and decode exactly
> backwards -- Marc 'BlackJack' Rintsch has it right.
>
> Characters (a concept) are "encoded" to a byte format (representation).
> Bytes (a precise representation) are "decoded" to characters (a format
> with semantics).
>
> --Scott David Daniels
> [EMAIL PROTECTED]
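[As an aside, not from Scott's post: in modern Python 3 terms, where the `str` type is already unicode, his encode/decode rule can be sketched like this.]

```python
# Characters (str in Python 3) are *encoded* to bytes.
text = "André"
data = text.encode("utf-8")   # str -> bytes: b'Andr\xc3\xa9'

# Bytes are *decoded* back to characters.
roundtrip = data.decode("utf-8")  # bytes -> str
assert roundtrip == text
```

The direction is the easy mnemonic: you can only encode *from* characters and only decode *from* bytes; going the wrong way raises an error in Python 3 rather than silently guessing.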
Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize
that hashlib was trying to encode (not decode) my unicode object as
'ascii' (my default encoding), and since that resulted in characters >128
-- shhh'boom.  So once I have character strings transformed internally to
unicode objects, I should encode them in 'utf-8' before handing them to
anything that would otherwise guess at the proper encoding for further
processing (i.e. hashlib).

>>> a = 'André'
>>> b = unicode(a, 'cp1252')
>>> b
u'Andr\xc3\xa9'
>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'b4e5418a36bc4badfc47deb657a2b50c'

Scott then points out that utf-8 is probably superior (for use within the
code I control) to utf-16 and utf-32, which both have two variants, and
which variant you get can depend on the installed software and/or the
processor.  utf-8, unlike -16/-32, stays reliable and reproducible
irrespective of software or hardware.

decode vs encode:
You decode from one character set to a unicode object.
You encode from a unicode object to a specified character set.

Please correct me if you see something wrong, and thank you for your
advice and direction.

u'unicordial-ly yours. ;)'
Jeff
--
http://mail.python.org/mailman/listinfo/python-list
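[A small Python 3 sketch, not from the thread, of why the two-variant point matters for hashing. The string and digest call are illustrative, not the exact values from the session above.]

```python
import hashlib

name = "André"

# utf-16 has two byte orders; the same characters yield different
# byte strings depending on which variant is in effect, so any hash
# of those bytes differs with the platform's choice.
little = name.encode("utf-16-le")
big = name.encode("utf-16-be")
assert little != big

# utf-8 has exactly one form, independent of byte order, so the
# digest is reproducible on any software or hardware.
digest = hashlib.md5(name.encode("utf-8")).hexdigest()
```

Hashing the utf-8 bytes (rather than letting the library fall back on the default codec) is exactly the fix applied in the session above.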