Re: unicode and hashlib

2008-11-29 Thread Jeff H
On Nov 29, 12:23 pm, Scott David Daniels <[EMAIL PROTECTED]>
wrote:
> Scott David Daniels wrote:
>
> ...
>
> > If you now, and for all time, decide that the only source you will take
> > is cp1252, perhaps you should decode to cp1252 before hashing.
>
> Of course my dyslexia sticks out here as I get encode and decode exactly
> backwards -- Marc 'BlackJack' Rintsch has it right.
>
> Characters (a concept) are "encoded" to a byte format (representation).
> Bytes (a precise representation) are "decoded" to characters (a format
> with semantics).
>
> --Scott David Daniels
> [EMAIL PROTECTED]

Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize
that hashlib was trying to encode (not decode) my unicode object as
'ascii' (my default encoding), and since that produced characters with
ordinals >127 - shhh'boom.  So once I have character strings
transformed internally to unicode objects, I should encode them in
'utf-8' myself before handing them to anything that would otherwise
guess at the proper encoding (e.g. hashlib).

>>> a='André'   # NB: this terminal actually sent the utf-8 bytes 'Andr\xc3\xa9'
>>> b=unicode(a,'cp1252')
>>> b           # so the cp1252 decode yields mojibake; real cp1252 input would give u'Andr\xe9'
u'Andr\xc3\xa9'
>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'b4e5418a36bc4badfc47deb657a2b50c'

Scott then points out that utf-8 is probably superior (for use within
the code I control) to utf-16 and utf-32, each of which has two
byte-order variants, and which variant you get can depend on the
installed software and/or processor.  utf-8, unlike -16/-32, stays
reliable and reproducible irrespective of software or hardware.
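In today's Python 3, where str is always Unicode and bytes are always explicit, the same round trip can be sketched like this (a minimal sketch; `b'Andr\xe9'` stands in for genuine cp1252 input):

```python
import hashlib

raw = b'Andr\xe9'                      # genuine cp1252 bytes for 'André'
text = raw.decode('cp1252')            # decode: bytes -> str
digest = hashlib.md5(text.encode('utf-8')).hexdigest()  # encode, then hash
print(text, digest)
```

Because utf-8 has no byte-order variants, the digest comes out the same on any platform.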

decode vs encode
You decode from a byte string in some character encoding to a unicode object
You encode from a unicode object to a specified character encoding
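A minimal Python 3 sketch of the two directions (cp1252 chosen only as an example encoding):

```python
# encode: str (characters) -> bytes (representation)
encoded = 'André'.encode('cp1252')
assert encoded == b'Andr\xe9'

# decode: bytes (representation) -> str (characters)
decoded = encoded.decode('cp1252')
assert decoded == 'André'
```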

Please correct me if you see something wrong and thank you for your
advice and direction.

u'unicordial-ly yours. ;)'
Jeff
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-29 Thread Jeff H
On Nov 29, 8:27 am, Jeff H <[EMAIL PROTECTED]> wrote:
> On Nov 28, 2:03 pm, Terry Reedy <[EMAIL PROTECTED]> wrote:
>
>
>
> > Jeff H wrote:
> > > hashlib.md5 does not appear to like unicode,
> > >   UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > > position 1650: ordinal not in range(128)
>
> > It is the (default) ascii encoder that does not like non-ascii chars.
> > I suspect that if you encode to bytes first with an encoder that does
> > work (latin-???), md5 will be happy.
>
> > Reports like this should include Python version.
>
> > > After googling, I've found BDFL and others on Py3K talking about the
> > > problems of hashing non-bytes (i.e. buffers)
> > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html
>
> > > So what is the canonical way to hash unicode?
> > >  * convert unicode to local
> > >  * hash in current local
> > > ???
> > > but what if local has ordinals outside of 128?
>
> > > Is this just a problem for md5 hashes that I would not encounter using
> > > a different method?  i.e. Should I just use the built-in hash function?
>
> Python v2.5.2 -- however, this is not really a bug report, because your
> analysis is correct. I am converting cp1252 strings to unicode before
> I persist them in a database.  I am looking for
> advice/direction/wisdom on how to sling these strings.
>
> -Jeff

Actually, what I am surprised by is the fact that hashlib cares at
all about the encoding.  An md5 hash can be produced for an .iso file,
which means it can handle bytes, so why does it care what it is being
fed, as long as there are bytes?  I would have assumed that it would
take whatever was fed to it, view it as a byte array, and hash that.
You can read a binary file and hash it:
  print md5.new(file('foo.iso').read()).hexdigest()
What do I need to do to tell hashlib not to try and encode, and just
treat the data as binary?
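For the record, Python 3 later made hashlib's contract explicit: it only accepts bytes, and passing a str raises TypeError instead of attempting an implicit ascii encode. A minimal sketch of the behaviour:

```python
import hashlib

# raw bytes never need an encoding -- this is what reading an .iso gives you
hashlib.md5(b'\x00\xff\xa6')

# a str (unicode) has no single byte representation, so Python 3 refuses it
try:
    hashlib.md5('Andr\xe9')
except TypeError as exc:
    print('hashlib wants bytes:', exc)

# encoding explicitly picks the byte representation, and md5 is happy
print(hashlib.md5('Andr\xe9'.encode('utf-8')).hexdigest())
```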



Re: unicode and hashlib

2008-11-29 Thread Jeff H
On Nov 28, 2:03 pm, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Jeff H wrote:
> > hashlib.md5 does not appear to like unicode,
> >   UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > position 1650: ordinal not in range(128)
>
> It is the (default) ascii encoder that does not like non-ascii chars.
> I suspect that if you encode to bytes first with an encoder that does
> work (latin-???), md5 will be happy.
>
> Reports like this should include Python version.
>
> > After googling, I've found BDFL and others on Py3K talking about the
> > problems of hashing non-bytes (i.e. buffers)
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html
>
> > So what is the canonical way to hash unicode?
> >  * convert unicode to local
> >  * hash in current local
> > ???
> > but what if local has ordinals outside of 128?
>
> > Is this just a problem for md5 hashes that I would not encounter using
> > a different method?  i.e. Should I just use the built-in hash function?
>
>

Python v2.5.2 -- however, this is not really a bug report, because your
analysis is correct. I am converting cp1252 strings to unicode before
I persist them in a database.  I am looking for
advice/direction/wisdom on how to sling these strings.

-Jeff


Re: unicode and hashlib

2008-11-29 Thread Jeff H
On Nov 28, 1:24 pm, Scott David Daniels <[EMAIL PROTECTED]> wrote:
> Jeff H wrote:
> > hashlib.md5 does not appear to like unicode,
> >   UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > position 1650: ordinal not in range(128)
>
> > After googling, I've found BDFL and others on Py3K talking about the
> > problems of hashing non-bytes (i.e. buffers) ...
>
> Unicode is characters, not a character encoding.
> You could hash on a utf-8 encoding of the Unicode.
>
> > So what is the canonical way to hash unicode?
> >  * convert unicode to local
> >  * hash in current local
> > ???
>
> There is no _the_ way to hash Unicode, any more than
> there is _the_ way to hash vectors.  You need to
> convert the abstract entity to something concrete with
> a well-defined representation in bytes, and hash that.
>
> > Is this just a problem for md5 hashes that I would not encounter using
> > a different method?  i.e. Should I just use the built-in hash function?
>
> No, it is a definitional problem.  Perhaps you could explain how you
> want to use the hash.  If the internal hash is acceptable (e.g. for
> grouping in dictionaries within a single run), use that.  If you intend
> to store and compare on the same system, say that.  If you want cross-
> platform execution of your code to produce the same hashes, say that.
> A hash is a means to an end, and it is hard to give advice without
> knowing the goal.
>
I am checking for changes to large text objects stored in a database
against outside sources, so the hash needs to be reproducible/stable.
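For that use case, one reproducible recipe in Python 3 (a sketch; the `fingerprint` name and the NFC normalization step are my additions, the latter useful when outside sources may deliver the same text in different Unicode compositions):

```python
import hashlib
import unicodedata

def fingerprint(text):
    """Stable, cross-platform digest of a Unicode string."""
    normalized = unicodedata.normalize('NFC', text)   # unify compositions
    return hashlib.md5(normalized.encode('utf-8')).hexdigest()

# composed 'é' and 'e' + combining acute accent fingerprint identically
assert fingerprint('Andr\xe9') == fingerprint('Andre\u0301')
```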

> --Scott David Daniels
> [EMAIL PROTECTED]



unicode and hashlib

2008-11-28 Thread Jeff H
hashlib.md5 does not appear to like unicode,
  UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
position 1650: ordinal not in range(128)

After googling, I've found BDFL and others on Py3K talking about the
problems of hashing non-bytes (i.e. buffers)
http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html

So what is the canonical way to hash unicode?
 * convert unicode to the local encoding
 * hash in the current local encoding
???
but what if the text has ordinals outside of range(128)?

Is this just a problem for md5 hashes that I would not encounter using
a different method?  i.e. Should I just use the built-in hash function?
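As a footnote on that last option: in modern Python the built-in hash() of a str is salted per interpreter run, so it is only usable within a single process, while a digest over an explicitly encoded byte string is reproducible everywhere. A minimal Python 3 sketch:

```python
import hashlib

s = 'Andr\xe9'

# hash() is salted per run in modern Python -- fine for dict grouping,
# useless for persisting or comparing across machines or runs
print(hash(s))

# a digest over an explicit utf-8 encoding is reproducible everywhere
print(hashlib.md5(s.encode('utf-8')).hexdigest())
```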