question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

2009-08-07 Thread László Sándor
Hi all,
I am a Python novice, and right now I would be happy to simply get my job
done with it, but I could appreciate some thoughts on the issue below.

I need to assign one of four numbers to names in a list. The assignment
should be pseudo-random: no pattern whatsoever, but deterministic,
reproducible, and close to uniform. My understanding was that hash functions
would do the job. As I only needed 2 bits of treatment, I picked a byte of
the hashes generated, and even taken mod 4 of it. See the code below.

After I have written a short Python script that hashes my textfile line by
line and collects the numbers next to the original, I checked what I got.
Instead of getting around 25% in each treatment, the range is 17.8%-31.3%. I
understand that the pseudo-randomness means that the treatments should not
be neat and symmetric. Still, this variation is unacceptable for my purpose.
My understanding was that good hash functions generate numbers that look
completely random, and picking only a byte should not change that. I thought
the promise was also to get close to uniformity:
http://en.wikipedia.org/wiki/Hash_function#Uniformity. I tried all the
hashes in the hashlib module, and picked bytes from the beginning and the
end of the hashes, but treatments never were close to uniform (curiously,
always the last treatment seems to be too rare).

Maybe it is an obvious CS puzzle, I'm looking forward to standing corrected.

Thanks!

Laszlo

The script:

#! /usr/bin/python

f = open('names.txt', 'r')
g = open('nameshashed.txt', 'a')

import hashlib

for line in f:
line = line.rstrip()
h = str(hashlib.sha512(line).hexdigest())
s = line + ',' + str(ord(h[64])%4) + '\n'
g.write(s),
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

2009-08-07 Thread Tim Chase

After I have written a short Python script that hashes my textfile line by
line and collects the numbers next to the original, I checked what I got.
Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.


That sounds suspiciously like 25% with a +/- 7% fluctuation one 
might expect to see from non-random source data.


Remember that your outputs are driven purely by your inputs in a 
deterministic fashion -- if your inputs are purely random, then 
your outputs should more closely match your expected bin'ing.  If 
your inputs aren't random, you get a taste of your own medicine 
(my file has just the number 42 on every line...why isn't my 
output random?).  And randomness-of-hash-output is a red herring 
since hashing is *not* random.


Your input is also finite -- an aspect which leaves you a far cry 
from the full hash-space.  If an md5 has 32 bytes (256 bits) of 
data, your input would have to cover 2**256 possible inputs to 
see the full profile of your outputs.  That's a lot of input :)


-tkc




--
http://mail.python.org/mailman/listinfo/python-list


Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

2009-08-07 Thread László Sándor

Thank you, Tim. My comments are below.

On 2009-08-07 13:19:47 -0400, Tim Chase python.l...@tim.thechases.com said:


After I have written a short Python script that hashes my textfile line by
line and collects the numbers next to the original, I checked what I got.
Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.


That sounds suspiciously like 25% with a +/- 7% fluctuation one might 
expect to see from non-random source data.


Could you help me where this range comes from? (If all my lines were 
42, I wouldn't hit this range. So it cannot be a rule. Right?)




Remember that your outputs are driven purely by your inputs in a 
deterministic fashion -- if your inputs are purely random, then your 
outputs should more closely match your expected bin'ing.  If your 
inputs aren't random, you get a taste of your own medicine (my file 
has just the number 42 on every line...why isn't my output random?).  
And randomness-of-hash-output is a red herring since hashing is *not* 
random.


Thanks, I tried to be correct with pseudo-random. I understand that 
everything is dependent on input. I want it to be the case. However, I 
hoped that good hashes produce random-looking output from input with 
very little variation. It would be strange if I could not get more than 
18% of lines with a remainder of 3 (after division by 4), whatever hash 
I try just because these are names of people.




Your input is also finite -- an aspect which leaves you a far cry from 
the full hash-space.  If an md5 has 32 bytes (256 bits) of data, your 
input would have to cover 2**256 possible inputs to see the full 
profile of your outputs.  That's a lot of input :)


-tkc


OK, I understand. Could anyone suggest a better way to do this, then?

(Recap: random-looking, close-to uniform assignment of one number out 
of four possibilities to strings.)


Thanks, everyone.

Laszlo

--
http://mail.python.org/mailman/listinfo/python-list


Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

2009-08-07 Thread Ethan Furman

László Sándor wrote:

Thank you, Tim. My comments are below.

On 2009-08-07 13:19:47 -0400, Tim Chase python.l...@tim.thechases.com 
said:


After I have written a short Python script that hashes my textfile 
line by
line and collects the numbers next to the original, I checked what I 
got.
Instead of getting around 25% in each treatment, the range is 
17.8%-31.3%.



That sounds suspiciously like 25% with a +/- 7% fluctuation one might 
expect to see from non-random source data.



Could you help me where this range comes from? (If all my lines were 
42, I wouldn't hit this range. So it cannot be a rule. Right?)




Remember that your outputs are driven purely by your inputs in a 
deterministic fashion -- if your inputs are purely random, then your 
outputs should more closely match your expected bin'ing.  If your 
inputs aren't random, you get a taste of your own medicine (my file 
has just the number 42 on every line...why isn't my output random?).  
And randomness-of-hash-output is a red herring since hashing is *not* 
random.



Thanks, I tried to be correct with pseudo-random. I understand that 
everything is dependent on input. I want it to be the case. However, I 
hoped that good hashes produce random-looking output from input with 
very little variation. It would be strange if I could not get more than 
18% of lines with a remainder of 3 (after division by 4), whatever hash 
I try just because these are names of people.




Your input is also finite -- an aspect which leaves you a far cry from 
the full hash-space.  If an md5 has 32 bytes (256 bits) of data, your 
input would have to cover 2**256 possible inputs to see the full 
profile of your outputs.  That's a lot of input :)


-tkc



OK, I understand. Could anyone suggest a better way to do this, then?

(Recap: random-looking, close-to uniform assignment of one number out of 
four possibilities to strings.)


Thanks, everyone.

Laszlo



Well,  a very simplistic method is to use the length of the input string 
modulus four.  If the names have decently random lengths, that may work.


~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list


Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

2009-08-07 Thread Dave Angel

L wrote:

Hi all,
I am a Python novice, and right now I would be happy to simply get my job
done with it, but I could appreciate some thoughts on the issue below.

I need to assign one of four numbers to names in a list. The assignment
should be pseudo-random: no pattern whatsoever, but deterministic,
reproducible, and close to uniform. My understanding was that hash functions
would do the job. As I only needed 2 bits of treatment, I picked a byte of
the hashes generated, and even taken mod 4 of it. See the code below.

After I have written a short Python script that hashes my textfile line by
line and collects the numbers next to the original, I checked what I got.
Instead of getting around 25% in each treatment, the range is 17.8%-31.3%. I
understand that the pseudo-randomness means that the treatments should not
be neat and symmetric. Still, this variation is unacceptable for my purpose.
My understanding was that good hash functions generate numbers that look
completely random, and picking only a byte should not change that. I thought
the promise was also to get close to uniformity:
http://en.wikipedia.org/wiki/Hash_function#Uniformity. I tried all the
hashes in the hashlib module, and picked bytes from the beginning and the
end of the hashes, but treatments never were close to uniform (curiously,
always the last treatment seems to be too rare).

Maybe it is an obvious CS puzzle, I'm looking forward to standing corrected.

Thanks!

Laszlo

The script:

#! /usr/bin/python

f = open('names.txt', 'r')
g = open('nameshashed.txt', 'a')

import hashlib

for line in f:
line = line.rstrip()
h = str(hashlib.sha512(line).hexdigest())
s = line + ',' + str(ord(h[64])%4) + '\n'
g.write(s),

  
The problem is indeed simple, but really nothing to do with Python.  
You're using a single character from a hexdigest, so the only 
possibilities are 0,1,2,...9,A,B, C, D,E, and F


Taking the ords of each of those, and moduloing with 4, you get
4 - 0
5 - 1
4 - 2
3 - 3

In other words, there's a greater than expected likelihood of 1, and 
less than expected likelihood of 3.  I'd figure the distribution ought 
to be:

25%,  31%, 25%, and 19%

I'd use digest() instead of hexdigest(), and of course reduce the 
subscript  to 63 or less.


DaveA
--
http://mail.python.org/mailman/listinfo/python-list


Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

2009-08-07 Thread Peter Otten
Dave Angel wrote:

[clever analysis snipped]

 I'd use digest() instead of hexdigest(), and of course reduce the
 subscript  to 63 or less.

OP: You could also try

hash(line) % 4

While AFAIK it doesn't make promises about randomness it might still be good 
enough in practice.

Peter


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

2009-08-07 Thread Paul Rubin
László Sándor sand...@gmail.com writes:
 OK, I understand. Could anyone suggest a better way to do this, then?
 
 (Recap: random-looking, close-to uniform assignment of one number out
 of four possibilities to strings.)

Use a cryptographic hash function like md5 (deprecated for security
purposes but should be ok for this application).  The built-in Python
hash function is made for hashing symbols and is designed to avoid
collisions when you have a bunch of sequential identifiers (a,b,c...).
That is, it deliberately does NOT resemble a random function, which
would have a certain number of collisions by chance.
-- 
http://mail.python.org/mailman/listinfo/python-list