question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?
Hi all, I am a Python novice, and right now I would be happy to simply get my job done with it, but I could appreciate some thoughts on the issue below. I need to assign one of four numbers to names in a list. The assignment should be pseudo-random: no pattern whatsoever, but deterministic, reproducible, and close to uniform. My understanding was that hash functions would do the job. As I only needed 2 bits of treatment, I picked a byte of the hashes generated, and even taken mod 4 of it. See the code below. After I have written a short Python script that hashes my textfile line by line and collects the numbers next to the original, I checked what I got. Instead of getting around 25% in each treatment, the range is 17.8%-31.3%. I understand that the pseudo-randomness means that the treatments should not be neat and symmetric. Still, this variation is unacceptable for my purpose. My understanding was that good hash functions generate numbers that look completely random, and picking only a byte should not change that. I thought the promise was also to get close to uniformity: http://en.wikipedia.org/wiki/Hash_function#Uniformity. I tried all the hashes in the hashlib module, and picked bytes from the beginning and the end of the hashes, but treatments never were close to uniform (curiously, always the last treatment seems to be too rare). Maybe it is an obvious CS puzzle, I'm looking forward to standing corrected. Thanks! Laszlo The script: #! /usr/bin/python f = open('names.txt', 'r') g = open('nameshashed.txt', 'a') import hashlib for line in f: line = line.rstrip() h = str(hashlib.sha512(line).hexdigest()) s = line + ',' + str(ord(h[64])%4) + '\n' g.write(s), -- http://mail.python.org/mailman/listinfo/python-list
Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?
After I have written a short Python script that hashes my textfile line by line and collects the numbers next to the original, I checked what I got. Instead of getting around 25% in each treatment, the range is 17.8%-31.3%. That sounds suspiciously like 25% with a +/- 7% fluctuation one might expect to see from non-random source data. Remember that your outputs are driven purely by your inputs in a deterministic fashion -- if your inputs are purely random, then your outputs should more closely match your expected bin'ing. If your inputs aren't random, you get a taste of your own medicine (my file has just the number 42 on every line...why isn't my output random?). And randomness-of-hash-output is a red herring since hashing is *not* random. Your input is also finite -- an aspect which leaves you a far cry from the full hash-space. If an md5 has 32 bytes (256 bits) of data, your input would have to cover 2**256 possible inputs to see the full profile of your outputs. That's a lot of input :) -tkc -- http://mail.python.org/mailman/listinfo/python-list
Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?
Thank you, Tim. My comments are below. On 2009-08-07 13:19:47 -0400, Tim Chase python.l...@tim.thechases.com said: After I have written a short Python script that hashes my textfile line by line and collects the numbers next to the original, I checked what I got. Instead of getting around 25% in each treatment, the range is 17.8%-31.3%. That sounds suspiciously like 25% with a +/- 7% fluctuation one might expect to see from non-random source data. Could you help me where this range comes from? (If all my lines were 42, I wouldn't hit this range. So it cannot be a rule. Right?) Remember that your outputs are driven purely by your inputs in a deterministic fashion -- if your inputs are purely random, then your outputs should more closely match your expected bin'ing. If your inputs aren't random, you get a taste of your own medicine (my file has just the number 42 on every line...why isn't my output random?). And randomness-of-hash-output is a red herring since hashing is *not* random. Thanks, I tried to be correct with pseudo-random. I understand that everything is dependent on input. I want it to be the case. However, I hoped that good hashes produce random-looking output from input with very little variation. It would be strange if I could not get more than 18% of lines with a remainder of 3 (after division by 4), whatever hash I try just because these are names of people. Your input is also finite -- an aspect which leaves you a far cry from the full hash-space. If an md5 has 32 bytes (256 bits) of data, your input would have to cover 2**256 possible inputs to see the full profile of your outputs. That's a lot of input :) -tkc OK, I understand. Could anyone suggest a better way to do this, then? (Recap: random-looking, close-to uniform assignment of one number out of four possibilities to strings.) Thanks, everyone. Laszlo -- http://mail.python.org/mailman/listinfo/python-list
Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?
László Sándor wrote: Thank you, Tim. My comments are below. On 2009-08-07 13:19:47 -0400, Tim Chase python.l...@tim.thechases.com said: After I have written a short Python script that hashes my textfile line by line and collects the numbers next to the original, I checked what I got. Instead of getting around 25% in each treatment, the range is 17.8%-31.3%. That sounds suspiciously like 25% with a +/- 7% fluctuation one might expect to see from non-random source data. Could you help me where this range comes from? (If all my lines were 42, I wouldn't hit this range. So it cannot be a rule. Right?) Remember that your outputs are driven purely by your inputs in a deterministic fashion -- if your inputs are purely random, then your outputs should more closely match your expected bin'ing. If your inputs aren't random, you get a taste of your own medicine (my file has just the number 42 on every line...why isn't my output random?). And randomness-of-hash-output is a red herring since hashing is *not* random. Thanks, I tried to be correct with pseudo-random. I understand that everything is dependent on input. I want it to be the case. However, I hoped that good hashes produce random-looking output from input with very little variation. It would be strange if I could not get more than 18% of lines with a remainder of 3 (after division by 4), whatever hash I try just because these are names of people. Your input is also finite -- an aspect which leaves you a far cry from the full hash-space. If an md5 has 32 bytes (256 bits) of data, your input would have to cover 2**256 possible inputs to see the full profile of your outputs. That's a lot of input :) -tkc OK, I understand. Could anyone suggest a better way to do this, then? (Recap: random-looking, close-to uniform assignment of one number out of four possibilities to strings.) Thanks, everyone. Laszlo Well, a very simplistic method is to use the length of the input string modulus four. If the names have decently random lengths, that may work. ~Ethan~ -- http://mail.python.org/mailman/listinfo/python-list
Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?
L wrote: Hi all, I am a Python novice, and right now I would be happy to simply get my job done with it, but I could appreciate some thoughts on the issue below. I need to assign one of four numbers to names in a list. The assignment should be pseudo-random: no pattern whatsoever, but deterministic, reproducible, and close to uniform. My understanding was that hash functions would do the job. As I only needed 2 bits of treatment, I picked a byte of the hashes generated, and even taken mod 4 of it. See the code below. After I have written a short Python script that hashes my textfile line by line and collects the numbers next to the original, I checked what I got. Instead of getting around 25% in each treatment, the range is 17.8%-31.3%. I understand that the pseudo-randomness means that the treatments should not be neat and symmetric. Still, this variation is unacceptable for my purpose. My understanding was that good hash functions generate numbers that look completely random, and picking only a byte should not change that. I thought the promise was also to get close to uniformity: http://en.wikipedia.org/wiki/Hash_function#Uniformity. I tried all the hashes in the hashlib module, and picked bytes from the beginning and the end of the hashes, but treatments never were close to uniform (curiously, always the last treatment seems to be too rare). Maybe it is an obvious CS puzzle, I'm looking forward to standing corrected. Thanks! Laszlo The script: #! /usr/bin/python f = open('names.txt', 'r') g = open('nameshashed.txt', 'a') import hashlib for line in f: line = line.rstrip() h = str(hashlib.sha512(line).hexdigest()) s = line + ',' + str(ord(h[64])%4) + '\n' g.write(s), The problem is indeed simple, but really nothing to do with Python. You're using a single character from a hexdigest, so the only possibilities are 0,1,2,...9,A,B, C, D,E, and F Taking the ords of each of those, and moduloing with 4, you get 4 - 0 5 - 1 4 - 2 3 - 3 In other words, there's a greater than expected likelihood of 1, and less than expected likelihood of 3. I'd figure the distribution ought to be: 25%, 31%, 25%, and 19% I'd use digest() instead of hexdigest(), and of course reduce the subscript to 63 or less. DaveA -- http://mail.python.org/mailman/listinfo/python-list
Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?
Dave Angel wrote: [clever analysis snipped] I'd use digest() instead of hexdigest(), and of course reduce the subscript to 63 or less. OP: You could also try hash(line) % 4 While AFAIK it doesn't make promises about randomness it might still be good enough in practice. Peter -- http://mail.python.org/mailman/listinfo/python-list
Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?
László Sándor sand...@gmail.com writes: OK, I understand. Could anyone suggest a better way to do this, then? (Recap: random-looking, close-to uniform assignment of one number out of four possibilities to strings.) Use a cryptographic hash function like md5 (deprecated for security purposes but should be ok for this application). The built-in Python hash function is made for hashing symbols and is designed to avoid collisions when you have a bunch of sequential identifiers (a,b,c...). That is, it deliberately does NOT resemble a random function, which would have a certain number of collisions by chance. -- http://mail.python.org/mailman/listinfo/python-list