Hi all, I am a Python novice, and right now I would be happy simply to get my job done with it, but I would appreciate some thoughts on the issue below.
I need to assign one of four numbers to each name in a list. The assignment should be pseudo-random: no pattern whatsoever, but deterministic, reproducible, and close to uniform. My understanding was that hash functions would do the job. As I only need 2 bits for the treatment, I picked a byte of each generated hash and took it mod 4 (see the script below).

After writing a short Python script that hashes my text file line by line and writes the number next to the original name, I checked what I got. Instead of around 25% in each treatment, the shares range from 17.8% to 31.3%. I understand that pseudo-randomness means the treatments will not be neat and symmetric. Still, this much variation is unacceptable for my purpose.

My understanding was that good hash functions generate numbers that look completely random, and that picking only one byte should not change that. I thought the promise was also to get close to uniformity: http://en.wikipedia.org/wiki/Hash_function#Uniformity. I tried all the hashes in the hashlib module, and picked bytes from the beginning and the end of the hashes, but the treatments were never close to uniform (curiously, the last treatment always seems to be too rare).

Maybe it is an obvious CS puzzle; I'm looking forward to standing corrected.

Thanks!
Laszlo

The script:

    #! /usr/bin/python
    import hashlib

    f = open('names.txt', 'r')
    g = open('nameshashed.txt', 'a')
    for line in f:
        line = line.rstrip()
        # hex digest of the name, as a string
        h = str(hashlib.sha512(line).hexdigest())
        # character at index 64 of the hex string, ord() taken mod 4
        s = line + ',' + str(ord(h[64])%4) + '\n'
        g.write(s)
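One possible reading of the numbers above, offered as a sketch and not a confirmed diagnosis: `hexdigest()` returns a string of hexadecimal characters, so `h[64]` is one of the 16 characters `0`-`9`, `a`-`f` and not a raw byte. Taking `ord()` of such a character mod 4 does not split the 16 characters evenly, which may explain why the last treatment looks rare. The Python 3 code below counts that split and, for comparison, takes a raw byte from `digest()` instead. The function names `hex_char_mod4_counts` and `treatment_from_bytes` are hypothetical, and the use of Python 3 and UTF-8 encoding is an assumption; the original script is written for Python 2.

```python
import hashlib
from collections import Counter

HEX_CHARS = "0123456789abcdef"


def hex_char_mod4_counts():
    """Count how the 16 possible hex characters fall into ord(c) % 4."""
    return dict(Counter(ord(c) % 4 for c in HEX_CHARS))


def treatment_from_bytes(name, groups=4):
    """Hypothetical alternative: use a raw digest byte (0..255) mod groups.

    256 is divisible by 4, so a uniformly distributed byte gives
    equally likely treatments 0..3.
    """
    digest = hashlib.sha512(name.encode("utf-8")).digest()  # bytes, not hex text
    return digest[0] % groups


if __name__ == "__main__":
    counts = hex_char_mod4_counts()
    for treatment in sorted(counts):
        share = 100.0 * counts[treatment] / len(HEX_CHARS)
        print(treatment, counts[treatment], "%.2f%%" % share)
    # Expected shares from the hex-character approach:
    # 0 -> 25.00%, 1 -> 31.25%, 2 -> 25.00%, 3 -> 18.75%
    print(treatment_from_bytes("Laszlo"))
```

If this reading is right, the expected shares from the hex-character approach (18.75% to 31.25%) are close to the observed 17.8% to 31.3%, and the skew would appear with any hash in hashlib, since it comes from the character encoding and not from the hash itself.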