> After I have written a short Python script that hashes my textfile line by
> line and collects the numbers next to the original, I checked what I got.
> Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.

That sounds suspiciously like 25% with a +/- 7% fluctuation one might expect to see from non-random source data.

Remember that your outputs are driven purely by your inputs in a deterministic fashion -- if your inputs are effectively random, your outputs should match your expected binning more closely. If your inputs aren't random, that skew carries straight through to the output ("my file has just the number 42 on every line...why isn't my output random?"). And randomness of the hash output is a red herring, since hashing is *not* random: identical lines always produce identical hashes and therefore land in the same bin.
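To make the "42 on every line" case concrete, here is a minimal sketch. It assumes the treatment is picked by taking the MD5 digest of each line modulo 4; the original script wasn't posted, so `treatment`, `bin_shares` and the sample inputs are made-up names and data, not the OP's code.

```python
import hashlib
from collections import Counter


def treatment(line, bins=4):
    """Map a line to a bin number via its MD5 digest (assumed scheme)."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % bins


def bin_shares(lines, bins=4):
    """Return the fraction of lines that fall into each bin."""
    counts = Counter(treatment(line, bins) for line in lines)
    total = sum(counts.values())
    if total == 0:
        return {b: 0.0 for b in range(bins)}
    return {b: counts.get(b, 0) / total for b in range(bins)}


# Hypothetical inputs: one file with the same line repeated,
# one file where every line is different.
constant = ["42"] * 1000
varied = ["record-%d" % i for i in range(1000)]

print(bin_shares(constant))  # every line lands in a single bin
print(bin_shares(varied))    # spread over the bins, but not exactly 25% each
```

The first file puts 100% of its lines into one bin because the same input always hashes to the same value. The second spreads out, and how far it strays from 25% depends on how many distinct lines there are.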

Your input is also finite -- an aspect which leaves you a far cry from the full hash-space. An md5 digest has 16 bytes (128 bits) of data, usually printed as 32 hex characters, so your input would have to cover on the order of 2**128 distinct inputs to see the full profile of your outputs. That's a lot of input :)
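If you want to check the digest size yourself, a quick sketch with the standard `hashlib` module (the sample bytes `b"example"` are arbitrary):

```python
import hashlib

h = hashlib.md5(b"example")

digest_bytes = len(h.digest())    # raw digest length in bytes
digest_bits = digest_bytes * 8    # same length in bits
hex_chars = len(h.hexdigest())    # length of the printable hex form
hash_space = 2 ** digest_bits     # number of distinct possible digests

print(digest_bytes, digest_bits, hex_chars)
print(hash_space)
```

The hex form is twice as long as the raw digest because each byte takes two hex characters.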

-tkc




--
http://mail.python.org/mailman/listinfo/python-list
