Re: question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

Tim Chase Fri, 07 Aug 2009 10:21:30 -0700

> After I have written a short Python script that hashes my textfile line by
> line and collects the numbers next to the original, I checked what I got.
> Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.

That sounds suspiciously like 25% with a +/- 7% fluctuation one might expect to see from non-random source data.

Remember that your outputs are driven purely by your inputs in a deterministic fashion -- if your inputs are purely random, then your outputs should more closely match your expected bin'ing. If your inputs aren't random, you get a taste of your own medicine ("my file has just the number 42 on every line...why isn't my output random?"). And randomness-of-hash-output is a red herring since hashing is *not* random.
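To make the determinism point concrete, here is a minimal Python 3 sketch. It assumes the binning is done by taking the first byte of each line's MD5 digest modulo 4; that rule, the helper names bin_for and bin_shares, and the sample lines are all made up for illustration and are not taken from the original script.

```python
import hashlib
from collections import Counter

def bin_for(line, bins=4):
    """Map a line to a bin using the first byte of its MD5 digest (assumed rule)."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return digest[0] % bins

def bin_shares(lines, bins=4):
    """Return the fraction of lines that land in each bin."""
    counts = Counter(bin_for(line, bins) for line in lines)
    total = len(lines)
    return [counts[b] / total for b in range(bins)]

# Degenerate input: every line is identical, so every line lands in one bin.
same = ["42"] * 100
# Varied input: hypothetical, mutually distinct lines.
varied = ["line %d" % i for i in range(100)]

print(bin_shares(same))    # one bin holds everything
print(bin_shares(varied))  # spread over the bins, but not exactly 25% each
```

Running it twice gives identical numbers both times: the spread depends only on the lines fed in, not on any randomness in the hash.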

Your input is also finite -- an aspect which leaves you a far cry from the full hash-space. If an md5 has 16 bytes (128 bits, printed as 32 hex characters) of data, your input would have to cover 2**128 possible inputs to see the full profile of your outputs. That's a lot of input :)
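The size of an MD5 digest is easy to check with hashlib. This is just a quick sketch; the input b"example" is an arbitrary placeholder. The hex string is 32 characters long, and each pair of hex characters encodes one byte.

```python
import hashlib

md5 = hashlib.md5(b"example")  # arbitrary placeholder input

print(md5.digest_size)        # digest length in bytes: 16
print(len(md5.hexdigest()))   # length of the hex string: 32
print(md5.digest_size * 8)    # digest length in bits: 128
```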


-tkc




--
http://mail.python.org/mailman/listinfo/python-list
