John,

I understand prime numbers and also channel programming.  What I don't 
understand is the mathematical basis for the CAUSE of the clustering around 
prime factors of a composite modulo.  You have explained what happens but not 
why.  I agree with Robin's statement that real data is not random (or, if some 
prefer, real datums are not random).

E.g., the initial source of data for this discussion was some group of 
characters that are used in some real, living language.  English has 26 letters 
and 10 numerical digits, so 36 was one possible modulo.  If the 36 possible 
characters are seriously random, then there should be no clustering.  Yet we 
find clustering.  I aver that this is caused by quirks in the English language 
rather than a mathematical cause due simply to the fact that 36 is not prime, 
and that any clustering of any group of data hashed with any given modulo will 
always cluster around certain values that are intrinsically related to the 
underlying non-randomness of the input data.

The most commonly used letters in English, in decreasing usage, are 
approximately ETAIONSHRDLU.  If we assign unique values to these letters based 
on A=0, B=1, ..., Z=25, and then the numbers 0=26, 1=27, ... 9=35, then those 
same most common 12 letters convert to 4, 19, 0, 8, 14, 13, 18, 7, 17, 3, 11, 
and 20.  Hashing a typical sample of English text will produce significant 
clustering around these 12 values, but the clustering has nothing to do with 
the primeness of the modulo value but everything to do with how English has 
evolved and the sequence of letters in our alphabet, which also is not random 
but has evolved over 3,000 or so years beginning with the sequence used in the 
ancient Phoenician alphabet.

If we should see clustering around the values 2 and 3, which are prime factors 
of 36, then we should expect to see also a large percent of English words 
beginning with or containing the letters C and D.  There are many such English 
words, but there are far more beginning with or containing all five vowels, 
e.g., none of which is a C or D, the other consonants T, N, S, etc.

Another post described using the integers 1 through 1000 for input to hashing 
by dividing by 36.  It seems to me that the sequence of modulo values produced 
will be exactly 27 sets of 1, 2, 3, ..., 35, 0 and then one final set having 
the values 1, 2, 3..., 27, 28.  I don't see any evidence of clustering here, 
nor any reason why there would be 84 occurrences of any one value.

Rather than belabor this issue any further, if you could post a link to some 
explanation of the mathematical proof why clustering occurs around prime 
factors of a composite modulo, I would love to read it and try to understand 
what is going on.

Bill Fairchild
Programmer
Rocket Software
408 Chamberlain Park Lane . Franklin, TN 37069-2526 . USA
t: +1.617.614.4503 .  e: bfairch...@rocketsoftware.com . w: 
www.rocketsoftware.com


-----Original Message-----
From: IBM Mainframe Assembler List [mailto:ASSEMBLER-LIST@LISTSERV.UGA.EDU] On 
Behalf Of John Gilmore
Sent: Friday, November 02, 2012 8:41 AM
To: ASSEMBLER-LIST@LISTSERV.UGA.EDU
Subject: Re: Curosity Question

Martin Truebner (Trübner?) writes:

<begin snippet>
and I insist: the number of possible input values is 46656 and this is the 
number of output values exact 46656 (not more-fewer but identical) <end snippet>

It is of course immediate that for a set of n characters the number of 
permutations of them taken m at a  time is n^m and that, in particular, 36^3 = 
46656.

If one calculates N hash values one has or at least at some point had N hash 
values.  The obvious disposed of, what is of interest here is the distribution 
of these values.

Consider the function f(i) = mod(i,2) for i = 0, 1, 2, 3,  . . .
Evaluating it for the first N elements of this sequence certainly yields N 
values, but what is interesting about them is not that there are N of them but 
that, crudely, half of them are 0 and half of them are 1.

Now one can of course arrange things otherwise.  The function

f(i) = mod(2i + 1,2), i = 0, 1, 2, 3, . . .

always has the value 1, and the function

f(i) = mod(2i,2), i = 0, 1, 2, 3, . . .

always has the value 0.

I entered this discussion because an examination of clustering side effects is 
necessary for any hashing scheme.  They may be important, or again they may not 
be, but it is essential that they be evaluated.

What I got for my trouble was a lot of guff, much of it specious and some of it 
literal nonsense, from people who lack a minimal grasp of the issues involved.  
I shall take this to heart.  In the future I shall avoid discussing 
mathematical topics, in which I have some competence, here just as I avoid 
discussing channel-programming topics, in which I have none.

--jg

Reply via email to