John, I understand prime numbers and also channel programming. What I don't understand is the mathematical basis for the CAUSE of the clustering around prime factors of a composite modulo. You have explained what happens but not why. I agree with Robin's statement that real data is not random (or, if some prefer, real datums are not random).
E.g., the initial source of data for this discussion was some group of characters that are used in some real, living language. English has 26 letters and 10 numerical digits, so 36 was one possible modulo. If the 36 possible characters are seriously random, then there should be no clustering. Yet we find clustering. I aver that this is caused by quirks in the English language rather than a mathematical cause due simply to the fact that 36 is not prime, and that any clustering of any group of data hashed with any given modulo will always cluster around certain values that are intrinsically related to the underlying non-randomness of the input data. The most commonly used letters in English, in decreasing usage, are approximately ETAIONSHRDLU. If we assign unique values to these letters based on A=0, B=1, ..., Z=25, and then the numbers 0=26, 1=27, ... 9=35, then those same most common 12 letters convert to 4, 19, 0, 8, 14, 13, 18, 7, 17, 3, 11, and 20. Hashing a typical sample of English text will produce significant clustering around these 12 values, but the clustering has nothing to do with the primeness of the modulo value but everything to do with how English has evolved and the sequence of letters in our alphabet, which also is not random but has evolved over 3,000 or so years beginning with the sequence used in the ancient Phoenician alphabet. If we should see clustering around the values 2 and 3, which are prime factors of 36, then we should expect to see also a large percent of English words beginning with or containing the letters C and D. There are many such English words, but there are far more beginning with or containing all five vowels, e.g., none of which is a C or D, the other consonants T, N, S, etc. Another post described using the integers 1 through 1000 for input to hashing by dividing by 36. It seems to me that the sequence of modulo values produced will be exactly 27 sets of 1, 2, 3, ..., 35, 0 and then one final set having the values 1, 2, 3..., 27, 28. I don't see any evidence of clustering here, nor any reason why there would be 84 occurrences of any one value. Rather than belabor this issue any further, if you could post a link to some explanation of the mathematical proof why clustering occurs around prime factors of a composite modulo, I would love to read it and try to understand what is going on. Bill Fairchild Programmer Rocket Software 408 Chamberlain Park Lane . Franklin, TN 37069-2526 . USA t: +1.617.614.4503 . e: bfairch...@rocketsoftware.com . w: www.rocketsoftware.com -----Original Message----- From: IBM Mainframe Assembler List [mailto:ASSEMBLER-LIST@LISTSERV.UGA.EDU] On Behalf Of John Gilmore Sent: Friday, November 02, 2012 8:41 AM To: ASSEMBLER-LIST@LISTSERV.UGA.EDU Subject: Re: Curosity Question Martin Truebner (Trübner?) writes: <begin snippet> and I insist: the number of possible input values is 46656 and this is the number of output values exact 46656 (not more-fewer but identical) <end snippet> It is of course immediate that for a set of n characters the number of permutations of them taken m at a time is n^m and that, in particular, 36^3 = 46656. If one calculates N hash values one has or at least at some point had N hash values. The obvious disposed of, what is of interest here is the distribution of these values. Consider the function f(i) = mod(i,2) for i = 0, 1, 2, 3, . . . Evaluating it for the first N elements of this sequence certainly yields N values, but what is interesting about them is not that there are N of them but that, crudely, half of them are 0 and half of them are 1. Now one can of course arrange things otherwise. The function f(i) = mod(2i + 1,2), i = 0, 1, 2, 3, . . . always has the value 1, and the function f(i) = mod(2i,2), i = 0, 1, 2, 3, . . . always has the value 0. I entered this discussion because an examination of clustering side effects is necessary for any hashing scheme. They may be important, or again they may not be, but it is essential that they be evaluated. What I got for my trouble was a lot of guff, much of it specious and some of it literal nonsense, from people who lack a minimal grasp of the issues involved. I shall take this to heart. In the future I shall avoid discussing mathematical topics, in which I have some competence, here just as I avoid discussing channel-programming topics, in which I have none. --jg