On 16 Feb 2002, at 15:42, George Woltman wrote: > To put it in perspective, let's say you are looking for 64 bit factors > of a 10,000,000 bit number. > > The basic algorithm is: > Multiply 156,250 trial factors together to form a 10,000,000 bit > number. > do { > Multiply next 156,250 trial factors making 2nd 10M bit number. > Multiply the two 10,000,000 bit numbers together mod 2^P-1 > } while there are more 64-bit factors > GCD (10,000,000 bit product, 2^P-1)
My suggested modification groups things a bit differently: by and large we're working in powers of 2, so the first step of the inner loop would go something like Multiply 131072 64-bit numbers together in pairs Multiply the resulting 65536 128-bit products together in pairs ... Multiply the resulting four 4194404-bit products together in pairs Multiply the resulting two 8388608-bit products together modulo 2^p-1 Repeat the above for the next block of 131072 64-bit numbers Multiply the resulting two 10M-bit products together and reduce modulo 2^P-1 etc. etc. So for each block of 131072 numbers we have 65536 64-bit multiplications 32768 128-bit multiplications 16384 256-bit multiplications 8192 512-bit multiplications 4096 1024-bit multiplications 2048 2048-bit multiplications 1024 8192-bit multiplications 512 16384-bit multiplications 256 32768-bit multiplications 128 65536-bit multiplications 64 131072-bit multiplications 32 262144-bit multiplications 16 524288-bit multiplications 8 1048576-bit multiplications 4 2097152-bit multiplications 2 4194304-bit multiplications and one 8388608 x 10M bit multiplication modulo 2^p-1 There seems to be about the same amount of work in each level, nevertheless the lower levels should be faster since even FFT is O(n log n) and we can probably do better than FFT for the very smallest products. > > Looking at the inner loop the second step uses the fast LL multiply code. > On my P4-1400 this takes 0.06 seconds or 84 million clocks or (dividing > by 156,250) 538 clocks per trial factor. This is quite promising. I haven't > measured it but the current code can test one trial factor in about 4000 > clocks. My scheme has 17 levels so the upper bound of the _average_ time per trial factor should be 17 * 538 clocks. This is too darned slow but, as I indicated above, the actual execution time should be significantly less than that. But I thought the time to actually test a 64-bit factor - not counting sieving overheads - was less than 2000 clocks ??? > > The stumbling block is how can we quickly do the first step of the inner > loop - multiplying 156,250 bit trial factors together. If someone has a > clever algorithm that can do this in fewer than 3500 clocks per trial factor > then we may have an improved factoring algorithm! The other point I was trying to make is that a similar algorithm could be used to compute the product of small primes used by P-1 stage 1 (and maybe stage 2 as well) with a highly significant improvement over the current method. There may also be an application in ECM. Regards Brian Beesley Regards Brian Beesley _________________________________________________________________________ Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers