On 16 Feb 2002, at 15:42, George Woltman wrote:

> To put it in perspective, let's say you are looking for 64 bit factors
> of a 10,000,000 bit number.
> 
> The basic algorithm is:
>          Multiply 156,250 trial factors together to form a 10,000,000 bit 
> number.
>          do {
>                  Multiply next 156,250 trial factors making 2nd 10M bit number.
>                  Multiply the two 10,000,000 bit numbers together mod 2^P-1
>          } while there are more 64-bit factors
>          GCD (10,000,000 bit product, 2^P-1)

My suggested modification groups things a bit differently: by and 
large we're working in powers of 2, so the first step of the inner loop 
would go something like

Multiply 131072 64-bit numbers together in pairs
Multiply the resulting 65536 128-bit products together in pairs
...
Multiply the resulting four 4194404-bit products together in pairs
Multiply the resulting two 8388608-bit products together modulo 
2^p-1

Repeat the above for the next block of 131072 64-bit numbers

Multiply the resulting two 10M-bit products together and reduce 
modulo 2^P-1

etc. etc.

So for each block of 131072 numbers we have
65536 64-bit multiplications
32768 128-bit multiplications
16384 256-bit multiplications
8192 512-bit multiplications
4096 1024-bit multiplications
2048 2048-bit multiplications
1024 8192-bit multiplications
512 16384-bit multiplications
256 32768-bit multiplications
128 65536-bit multiplications
64 131072-bit multiplications
32 262144-bit multiplications
16 524288-bit multiplications
8 1048576-bit multiplications
4 2097152-bit multiplications
2 4194304-bit multiplications
and one 8388608 x 10M bit multiplication modulo 2^p-1

There seems to be about the same amount of work in each level, 
nevertheless the lower levels should be faster since even FFT is 
O(n log n) and we can probably do better than FFT for the very 
smallest products.
> 
> Looking at the inner loop the second step uses the fast LL multiply code.
> On my P4-1400 this takes 0.06 seconds or 84 million clocks or (dividing
> by 156,250) 538 clocks per trial factor.  This is quite promising.  I haven't
> measured it but the current code can test one trial factor in about 4000
> clocks.

My scheme has 17 levels so the upper bound of the _average_ 
time per trial factor should be 17 * 538 clocks. This is too darned 
slow but, as I indicated above, the actual execution time should be 
significantly less than that.

But I thought the time to actually test a 64-bit factor - not counting 
sieving overheads - was less than 2000 clocks ??? 
> 
> The stumbling block is how can we quickly do the first step of the inner
> loop - multiplying 156,250 bit trial factors together.  If someone has a
> clever algorithm that can do this in fewer than 3500 clocks per trial factor
> then we may have an improved factoring algorithm!

The other point I was trying to make is that a similar algorithm 
could be used to compute the product of small primes used by P-1 
stage 1 (and maybe stage 2 as well) with a highly significant 
improvement over the current method. There may also be an 
application in ECM.

Regards
Brian Beesley
Regards
Brian Beesley
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

Reply via email to