Mersenne: Re: fast GCD

EWMAYER Thu, 28 Oct 1999 16:21:22 -0700
Alex Kruppa wrote:

> Can this be used to do a LL test and Pollard-rho factoring attempt at
> once?

Jason Papadopoulos wrote:

> Except that every once in a while you'd have to perform a GCD, and
> a GCD on million-digit size numbers takes about a day (no kidding!
> There seems to be no fast way to do one!)

A day seems somewhat of an overestimate (though maybe it takes that
long on a Pentium - I don't know.) On a decently fast Alpha, say a 500MHz
21164, I can do a million-digit (and note I mean base-10 digits, not bits)
GCD in under 5 minutes using a simple O(n^2) Lehmer-type implementation
with scalar multipliers of around 63 bits - one gets similar performance
on MIPS after adjusting for the typically lower clock rates. (Things should
be 2-3x faster on Alpha 21264, due its much better integer multiply
The code in question is my own (and will be publically available when I
release my Mersenne p-1 code), but Torbjorn Granlund sent me timings that
indicate the GMP library's GCD is similarly fast on such hardware. The
Alpha implementation uses the UMULH isntuction to get the upper 64 bits
of the 64x64==>128-bit unsigned integer products which arise when
multiplying the digits of a base-2^64 multiword integer by a scalar
of up to 64 bits in size, and the MIPS version uses DMULTU to get the
full 128-bit product. On CPUs supporting 64-bit integers and floats
but lacking such 128-bit multiply instructions, one can emulate UMULH via
a floating multiply, using scalars up to 52 bits in size (we need to save
one of the mantissa bits to effect error correction of the FMUL result.)

On the Intel IA-64, there is hardware support for a (now fully 64-bit)
UMULH-like instruction in the floating multiplier (also used to get the
low 64 bits of the product.) One can in theory do similar on the Pentium,
but for this to be efficient, one needs to be able to load a 64-bit int
directly into the 64-bit mantissa of a floating-point register - I don't
know enough about the x86 to say whether one can do such an operation
on current CPUs. I think George uses similar tricks to do the speedy
Prime95 sieve-based factoring on the Pentium, so it seems it should be
possible.

Lastly, note that one doesn't need to do many GCDs if one is running, say,
a p-1 or ECM factoring algorithm - for large numbers, one could restrict
oneself to to doing a single GCD at the end of stage 1 and perhaps every
day or so in stage 2. This makes a "fast" (i.e. asymptotically faster
than O(n^2)) GCD less pressing.

Alex wrote:

>Thats true (btw, I keep hearing of a n*log(n) gcd - how long would that
>take then? Where can I read up on it - Knuth vol.2 ?)

Richard Candall claims the fastest he's heard of is O(n*log^2(n)), i.e.
a factor of log(n) slower order than a length-n FFT, though for n >> 1,
this should still be much faster than O(n^2). The problem is that it's
likely highly nontrivial to code - using a well-tested library (e.g. GMP)
to effect the basic operations would probably be one's best bet if one
doesn't want to spend months coding it all oneself.

Peter Montgomery's PhD thesis mentions a fast GCD due to Moenck, but
I don't know what the precise work estimate is for that algorithm - Peter?

Other GCD-related musings I've been mulling over:

1) Is there a fast (e.g. O(n*long(n) or such) iterative method to generate
a modular inverse modulo a Mersenne or Fermat number? Of course one can
use an extended GCD algorithm to get the inverse, but I had in mind
something more elegantly, which ideally would allow fast DWT-based FFT
arithmetic to be used, i.e. which avoids having to convert an n-bit
mixed-radix DWT floating-point factoring residue to a fixed-radix integer,
do the EGCD, then convert the result back to floating form in preparation
for more DWT-style arithmetic. This question was inspired by the fact that
one can use a modular analog of the well-known Newton iterative scheme
for multiplicative inverses to get inverses modulo some desired power of 2.
For example, if x is some integer and y_n is the multiplicative inverse of
x modulo, say, 2^16, then one obtains the mod-2^32 inverse via

         y_{n+1} = y_n * (2 - x*y_n),

where all intermediate arithmetic is done modulo 2^32. Alas, this scheme
doesn't work except for moduli which are powers of 2.

2) If a fast iteration as described in (1) did exist, could one use it
to obtain a GCD in similar time?

Cheers,
-Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
Mersenne: Re: fast GCD

Reply via email to