Rich Schroeppel <[EMAIL PROTECTED]> writes:

<<<
The cost analysis of trial factoring by GCDs of 2^P-1 with the
product of many small candidate divisors ignores an important
optimization: All the candidates can be multiplied together
mod 2^P-1, and ONLY ONE GCD NEEDS TO BE DONE. The major cost is
probably the multiplications. Reduction mod 2^P-1 is particularly
fast, but the same principle applies to random ultra-large targets.

The products are probably best done by grouping pairs of trial
divisors, then pairs of pairs, etc. The lowest levels can be
chopped away if we use divisors in arithmetic progressions, which
can be grouped by (say) eight at a time, and the products evaluated
as an octic polynomial, which can be reduced to evaluating eight
bignum additions per polynomial. The easiest way to develop the
necessary finite difference parameters is to evaluate a few
products directly, and calculate the differences. This dodges
the nuisance of computing polynomial products and Stirling numbers.
>>>
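
To fix ideas, here is a rough Python sketch of the single-gcd
batching described above. The exponent, the candidate form, and all
names are purely illustrative, and the generic % stands in for the
fast reduction mod 2^p-1:

from math import gcd

# Illustrative parameters only: p = 67 is tiny, but 2^67-1 is
# composite, so a batch of this size should actually turn up a factor.
p = 67
M = (1 << p) - 1
# For prime p, any factor of 2^p-1 has the form 2*k*p + 1 (the usual
# further sieving on q mod 8 etc. is omitted here).
candidates = [2 * k * p + 1 for k in range(1, 2_000_000)]

acc = 1
for q in candidates:
    acc = (acc * q) % M        # accumulate the product mod 2^p - 1

g = gcd(acc, M)                # ONE gcd covers the entire batch
if g > 1:
    # Some candidate shares a factor with 2^p - 1; rescanning the
    # (small, cheap) candidates pinpoints which one(s).
    print(g, [q for q in candidates if M % q == 0])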

An excellent point, Rich. So now (assuming the number of p-bit input
products is >> 1) the single gcd is negligible. Generating each p-bit
input product needs a product of 2 p/2-bit terms, which were generated
from 4 p/4-bit terms, etc. With log2(p) levels in this recursion,
each costing O(p * log(p)) work for FFT-based multiplication,
generating each input product needs O(p * (log(p))^2) work, which
is asymptotically the same as for the gcd, but with a much smaller
constant buried in the O(). (The modular multiply of each new input
product with the accumulated modular product needs just a single
p-bit multiply and O(p * log(p)) work, and is thus negligible
compared to generating the input product from the individual factor
candidates in the first place.) Thus the asymptotic opcount remains
a factor of log(p) greater than for the one-candidate-at-a-time (or
few-at-a-time) method, but this approach has benefits which may
make it attractive, especially on hardware where we can do gcd very
fast (e.g. using floating-point), and when the size of the candidate
factors exceeds the size that lends itself to fast hardware multiply.
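
To make sure I'm reading the product-generation step correctly, here
is a bare-bones sketch of the pairing recursion, together with my
reading of Rich's finite-difference trick for the lowest levels (the
8-at-a-time grouping follows his suggestion; the function names and
parameters are mine, purely for illustration):

from math import prod

def product_tree(terms):
    """Multiply a non-empty list of integers by pairing neighbors,
    then pairs of pairs, etc., so the expensive multiplies always
    involve operands of roughly equal size."""
    while len(terms) > 1:
        paired = [terms[i] * terms[i + 1]
                  for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:        # an odd element rides up a level unchanged
            paired.append(terms[-1])
        terms = paired
    return terms[0]

def octic_block(q, d):
    """Product of 8 trial divisors in arithmetic progression:
    q, q+d, ..., q+7*d."""
    return prod(q + i * d for i in range(8))

def octic_blocks_by_differences(q0, d, count):
    """Yield octic_block(q0 + 8*d*j, d) for j = 0, 1, ..., count-1.

    As a function of j the block product is a degree-8 polynomial,
    so after seeding a forward-difference table from 9 directly
    computed products, each further product costs just 8 bignum
    additions."""
    D = [octic_block(q0 + 8 * d * j, d) for j in range(9)]  # direct seeds
    for level in range(1, 9):     # turn the seeds into Delta^i at j = 0
        for k in range(8, level - 1, -1):
            D[k] -= D[k - 1]
    for _ in range(count):
        yield D[0]
        for i in range(8):        # 8 additions advance the whole table
            D[i] += D[i + 1]

if __name__ == "__main__":
    # sanity check against direct evaluation, using the q = 2*k*p + 1
    # progression from the earlier sketch (q0 = 135, stride d = 134)
    vals = list(octic_blocks_by_differences(135, 134, 50))
    assert all(v == octic_block(135 + 8 * 134 * j, 134)
               for j, v in enumerate(vals))

The bottom-level blocks would then feed product_tree(), and its
output would be multiplied into the running residue mod 2^p-1, with
the single gcd done once at the very end.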

Are you aware of anyone who has implemented an efficient version of
such a method? It would be nice to have some idea of its speed in
practice before investing the significant programming time needed
to code a (say) Mersenne-optimized version for various hardware
targets.

-Ernst
