Recently, I've been playing around with the use of the Chinese
Remainder Theorem (CRT) to effect efficient short-length
convolution algorithms. For those of you not familiar with
CRT, a brief tutorial: given two length-n vectors,

A = [a_0, a_1, ... , a_(n-1)] and B = [b_0, b_1, ... , b_(n-1)],

we can view each as the coefficients of a polynomial of degree n-1,
i.e. the vector A represents the polynomial

a0 + a1*x + ... + a_(n-1)*x^(n-1) .

Cyclic convolution of the vectors A and B now corresponds to
the polynomial product modulo P = x^n - 1 (i.e. every output term
of the full double-length polynomial product of form
c_(n+k)*x^(n+k) gets 'folded' in with the c_k*x^k term, since
x^(n+k) == x^k modulo x^n - 1); acyclic convolution corresponds
to polynomial product modulo Q = x^n + 1 (output terms folded
similarly but with a sign change, e.g. c_k*x^k + ... + c_(n+k)*x^(n+k)
becomes [c_k - c_(n+k)]*x^k ).

If the polynomials P and Q can be factored over the rational numbers,
then CRT allows us to reduce the input polynomials modulo the factors,
compute a set of shorter-length subconvolutions (also modulo the factors)
and then reconstruct the desired outputs of the convolutions modulo P
and Q from the outputs of the shorter subconvolutions. If P and Q are
highly composite, one can generally construct highly efficient convolution
algorithms this way.

As one specific case, I've been working on length-6 convolutions.
For n=6, P and Q have the factorizations

P = x^6-1 = (x^3-1)*(x^3+1) = (x-1)*(x+1)*(x^2-x+1)*(x^2+x+1),

Q = x^6+1 = (x^2+1)*(x^4-x^2+1).

Naive convolution of two length-6 real vectors needs 36 multiplies and
36 adds (same for cyclic and acyclic). The goal is to try to minimize
the total operation count (add + multiply), with special emphasis on
multiply (that means, given two alternatives with similar total opcount,
we prefer the one with fewer multiplies.)

Using CRT, and considering one of the input vectors to be fixed
(for instance if we were doing many convolutions with different
A inputs but all with the same B) the best I've managed to achieve
so far is:

Cyclic (mod P) convolution:  8 mul, 40 add
Acyclic (mod Q) convolution: 15 mul, 41 add .

(I can do the acyclic with as few as 12 mul, but only at the cost
of ~10 more adds.)

Does anyone know of algorithms more efficient than the above?

* * *

On a related note: for odd-length convolutions, we may not need
to consider cyclic and acyclic separately. That's because we can
usually (I suspect always, but haven't proven this yet) find a
combination of sign changes on the input terms which converts
one into the other, so we need only consider the cheaper of the
two (which usually appears to be the cyclic, although why this
should be so isn't obvious to me - perhaps x^n - 1 generally
is smoother in terms of its factorization than is x^n + 1; for
instance, x^n - 1 always has at least the factor x-1.) As an
example, consider a length-3 acyclic convolution, with output
coefficients as follows:

x^0: a0.b0-a1.b2-a2.b1
x^1: a0.b1+a1.b0-a2.b2
x^2: a0.b2+a1.b1+a2.b0

If we flip the sign of a0, a2 and b1, we get a cyclic convolution
out of this, just with the signs of the x^0 and x^2 outputs flipped.

Interestingly, one of the 'bibles' of the field, by H. Nussbaumer,
doesn't seem to be aware of this trick: Nussbaumer's length-3 cyclic
convolution algorithm needs 4 mul, 11 add; his length-3 acyclic, OTOH,
needs 5 mul and a whopping 20 add. This is insane, since as just shown
we can very easily convert a length-3 acyclic into a length-3 cyclic.

For even-length convolutions, this trick fails, since we can only flip
an even number of signs this way, and a length-n acyclic has precisely
n*(n-1)/2 minuses, which is always an odd number. However, there one
can ask whether there exists a set of sign changes on the inputs which
converts all but one of the signs to pluses. In that case, one could
simply do a length-n cyclic convolution (assuming it's more efficient
than its acyclic counterpart) on the sign-changed inputs,
and then one additional multiply and two adds would suffice to fix the
single residual minus-signed term. I need to see whether this can be
done for the length-6 case: in that case the opcount would drop to
a mere 9 multiplies and 42 adds.

-Ernst

Reply via email to