I happened to stumble across isqrt() in busybox and observed that
it can be shrunk and sped up at the same time.

It's only used in two places - diff(1) and factor(1).
The diff usage doesn't require handling 64-bit inputs, so the code size
could be reduced (especially on 32-bit platforms) by changing the
argument data type depending on whether factor(1) is enabled.  But
that's beyond my understanding of BB's build system.  I'm still not
sure what preprocessor symbol to use:
CONFIG_FACTOR
ENABLE_FACTOR
FEATURE_FACTOR
IF_FACTOR

It gets more involved because the obvious way to do it is to
use a typedef for the argument type, and that basically amounts
to duplicating factor's wide_t and half_t logic.

I'm also not sure what architectures to do size comparisons on.
ARM? x86? MIPS? Risc-V?  And 32 or 64 bits?

Anyway, the current isqrt code is basically:

uint32_t isqrt0(uint64_t N)
{
        uint32_t x = 0;
        int_fast8_t signed char shift = 8*sizeof N - 2;
        do {
                x = 2 * x + 1;
                if ((uint64_t)x * x > N >> shift)
                        x--; /* whoops, that +1 was too much */
        } while ((shift -= 2) >= 0);
        return x;
}

This compiles (-m64 -Os) to:
0000000000000000 <isqrt0>:
   0:   ba 3e 00 00 00          mov    $0x3e,%edx
   5:   31 c0                   xor    %eax,%eax
   7:   01 c0                   add    %eax,%eax
   9:   c4 62 eb f7 d1          shrx   %rdx,%rcx,%r10
   e:   44 8d 40 01             lea    0x1(%rax),%r8d
  12:   4d 89 c1                mov    %r8,%r9
  15:   4d 0f af c0             imul   %r8,%r8
  19:   4d 39 c2                cmp    %r8,%r10
  1c:   41 0f 43 c1             cmovae %r9d,%eax
  20:   83 ea 02                sub    $0x2,%edx
  23:   83 fa fe                cmp    $0xfffffffe,%edx
  26:   75 df                   jne    7 <isqrt0+0x7>
  28:   c3                      ret
(0x29 = 41 bytes)

This shows a bit of a problem: even with -Os, gcc isn't happy with
the perfectly nice loop termination condition and adds its own
separate loop-termination test.  This can be defeated with a
surprisingly portable bit of in-line asm():

uint32_t isqrt0a(uint64_t N)
{
        uint32_t x = 0;
        int_fast8_t shift = 8*sizeof N - 2;
        asm("" : "+g" (shift));
        do {
                x = 2 * x + 1;
                if ((uint64_t)x * x > N >> shift)
                        x--; /* whoops, that +1 was too much */
        } while ((shift -= 2) >= 0);
        return x;
}
The asm generates no instructions, but prevents the compiler from
trying to get clever with the number of loop iterations.  This
compiles to:

00000000000002bf <isqrt0a>:
 2bf:   31 d2                   xor    %edx,%edx
 2c1:   b0 3e                   mov    $0x3e,%al
 2c3:   01 d2                   add    %edx,%edx
 2c5:   c4 62 fb f7 d1          shrx   %rax,%rcx,%r10
 2ca:   44 8d 42 01             lea    0x1(%rdx),%r8d
 2ce:   4d 89 c1                mov    %r8,%r9
 2d1:   4d 0f af c0             imul   %r8,%r8
 2d5:   4d 39 c2                cmp    %r8,%r10
 2d8:   41 0f 43 d1             cmovae %r9d,%edx
 2dc:   2c 02                   sub    $0x2,%al
 2de:   79 e3                   jns    2c3 <isqrt0a+0x4>
 2e0:   89 d0                   mov    %edx,%eax
 2e2:   c3                      ret
(0x24 = 36 bytes)

This saves a 3-byte comparison at the end of the loop, and another
3 bytes because it can use an 8-bit register, which saves 3 bytes
of immediate when initializing it.

It saves another byte bytes by using %al for the shift amount,
allowing 1-byte shorter subtract instruction, but loses 2 bytes
to a mov instruction before returning, for a net total of -5 bytes.

However, even this can be beaten by a faster algorithm cribbed from
the Linux kernel.  Since the licenses are compatible, I didn't bother
rewriting it, but did trade space for time by removing the input-
dependent initialization of m, and just do 32 iterations always.

uint32_t isqrt1a(uint64_t x)
{
        uint64_t y = 0;
        uint64_t m = (uint64_t)1 << (8*sizeof x - 2);

        do {
                uint64_t b = y + m;
                y >>= 1;
                if (x >= b) {
                        x -= b;
                        y += m;
                }
                m >>= 2;
        } while (m);
        return (uint32_t)y;
}

This is significantly faster due to the absence of multiplies, and
it compiles to
0000000000000029 <isqrt1a>:
  29:   ba 01 00 00 00          mov    $0x1,%edx
  2e:   41 b8 20 00 00 00       mov    $0x20,%r8d
  34:   31 c0                   xor    %eax,%eax
  36:   48 c1 e2 3e             shl    $0x3e,%rdx
  3a:   4c 8d 0c 10             lea    (%rax,%rdx,1),%r9
  3e:   48 d1 e8                shr    $1,%rax
  41:   4c 39 c9                cmp    %r9,%rcx
  44:   72 06                   jb     4c <isqrt1a+0x23>
  46:   4c 29 c9                sub    %r9,%rcx
  49:   48 01 d0                add    %rdx,%rax
  4c:   48 c1 ea 02             shr    $0x2,%rdx
  50:   41 ff c8                dec    %r8d
  53:   75 e5                   jne    3a <isqrt1a+0x11>
  55:   c3                      ret
(0x2d = 45 bytes)

Again, we see gcc decides it doesn't like the perfectly good loop
counter provided by m (%rdx), and thinks it needs to create its
own separate counter in %r8d, producing larger code than the
original isqrt0.  But the same asm("") trick work on this:

uint32_t isqrt1b(uint64_t x)
{
        uint64_t y = 0;
        uint64_t m = (uint64_t)1 << (8 * sizeof x - 2);
        asm("" : "+g" (m));

        do {
                uint64_t b = y + m;
                y >>= 1;
                if (x >= b) {
                        x -= b;
                        y += m;
                }
                m >>= 2;
        } while (m);
        return (uint32_t)y;
}

compiles to

0000000000000056 <isqrt1b>:
  56:   ba 01 00 00 00          mov    $0x1,%edx
  5b:   31 c0                   xor    %eax,%eax
  5d:   48 c1 e2 3e             shl    $0x3e,%rdx
  61:   4c 8d 04 10             lea    (%rax,%rdx,1),%r8
  65:   48 d1 e8                shr    $1,%rax
  68:   4c 39 c1                cmp    %r8,%rcx
  6b:   72 06                   jb     73 <isqrt1b+0x1d>
  6d:   4c 29 c1                sub    %r8,%rcx
  70:   48 01 d0                add    %rdx,%rax
  73:   48 c1 ea 02             shr    $0x2,%rdx
  77:   75 e8                   jne    61 <isqrt1b+0xb>
  79:   c3                      ret
(0x24 = 36 bytes)

The same size as isqrt0a, and faster even though it's not branch-free.
But we can do a bit better yet.  The initialization of m in %rdx is done
with a 9-byte, 2-instruction sequence: mov $1,%edx; shl $62,%rdx, which
is almost as large as movabs $0x4000000000000000,%rdx (10 bytes).  But
GCC doesn't seem to know that is doesn't have to use a 4-byte immediate
to load %edx.  Only the low 2 bits matter after the shift, so we can
use a shorter byte move at the expense of an ugly architecture-specific
#ifdef and a dependency on the previous value of %edx:

uint32_t isqrt1c(uint64_t x)
{
        uint64_t y = 0;
#ifndef __x86_64__
        uint64_t m = (uint64_t)1 << (8 * sizeof x - 2);
        asm("" : "+g" (m));
#else
        uint64_t m;
        asm("movb $1,%b0" : "=r" (m));
        m <<= (8 * sizeof x - 2);
#endif

        do {
                uint64_t b = y + m;
                y >>= 1;
                if (x >= b) {
                        x -= b;
                        y += m;
                }
                m >>= 2;
        } while (m);
        return (uint32_t)y;
}

000000000000007a <isqrt1c>:
  7a:   31 c0                   xor    %eax,%eax
  7c:   b2 01                   mov    $0x1,%dl
  7e:   48 c1 e2 3e             shl    $0x3e,%rdx
  82:   4c 8d 04 10             lea    (%rax,%rdx,1),%r8
  86:   48 d1 e8                shr    $1,%rax
  89:   4c 39 c1                cmp    %r8,%rcx
  8c:   72 06                   jb     94 <isqrt1c+0x1a>
  8e:   4c 29 c1                sub    %r8,%rcx
  91:   48 01 d0                add    %rdx,%rax
  94:   48 c1 ea 02             shr    $0x2,%rdx
  98:   75 e8                   jne    82 <isqrt1c+0x8>
  9a:   c3                      ret
(0x31 = 33 bytes)

Looking at this led me to look at factor(1).  First, there are ways the
current trial division could be improved.  The packing of the mod-2310
wheel is very impressive, but the unpacking seems wasteful.  It could
easily be merged with the trial division loop, or the wheel could be
generated algorithmically without the compressed table.

The comment that large cofactors take a long time for trial division
to prove prime is certainly true.  There are plenty of fairly simple
ways to verify primality or factor more efficiently (Pollard rho,
Hart's one line factorization), but they all rely on a foundation of arithmetic 
modulo n and (for the factoring part) GCD computation.
Which is more code.

I'm not sure what the desired tradeoff here is.  Modulo-n arithmetic
requires widening multiply for intermediate values, which is both slow
and a lot of code to emulate if the platform doesn't have it.  Maybe
I should have a look at the arbitrary-precision code in dc(1)?

Also, the way squares are handled by recursion into factorize()
seems inefficient.  If I ask for the factorization of p^2 for a
large prime p (say, 2^32-5), I'll trial divide up to p, then
recurse and trial divide up to sqrt(p) again.

Admittedly, it's a tiny fraction of the work already done, but
it's completely pointless; we already know there are no factors
in that range.

If you want to return an odd root estimate r and a number of exact
square roots s, and don't want to use a struct or global variables,
then return r << s.  It's guaranteed to fit into 32 bits (it's either
0xffffffff, or 0xffff << 1 or 0xff << 2, or...) and easy to unpack.

But it's probably simpler to reorder the main loop so there's only
one isqrt_odd() call, then things can be inlined, giving easier
access to loop variables.

Any suggestions for tinkering in this area?

_______________________________________________
busybox mailing list
[email protected]
https://lists.busybox.net/mailman/listinfo/busybox

Reply via email to