I happened to stumble across isqrt() in busybox and observed that
it can be shrunk and sped up at the same time.
It's only used in two places - diff(1) and factor(1).
The diff usage doesn't require handling 64-bit inputs, so the code size
could be reduced (especially on 32-bit platforms) by changing the
argument data type depending on whether factor(1) is enabled. But
that's beyond my understanding of BB's build system. I'm still not
sure what preprocessor symbol to use:
CONFIG_FACTOR
ENABLE_FACTOR
FEATURE_FACTOR
IF_FACTOR
It gets more involved because the obvious way to do it is to
use a typedef for the argument type, and that basically amounts
to duplicating factor's wide_t and half_t logic.
I'm also not sure what architectures to do size comparisons on.
ARM? x86? MIPS? Risc-V? And 32 or 64 bits?
Anyway, the current isqrt code is basically:
uint32_t isqrt0(uint64_t N)
{
uint32_t x = 0;
int_fast8_t signed char shift = 8*sizeof N - 2;
do {
x = 2 * x + 1;
if ((uint64_t)x * x > N >> shift)
x--; /* whoops, that +1 was too much */
} while ((shift -= 2) >= 0);
return x;
}
This compiles (-m64 -Os) to:
0000000000000000 <isqrt0>:
0: ba 3e 00 00 00 mov $0x3e,%edx
5: 31 c0 xor %eax,%eax
7: 01 c0 add %eax,%eax
9: c4 62 eb f7 d1 shrx %rdx,%rcx,%r10
e: 44 8d 40 01 lea 0x1(%rax),%r8d
12: 4d 89 c1 mov %r8,%r9
15: 4d 0f af c0 imul %r8,%r8
19: 4d 39 c2 cmp %r8,%r10
1c: 41 0f 43 c1 cmovae %r9d,%eax
20: 83 ea 02 sub $0x2,%edx
23: 83 fa fe cmp $0xfffffffe,%edx
26: 75 df jne 7 <isqrt0+0x7>
28: c3 ret
(0x29 = 41 bytes)
This shows a bit of a problem: even with -Os, gcc isn't happy with
the perfectly nice loop termination condition and adds its own
separate loop-termination test. This can be defeated with a
surprisingly portable bit of in-line asm():
uint32_t isqrt0a(uint64_t N)
{
uint32_t x = 0;
int_fast8_t shift = 8*sizeof N - 2;
asm("" : "+g" (shift));
do {
x = 2 * x + 1;
if ((uint64_t)x * x > N >> shift)
x--; /* whoops, that +1 was too much */
} while ((shift -= 2) >= 0);
return x;
}
The asm generates no instructions, but prevents the compiler from
trying to get clever with the number of loop iterations. This
compiles to:
00000000000002bf <isqrt0a>:
2bf: 31 d2 xor %edx,%edx
2c1: b0 3e mov $0x3e,%al
2c3: 01 d2 add %edx,%edx
2c5: c4 62 fb f7 d1 shrx %rax,%rcx,%r10
2ca: 44 8d 42 01 lea 0x1(%rdx),%r8d
2ce: 4d 89 c1 mov %r8,%r9
2d1: 4d 0f af c0 imul %r8,%r8
2d5: 4d 39 c2 cmp %r8,%r10
2d8: 41 0f 43 d1 cmovae %r9d,%edx
2dc: 2c 02 sub $0x2,%al
2de: 79 e3 jns 2c3 <isqrt0a+0x4>
2e0: 89 d0 mov %edx,%eax
2e2: c3 ret
(0x24 = 36 bytes)
This saves a 3-byte comparison at the end of the loop, and another
3 bytes because it can use an 8-bit register, which saves 3 bytes
of immediate when initializing it.
It saves another byte bytes by using %al for the shift amount,
allowing 1-byte shorter subtract instruction, but loses 2 bytes
to a mov instruction before returning, for a net total of -5 bytes.
However, even this can be beaten by a faster algorithm cribbed from
the Linux kernel. Since the licenses are compatible, I didn't bother
rewriting it, but did trade space for time by removing the input-
dependent initialization of m, and just do 32 iterations always.
uint32_t isqrt1a(uint64_t x)
{
uint64_t y = 0;
uint64_t m = (uint64_t)1 << (8*sizeof x - 2);
do {
uint64_t b = y + m;
y >>= 1;
if (x >= b) {
x -= b;
y += m;
}
m >>= 2;
} while (m);
return (uint32_t)y;
}
This is significantly faster due to the absence of multiplies, and
it compiles to
0000000000000029 <isqrt1a>:
29: ba 01 00 00 00 mov $0x1,%edx
2e: 41 b8 20 00 00 00 mov $0x20,%r8d
34: 31 c0 xor %eax,%eax
36: 48 c1 e2 3e shl $0x3e,%rdx
3a: 4c 8d 0c 10 lea (%rax,%rdx,1),%r9
3e: 48 d1 e8 shr $1,%rax
41: 4c 39 c9 cmp %r9,%rcx
44: 72 06 jb 4c <isqrt1a+0x23>
46: 4c 29 c9 sub %r9,%rcx
49: 48 01 d0 add %rdx,%rax
4c: 48 c1 ea 02 shr $0x2,%rdx
50: 41 ff c8 dec %r8d
53: 75 e5 jne 3a <isqrt1a+0x11>
55: c3 ret
(0x2d = 45 bytes)
Again, we see gcc decides it doesn't like the perfectly good loop
counter provided by m (%rdx), and thinks it needs to create its
own separate counter in %r8d, producing larger code than the
original isqrt0. But the same asm("") trick work on this:
uint32_t isqrt1b(uint64_t x)
{
uint64_t y = 0;
uint64_t m = (uint64_t)1 << (8 * sizeof x - 2);
asm("" : "+g" (m));
do {
uint64_t b = y + m;
y >>= 1;
if (x >= b) {
x -= b;
y += m;
}
m >>= 2;
} while (m);
return (uint32_t)y;
}
compiles to
0000000000000056 <isqrt1b>:
56: ba 01 00 00 00 mov $0x1,%edx
5b: 31 c0 xor %eax,%eax
5d: 48 c1 e2 3e shl $0x3e,%rdx
61: 4c 8d 04 10 lea (%rax,%rdx,1),%r8
65: 48 d1 e8 shr $1,%rax
68: 4c 39 c1 cmp %r8,%rcx
6b: 72 06 jb 73 <isqrt1b+0x1d>
6d: 4c 29 c1 sub %r8,%rcx
70: 48 01 d0 add %rdx,%rax
73: 48 c1 ea 02 shr $0x2,%rdx
77: 75 e8 jne 61 <isqrt1b+0xb>
79: c3 ret
(0x24 = 36 bytes)
The same size as isqrt0a, and faster even though it's not branch-free.
But we can do a bit better yet. The initialization of m in %rdx is done
with a 9-byte, 2-instruction sequence: mov $1,%edx; shl $62,%rdx, which
is almost as large as movabs $0x4000000000000000,%rdx (10 bytes). But
GCC doesn't seem to know that is doesn't have to use a 4-byte immediate
to load %edx. Only the low 2 bits matter after the shift, so we can
use a shorter byte move at the expense of an ugly architecture-specific
#ifdef and a dependency on the previous value of %edx:
uint32_t isqrt1c(uint64_t x)
{
uint64_t y = 0;
#ifndef __x86_64__
uint64_t m = (uint64_t)1 << (8 * sizeof x - 2);
asm("" : "+g" (m));
#else
uint64_t m;
asm("movb $1,%b0" : "=r" (m));
m <<= (8 * sizeof x - 2);
#endif
do {
uint64_t b = y + m;
y >>= 1;
if (x >= b) {
x -= b;
y += m;
}
m >>= 2;
} while (m);
return (uint32_t)y;
}
000000000000007a <isqrt1c>:
7a: 31 c0 xor %eax,%eax
7c: b2 01 mov $0x1,%dl
7e: 48 c1 e2 3e shl $0x3e,%rdx
82: 4c 8d 04 10 lea (%rax,%rdx,1),%r8
86: 48 d1 e8 shr $1,%rax
89: 4c 39 c1 cmp %r8,%rcx
8c: 72 06 jb 94 <isqrt1c+0x1a>
8e: 4c 29 c1 sub %r8,%rcx
91: 48 01 d0 add %rdx,%rax
94: 48 c1 ea 02 shr $0x2,%rdx
98: 75 e8 jne 82 <isqrt1c+0x8>
9a: c3 ret
(0x31 = 33 bytes)
Looking at this led me to look at factor(1). First, there are ways the
current trial division could be improved. The packing of the mod-2310
wheel is very impressive, but the unpacking seems wasteful. It could
easily be merged with the trial division loop, or the wheel could be
generated algorithmically without the compressed table.
The comment that large cofactors take a long time for trial division
to prove prime is certainly true. There are plenty of fairly simple
ways to verify primality or factor more efficiently (Pollard rho,
Hart's one line factorization), but they all rely on a foundation of arithmetic
modulo n and (for the factoring part) GCD computation.
Which is more code.
I'm not sure what the desired tradeoff here is. Modulo-n arithmetic
requires widening multiply for intermediate values, which is both slow
and a lot of code to emulate if the platform doesn't have it. Maybe
I should have a look at the arbitrary-precision code in dc(1)?
Also, the way squares are handled by recursion into factorize()
seems inefficient. If I ask for the factorization of p^2 for a
large prime p (say, 2^32-5), I'll trial divide up to p, then
recurse and trial divide up to sqrt(p) again.
Admittedly, it's a tiny fraction of the work already done, but
it's completely pointless; we already know there are no factors
in that range.
If you want to return an odd root estimate r and a number of exact
square roots s, and don't want to use a struct or global variables,
then return r << s. It's guaranteed to fit into 32 bits (it's either
0xffffffff, or 0xffff << 1 or 0xff << 2, or...) and easy to unpack.
But it's probably simpler to reorder the main loop so there's only
one isqrt_odd() call, then things can be inlined, giving easier
access to loop variables.
Any suggestions for tinkering in this area?
_______________________________________________
busybox mailing list
[email protected]
https://lists.busybox.net/mailman/listinfo/busybox