On Tue, Feb 3, 2026 at 3:52 PM David Sparks <[email protected]> wrote:
> I happened to stumble across isqrt() in busybox and observed that
> it can be shrunk and sped up at the same time.
>
> It's only used in two places - diff(1) and factor(1).
> The diff usage doesn't require handling 64-bit inputs, so the code size
> could be reduced (especially on 32-bit platforms) by changing the
> argument data type depending on whether factor(1) is enabled.  But
> that's beyond my understanding of BB's build system.  I'm still not
> sure what preprocessor symbol to use:
> CONFIG_FACTOR
> ENABLE_FACTOR
> FEATURE_FACTOR
> IF_FACTOR

I'll add a separate version just for "unsigned long".
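
Something like the sketch below, perhaps (a rough cut only, not the
committed version; isqrt_ul() is a placeholder name). It is the same
bit-pair algorithm as the 64-bit routine quoted below, just sized for
unsigned long, so 32-bit platforms only run 16 loop iterations:

#include <limits.h>

/* Sketch: digit-by-digit square root over unsigned long,
 * which is all the diff(1) caller needs. */
unsigned long isqrt_ul(unsigned long x)
{
        unsigned long y = 0;
        /* highest even bit position of unsigned long */
        unsigned long m = 1UL << (sizeof(x) * CHAR_BIT - 2);

        do {
                unsigned long b = y + m;
                y >>= 1;
                if (x >= b) {
                        x -= b;
                        y += m;
                }
                m >>= 2;
        } while (m);
        return y;
}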

> However, even this can be beaten by a faster algorithm cribbed from
> the Linux kernel.  Since the licenses are compatible, I didn't bother
> rewriting it, but did trade a little time for space by removing the
> input-dependent initialization of m and just doing 32 iterations always.
>
> uint32_t isqrt1a(uint64_t x)
> {
>         uint64_t y = 0;
>         /* m walks the even bit positions: 2^62, 2^60, ..., 2^0 */
>         uint64_t m = (uint64_t)1 << (8*sizeof x - 2);
>
>         do {
>                 uint64_t b = y + m;
>                 y >>= 1;
>                 if (x >= b) {
>                         /* this bit belongs in the root */
>                         x -= b;
>                         y += m;
>                 }
>                 m >>= 2;
>         } while (m);
>         return (uint32_t)y;
> }

Looks good. I don't think it's worth resorting to asm() tricks
to make gcc optimize this better.
(It would be better to spend the time improving gcc so it does this
well for everyone ...but yes, that's also much harder).


> Looking at this led me to look at factor(1).  First, there are ways the
> current trial division could be improved.  The packing of the mod-2310
> wheel is very impressive, but the unpacking seems wasteful.

For any even remotely taxing integer to factorize, unpack_wheel()
takes an insignificant amount of time by comparison.

>  It could
> easily be merged with the trial division loop, or the wheel could be
> generated algorithmically without the compressed table.

Adding more code to the trial division loop will cause additional
register pressure (=> slower code).
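
For reference, generating the gap table once at startup (outside the
division loop, so it adds nothing to the hot path) could look roughly
like the sketch below. build_wheel_gaps() is a hypothetical stand-in,
not the existing unpack_wheel(), and the real packed table may start
the wheel at a different offset:

#include <stdint.h>

/* Hypothetical sketch: build the mod-2310 wheel (gaps between
 * successive integers coprime to 2*3*5*7*11) at startup instead of
 * unpacking a compressed table.  There are exactly 480 gaps per
 * 2310-wide period. */
static unsigned build_wheel_gaps(uint8_t gaps[480])
{
        unsigned count = 0;
        unsigned prev = 1;
        unsigned n;

        for (n = 2; n <= 2310 + 1; n++) {
                if (n % 2 && n % 3 && n % 5 && n % 7 && n % 11) {
                        gaps[count++] = n - prev;
                        prev = n;
                }
        }
        return count; /* always 480 */
}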

> The comment that large cofactors take a long time for trial division
> to prove prime is certainly true.  There are plenty of fairly simple
> ways to verify primality or factor more efficiently (Pollard rho,
> Hart's one line factorization), but they all rely on a foundation of 
> arithmetic modulo n and (for the factoring part) GCD computation.
> Which is more code.

I didn't go for full-blown, "actually smart" code for this,
since I'm not sure we have users for it.

However, if you're absolutely itching to implement something
bigger but faster (or capable of handling more than 64 bits),
please do submit a patch.
It can always go under a CONFIG option if it's "too big"
to be unconditional.

Frankly, a serious implementation of this would be rather different,
and would probably be better as a whole separate piece of code, mostly
unrelated to the current one (IOW: not an ifdef forest added to what
we have here).
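
For the record, the core of a Pollard rho factorizer on top of a
64x64->128 widening multiply is not huge. A self-contained sketch
(all names hypothetical, GCC/clang unsigned __int128 assumed, and no
claim that this is the right shape for busybox):

#include <stdint.h>

/* (a * b) mod n using the compiler's widening multiply */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n)
{
        return (uint64_t)((unsigned __int128)a * b % n);
}

/* (a + b) mod n without overflow, assuming a < n and b < n */
static uint64_t addmod(uint64_t a, uint64_t b, uint64_t n)
{
        return (a >= n - b) ? a - (n - b) : a + b;
}

static uint64_t gcd64(uint64_t a, uint64_t b)
{
        while (b) {
                uint64_t t = a % b;
                a = b;
                b = t;
        }
        return a;
}

/* Pollard rho with Floyd cycle detection on f(x) = x^2 + c mod n.
 * n must be odd, composite and > 3; c must be in [1, n-1).
 * Returns a nontrivial factor, or n on failure (retry with another c). */
static uint64_t pollard_rho(uint64_t n, uint64_t c)
{
        uint64_t x = 2, y = 2, d = 1;

        while (d == 1) {
                x = addmod(mulmod(x, x, n), c, n);   /* tortoise: one step */
                y = addmod(mulmod(y, y, n), c, n);   /* hare: two steps */
                y = addmod(mulmod(y, y, n), c, n);
                d = gcd64(x > y ? x - y : y - x, n);
        }
        return d;
}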

> I'm not sure what the desired tradeoff here is.  Modulo-n arithmetic
> requires widening multiply for intermediate values, which is both slow
> and a lot of code to emulate if the platform doesn't have it.  Maybe
> I should have a look at the arbitrary-precision code in dc(1)?

That one is decimal-based (-> slow).
The TLS code has some arbitrary-precision binary routines.
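
FWIW, the widening multiply can be avoided entirely with a
shift-and-add mulmod, which is small code but slow (~64 iterations
per multiply). A hedged sketch, hypothetical code:

#include <stdint.h>

/* (a * b) mod n with no widening multiply: classic shift-and-add.
 * Every intermediate stays below n, so nothing overflows uint64_t. */
static uint64_t mulmod_no128(uint64_t a, uint64_t b, uint64_t n)
{
        uint64_t result = 0;

        a %= n;
        while (b) {
                if (b & 1) {
                        /* result = (result + a) mod n, overflow-safe */
                        result = (result >= n - a) ? result - (n - a) : result + a;
                }
                /* a = (2 * a) mod n, overflow-safe */
                a = (a >= n - a) ? a - (n - a) : a + a;
                b >>= 1;
        }
        return result;
}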

> Also, the way squares are handled by recursion into factorize()
> seems inefficient.  If I ask for the factorization of p^2 for a
> large prime p (say, 2^32-5), I'll trial divide up to p, then
> recurse and trial divide up to sqrt(p) again.

I don't think so.

factorize(N) calls isqrt_odd(N) _before_ factorizing.
isqrt_odd(N) will detect that N=p^2
(by discovering that isqrt(N)^2 = N)
and will factorize p *once* (will just print its factors twice).
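
Roughly, the flow is as in this illustrative sketch (not the actual
factor.c; isqrt64() stands for the routine quoted above, and
print_factors() for the real factorization/printing code):

#include <stdint.h>

extern uint64_t isqrt64(uint64_t x);     /* the routine quoted above */
extern void print_factors(uint64_t n);   /* stand-in for the real work */

static void factorize_square_aware(uint64_t N)
{
        uint64_t r = isqrt64(N);

        if (r * r == N) {
                /* N is a perfect square: factorize r only once and
                 * emit its factor list twice */
                print_factors(r);
                print_factors(r);
                return;
        }
        print_factors(N);
}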

Tell me what I'm missing.
_______________________________________________
busybox mailing list
[email protected]
https://lists.busybox.net/mailman/listinfo/busybox
