Simplified strtonum(3)

Anthony Coulter Tue, 17 Jan 2023 13:02:57 -0800

Hi @tech,

The bottom of this email contains a simplified strtonum implementation.
It should behave exactly the same way as the existing strtonum, but it
doesn't call strtoll, which means we can drop some unnecessary
complexity:
- No base conversion
- No need to save and restore errno (which is a bigger deal than it
  sounds like because of the weird mutable array-of-structures that's
  used to recover errno in the existing strtonum.c)
- No complicated sign-dependent "cutoff/cutlim" tests like the ones we
  see in strtoll.


That last point is most interesting. We're doing our 10*accumulator + d
arithmetic in a 64-bit register, but we can report overflow as soon as
we blow past 2^63 = -LLONG_MIN, which means that we have a bit of
headroom in which we have some wiggle room when we try to determine
whether 10*accumulator + d will cause an integer overflow in the
register itself.

Let "max = (-LLONG_MIN) / 10 + 1" be an unsigned compile-time constant,
let "acc" be an unsigned accumulator, and let 0 <= d <= 9 be a digit.
Then checking for overflow when computing 10*acc + d is easy because of
the following two theorems:

[1]     If (acc <= max) then 10*acc + d <= ULLONG_MAX
[2]     If (acc > max) then 10*acc + d > -LLONG_MIN > LLONG_MAX

Proof:

[1] Suppose acc <= max. Then:
        10*acc + d      <= 10*max + d
                        =  10*(floor(2^63/10) + 1) + d
                        <= 10*(2^63/10 + 1) + d
                        =  10*2^63/10 + 10 + d
                        =  2^63 + 10 + d
                        <= 2^63 + 19
                        <= ULLONG_MAX

[2] Suppose acc > max. Then:
        -LLONG_MIN      = 2^63
                        = 10*(2^63/10)
                        < 10*(floor(2^63/10) + 1)
                        = 10*max
                        < 10*acc
                        <= 10*acc + d

The meaning of [1] is that if (acc <= max) then 10*acc + d will not
overflow an unsigned buffer; the meaning of [2] is that if (acc > max)
then you're guaranteed to overflow the buffer so there's no need to
bother reading the digit. Thus we can loop through the digit-string "s"
with the following loop:

        unsigned long long acc = 0;
        do {
                char c = *s++;
                if (!isdigit(c))        return ERROR_INVALID;
                else if (acc > max)     return ERROR_OVERFLOW;
                acc = 10*acc + (c - '0');
        } while (*++s);
        puts("acc did not overflow");

There are five possibilities:
1. If the string is empty, this returns ERROR_INVALID.
2. If the string has non-digit characters, this returns ERROR_INVALID.
3. If the number > ULLONG_MAX, this returns ERROR_OVERFLOW.
4. If the number <= -LLONG_MIN, this prints "acc did not overflow."
5. If -LLONG_MIN <= number <= ULLONG_MAX then the loop may either
   return ERROR_OVERFLOW or successfully print "acc did not overflow."
   The underlying number should count as having overflowed because
   neither it nor its negative can fit in a long long, but we will
   handle this case in the next step.

The next thing we do is correct the sign. If "neg" is a boolean which
tells whether a minus sign preceded the digit string, the following
code will jam our unsigned quantity "acc" into the signed variable "ll"
and test for overflow. Note that -LLONG_MIN == LLONG_MIN but that this
code handles that case correctly whether neg is true or false:

        unsigned long long acc = <from above>
        long long ll;
        if (neg) {
                ll = -acc;
                if (ll > 0) return ERROR_TOOSMALL;
        } else {
                ll = acc;
                if (ll < 0) return ERROR_TOOLARGE;
        }
        puts("ll did not overflow!");

If we're still here, then ll has not been truncated, so we can compare
against the user-defined minval and maxval:

        if (ll < minval)        return ERROR_TOOSMALL;
        else if (ll > maxval)   return ERROR_TOOLARGE;
        else                    return SUCCESS;

That's basically it. I'll mention that similar logic can be used in
strtol and strtoll. If we redo the proofs of [1] and [2] with an
arbitrary base in place of the number 10 we find that the proof of
property [2] doesn't change, and the only questionable part of the
proof of property [1] is the last two steps, which rely on the
assumption that:

        -INT_MIN + base + d <= UINT_MAX

We know that d < base, and on a two's complement machine (which I
believe is required by POSIX) this will hold whenever:

        -INT_MIN + base + d <= UINT_MAX
        -INT_MIN + base + (base-1) <= UINT_MAX
        -INT_MIN + 2*base - 1 <= UINT_MAX
        2*base <= (UINT_MAX + 1) + INT_MIN
        2*base <= 2^w - 2^(w-1)
        2*base <= 2^(w-1)
        base <= 2^(w-2)

The strto* family of functions support only base <= 36, and the C
standard requires a word size of w >= 16 for integers, so we have
plenty of space between -INT_MIN and UINT_MAX:

        base <= 36 <= 16384 = 2^(16 - 2) <= 2^(w-2)

That said, because this trick makes use of the maybe-overflow-maybe-not
values between -INT_MIN and UINT_MAX, it cannot be applied to the
unsigned variants strtoul and strtoull.

A complete implementation is provided below, along with a test harness
to catch issues around the critical boundaries of INT_MIN and INT_MAX.
I also added some assertions to the main loop to prove dynamically that
the computation of 10*acc + d doesn't overflow; obviously you would
remove these once you're comfortable that the approach is valid.

Regards,
Anthony Coulter


=== strtonum.c ===

#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long long strtonum2(const char *s, long long minval, long long maxval,
    const char **errstrp)
{
        enum { INVALID = 0, TOOSMALL = 1, TOOLARGE = 2 };
        static const struct errval {
                const char *errstr;
                int err;
        } ev[3] = {
                { "invalid",    EINVAL },
                { "too small",  ERANGE },
                { "too large",  ERANGE },
        };

        int error;
        if (minval > maxval) {
                error = INVALID;
                goto error;
        }

        while (isspace(*s))
                s++;

        bool neg;
        if (*s == '-') {
                neg = true;
                s++;
        } else {
                neg = false;
                if (*s == '+')
                        s++;
        }

        // The cast to (unsigned long long) avoids undefined behavior
        // corresponding to the signed integer overflow that takes place
        // when computing -LLONG_MIN. Strictly speaking, once we cast it
        // there's no need to negate the result, but I include the minus
        // sign for clarity and also because it's evaluated at compile
        // time.
        unsigned long long max = (-(unsigned long long)LLONG_MIN) / 10 + 1;
        unsigned long long acc = 0;
        do {
                if (!isdigit(*s)) {
                        error = INVALID;
                        goto error;
                } else if (acc > max) {
                        error = neg ? TOOSMALL : TOOLARGE;
                        goto error;
                }

                /*********************************/
                //
                // Let's make sure we don't overflow the accumulator.
                //
                // An imperfect test:
                assert(acc <= 10*acc + (*s - '0'));
                // A perfect test:
                unsigned long long product, dummy;
                assert(!__builtin_umulll_overflow(10, acc, &product));
                assert(!__builtin_uaddll_overflow(product, *s, &dummy));
                /*********************************/

                acc = 10*acc + (*s - '0');
        } while (*++s != '\0');

        long long ll;
        if (neg) {
                ll = -acc;
                if (ll > 0) {
                        error = TOOSMALL;
                        goto error;
                }
        } else {
                ll = acc;
                if (ll < 0) {
                        error = TOOLARGE;
                        goto error;
                }
        }

        if (ll < minval) {
                error = TOOSMALL;
                goto error;
        } else if (ll > maxval) {
                error = TOOLARGE;
                goto error;
        }

        if (errstrp != NULL)
                *errstrp = NULL;
        return ll;

 error:
        errno = ev[error].err;
        if (errstrp != NULL)
                *errstrp = ev[error].errstr;
        return 0;
}

static bool test(const char *s, long long minval, long long maxval)
{
        const char *err1, *err2;
        int errno1, errno2;
        errno = 1234;
        long long a = strtonum(s, minval, maxval, &err1);
        errno1 = errno;
        errno = 1234;
        long long b = strtonum2(s, minval, maxval, &err2);
        errno2 = errno;

        // The strings might be NULL
        bool stringsmatch = (!err1 && !err2 || err1 && err2 && strcmp(err1, 
err2) == 0);

        if (a != b || errno1 != errno2 || !stringsmatch) {
                printf("%d%d%d\n", a == b, errno1==errno2, stringsmatch);
                printf("Them: %lld/%d/%s\n", a, errno1, err1 ? err1 : "(null)");
                printf("Me  : %lld/%d/%s\n", b, errno2, err2 ? err2 : "(null)");
        }
        return a == b && errno1 == errno2 && stringsmatch;
}

int main(int argc, char *argv[])
{
        static const char *nums[] = {
                "", // Oops, almost forgot to test the empty string
                "0",
                "1",
                "00000000000000000000000000123",
                "     -789",
                "    \t\n+666",
                "-+12",
                "+-34",

                "9223372036854775797@",
                "9223372036854775798",  // 2^63 - 10
                "9223372036854775799",
                "9223372036854775800",
                "9223372036854775801",
                "9223372036854775802",
                "9223372036854775803",
                "9223372036854775804",
                "9223372036854775805",
                "9223372036854775806",
                "9223372036854775807",
                "9223372036854775808",  // 2^63
                "9223372036854775809",
                "9223372036854775810",
                "9223372036854775811",
                "9223372036854775812",
                "9223372036854775813",
                "9223372036854775814",
                "9223372036854775815",
                "9223372036854775816",
                "9223372036854775817",
                "9223372036854775818",  // 2^63 + 10
                "9223372036854775819@",
                "-9223372036854775797@",
                "-9223372036854775798", // -(2^63 - 10)
                "-9223372036854775799",
                "-9223372036854775800",
                "-9223372036854775801",
                "-9223372036854775802",
                "-9223372036854775803",
                "-9223372036854775804",
                "-9223372036854775805",
                "-9223372036854775806",
                "-9223372036854775807",
                "-9223372036854775808", // -2^63
                "-9223372036854775809",
                "-9223372036854775810",
                "-9223372036854775811",
                "-9223372036854775812",
                "-9223372036854775813",
                "-9223372036854775814",
                "-9223372036854775815",
                "-9223372036854775816",
                "-9223372036854775817",
                "-9223372036854775818", // -(2^63 + 10)
                "-9223372036854775819@",

                // Let's also test numbers around 2^64.
                "18446744073709551606", // 2^64 - 10
                "18446744073709551607",
                "18446744073709551608",
                "18446744073709551609",
                "18446744073709551610",
                "18446744073709551611",
                "18446744073709551612",
                "18446744073709551613",
                "18446744073709551614",
                "18446744073709551615",
                "18446744073709551616", // 2^64
                "18446744073709551617",
                "18446744073709551618",
                "18446744073709551619",
                "18446744073709551620",
                "18446744073709551621",
                "18446744073709551622",
                "18446744073709551623",
                "18446744073709551624",
                "18446744073709551625",
                "18446744073709551626", // 2^64 + 10
                "18446744073709551615",
        };

        unsigned long long bounds[] = {
                LLONG_MIN,
                LLONG_MIN+3,
                LLONG_MIN+7,
                -5,
                0,
                12,
                LLONG_MAX-8,
                LLONG_MAX-2,
                LLONG_MAX,
        };

        for (size_t i = 0; i < sizeof nums/sizeof nums[0]; i++)
        for (size_t j = 0; j < sizeof bounds/sizeof bounds[0]; j++)
        for (size_t k = 0; k < sizeof bounds/sizeof bounds[0]; k++) {
                if (!test(nums[i], bounds[j], bounds[k]))
                        printf("Test failure for %s [%lld, %lld]\n", nums[i],
                                bounds[j], bounds[k]);
        }

        return 0;
}

Simplified strtonum(3)

Reply via email to