Hi @tech,
The bottom of this email contains a simplified strtonum implementation.
It should behave exactly the same way as the existing strtonum, but it
doesn't call strtoll, which means we can drop some unnecessary
complexity:
- No base conversion
- No need to save and restore errno (which is a bigger deal than it
sounds like because of the weird mutable array-of-structures that's
used to recover errno in the existing strtonum.c)
- No complicated sign-dependent "cutoff/cutlim" tests like the ones we
see in strtoll.
That last point is most interesting. We're doing our 10*accumulator + d
arithmetic in a 64-bit register, but we can report overflow as soon as
we blow past 2^63 = -LLONG_MIN, which means that we have a bit of
headroom in which we have some wiggle room when we try to determine
whether 10*accumulator + d will cause an integer overflow in the
register itself.
Let "max = (-LLONG_MIN) / 10 + 1" be an unsigned compile-time constant,
let "acc" be an unsigned accumulator, and let 0 <= d <= 9 be a digit.
Then checking for overflow when computing 10*acc + d is easy because of
the following two theorems:
[1] If (acc <= max) then 10*acc + d <= ULLONG_MAX
[2] If (acc > max) then 10*acc + d > -LLONG_MIN > LLONG_MAX
Proof:
[1] Suppose acc <= max. Then:
10*acc + d <= 10*max + d
= 10*(floor(2^63/10) + 1) + d
<= 10*(2^63/10 + 1) + d
= 10*2^63/10 + 10 + d
= 2^63 + 10 + d
<= 2^63 + 19
<= ULLONG_MAX
[2] Suppose acc > max. Then:
-LLONG_MIN = 2^63
= 10*(2^63/10)
< 10*(floor(2^63/10) + 1)
= 10*max
< 10*acc
<= 10*acc + d
The meaning of [1] is that if (acc <= max) then 10*acc + d will not
overflow an unsigned buffer; the meaning of [2] is that if (acc > max)
then you're guaranteed to overflow the buffer so there's no need to
bother reading the digit. Thus we can loop through the digit-string "s"
with the following loop:
unsigned long long acc = 0;
do {
char c = *s++;
if (!isdigit(c)) return ERROR_INVALID;
else if (acc > max) return ERROR_OVERFLOW;
acc = 10*acc + (c - '0');
} while (*++s);
puts("acc did not overflow");
There are five possibilities:
1. If the string is empty, this returns ERROR_INVALID.
2. If the string has non-digit characters, this returns ERROR_INVALID.
3. If the number > ULLONG_MAX, this returns ERROR_OVERFLOW.
4. If the number <= -LLONG_MIN, this prints "acc did not overflow."
5. If -LLONG_MIN <= number <= ULLONG_MAX then the loop may either
return ERROR_OVERFLOW or successfully print "acc did not overflow."
The underlying number should count as having overflowed because
neither it nor its negative can fit in a long long, but we will
handle this case in the next step.
The next thing we do is correct the sign. If "neg" is a boolean which
tells whether a minus sign preceded the digit string, the following
code will jam our unsigned quantity "acc" into the signed variable "ll"
and test for overflow. Note that -LLONG_MIN == LLONG_MIN but that this
code handles that case correctly whether neg is true or false:
unsigned long long acc = <from above>
long long ll;
if (neg) {
ll = -acc;
if (ll > 0) return ERROR_TOOSMALL;
} else {
ll = acc;
if (ll < 0) return ERROR_TOOLARGE;
}
puts("ll did not overflow!");
If we're still here, then ll has not been truncated, so we can compare
against the user-defined minval and maxval:
if (ll < minval) return ERROR_TOOSMALL;
else if (ll > maxval) return ERROR_TOOLARGE;
else return SUCCESS;
That's basically it. I'll mention that similar logic can be used in
strtol and strtoll. If we redo the proofs of [1] and [2] with an
arbitrary base in place of the number 10 we find that the proof of
property [2] doesn't change, and the only questionable part of the
proof of property [1] is the last two steps, which rely on the
assumption that:
-INT_MIN + base + d <= UINT_MAX
We know that d < base, and on a two's complement machine (which I
believe is required by POSIX) this will hold whenever:
-INT_MIN + base + d <= UINT_MAX
-INT_MIN + base + (base-1) <= UINT_MAX
-INT_MIN + 2*base - 1 <= UINT_MAX
2*base <= (UINT_MAX + 1) + INT_MIN
2*base <= 2^w - 2^(w-1)
2*base <= 2^(w-1)
base <= 2^(w-2)
The strto* family of functions support only base <= 36, and the C
standard requires a word size of w >= 16 for integers, so we have
plenty of space between -INT_MIN and UINT_MAX:
base <= 36 <= 16384 = 2^(16 - 2) <= 2^(w-2)
That said, because this trick makes use of the maybe-overflow-maybe-not
values between -INT_MIN and UINT_MAX, it cannot be applied to the
unsigned variants strtoul and strtoull.
A complete implementation is provided below, along with a test harness
to catch issues around the critical boundaries of INT_MIN and INT_MAX.
I also added some assertions to the main loop to prove dynamically that
the computation of 10*acc + d doesn't overflow; obviously you would
remove these once you're comfortable that the approach is valid.
Regards,
Anthony Coulter
=== strtonum.c ===
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
long long strtonum2(const char *s, long long minval, long long maxval,
const char **errstrp)
{
enum { INVALID = 0, TOOSMALL = 1, TOOLARGE = 2 };
static const struct errval {
const char *errstr;
int err;
} ev[3] = {
{ "invalid", EINVAL },
{ "too small", ERANGE },
{ "too large", ERANGE },
};
int error;
if (minval > maxval) {
error = INVALID;
goto error;
}
while (isspace(*s))
s++;
bool neg;
if (*s == '-') {
neg = true;
s++;
} else {
neg = false;
if (*s == '+')
s++;
}
// The cast to (unsigned long long) avoids undefined behavior
// corresponding to the signed integer overflow that takes place
// when computing -LLONG_MIN. Strictly speaking, once we cast it
// there's no need to negate the result, but I include the minus
// sign for clarity and also because it's evaluated at compile
// time.
unsigned long long max = (-(unsigned long long)LLONG_MIN) / 10 + 1;
unsigned long long acc = 0;
do {
if (!isdigit(*s)) {
error = INVALID;
goto error;
} else if (acc > max) {
error = neg ? TOOSMALL : TOOLARGE;
goto error;
}
/*********************************/
//
// Let's make sure we don't overflow the accumulator.
//
// An imperfect test:
assert(acc <= 10*acc + (*s - '0'));
// A perfect test:
unsigned long long product, dummy;
assert(!__builtin_umulll_overflow(10, acc, &product));
assert(!__builtin_uaddll_overflow(product, *s, &dummy));
/*********************************/
acc = 10*acc + (*s - '0');
} while (*++s != '\0');
long long ll;
if (neg) {
ll = -acc;
if (ll > 0) {
error = TOOSMALL;
goto error;
}
} else {
ll = acc;
if (ll < 0) {
error = TOOLARGE;
goto error;
}
}
if (ll < minval) {
error = TOOSMALL;
goto error;
} else if (ll > maxval) {
error = TOOLARGE;
goto error;
}
if (errstrp != NULL)
*errstrp = NULL;
return ll;
error:
errno = ev[error].err;
if (errstrp != NULL)
*errstrp = ev[error].errstr;
return 0;
}
static bool test(const char *s, long long minval, long long maxval)
{
const char *err1, *err2;
int errno1, errno2;
errno = 1234;
long long a = strtonum(s, minval, maxval, &err1);
errno1 = errno;
errno = 1234;
long long b = strtonum2(s, minval, maxval, &err2);
errno2 = errno;
// The strings might be NULL
bool stringsmatch = (!err1 && !err2 || err1 && err2 && strcmp(err1,
err2) == 0);
if (a != b || errno1 != errno2 || !stringsmatch) {
printf("%d%d%d\n", a == b, errno1==errno2, stringsmatch);
printf("Them: %lld/%d/%s\n", a, errno1, err1 ? err1 : "(null)");
printf("Me : %lld/%d/%s\n", b, errno2, err2 ? err2 : "(null)");
}
return a == b && errno1 == errno2 && stringsmatch;
}
int main(int argc, char *argv[])
{
static const char *nums[] = {
"", // Oops, almost forgot to test the empty string
"0",
"1",
"00000000000000000000000000123",
" -789",
" \t\n+666",
"-+12",
"+-34",
"9223372036854775797@",
"9223372036854775798", // 2^63 - 10
"9223372036854775799",
"9223372036854775800",
"9223372036854775801",
"9223372036854775802",
"9223372036854775803",
"9223372036854775804",
"9223372036854775805",
"9223372036854775806",
"9223372036854775807",
"9223372036854775808", // 2^63
"9223372036854775809",
"9223372036854775810",
"9223372036854775811",
"9223372036854775812",
"9223372036854775813",
"9223372036854775814",
"9223372036854775815",
"9223372036854775816",
"9223372036854775817",
"9223372036854775818", // 2^63 + 10
"9223372036854775819@",
"-9223372036854775797@",
"-9223372036854775798", // -(2^63 - 10)
"-9223372036854775799",
"-9223372036854775800",
"-9223372036854775801",
"-9223372036854775802",
"-9223372036854775803",
"-9223372036854775804",
"-9223372036854775805",
"-9223372036854775806",
"-9223372036854775807",
"-9223372036854775808", // -2^63
"-9223372036854775809",
"-9223372036854775810",
"-9223372036854775811",
"-9223372036854775812",
"-9223372036854775813",
"-9223372036854775814",
"-9223372036854775815",
"-9223372036854775816",
"-9223372036854775817",
"-9223372036854775818", // -(2^63 + 10)
"-9223372036854775819@",
// Let's also test numbers around 2^64.
"18446744073709551606", // 2^64 - 10
"18446744073709551607",
"18446744073709551608",
"18446744073709551609",
"18446744073709551610",
"18446744073709551611",
"18446744073709551612",
"18446744073709551613",
"18446744073709551614",
"18446744073709551615",
"18446744073709551616", // 2^64
"18446744073709551617",
"18446744073709551618",
"18446744073709551619",
"18446744073709551620",
"18446744073709551621",
"18446744073709551622",
"18446744073709551623",
"18446744073709551624",
"18446744073709551625",
"18446744073709551626", // 2^64 + 10
"18446744073709551615",
};
unsigned long long bounds[] = {
LLONG_MIN,
LLONG_MIN+3,
LLONG_MIN+7,
-5,
0,
12,
LLONG_MAX-8,
LLONG_MAX-2,
LLONG_MAX,
};
for (size_t i = 0; i < sizeof nums/sizeof nums[0]; i++)
for (size_t j = 0; j < sizeof bounds/sizeof bounds[0]; j++)
for (size_t k = 0; k < sizeof bounds/sizeof bounds[0]; k++) {
if (!test(nums[i], bounds[j], bounds[k]))
printf("Test failure for %s [%lld, %lld]\n", nums[i],
bounds[j], bounds[k]);
}
return 0;
}