This really isn't BitC related, except that it happens to be annoying me at
the moment and it's a mildly interesting implementation issue.

Summary:

1. There are a lot of ways to be inefficient and/or unintentionally
imprecise when converting back and forth between strings and numbers.
2. There are a lot of ways to invest too much energy in a given problem.

At the moment, I'm working on a tokenizer for a little language that
happens to accept numbers. In the process, I'm trying to make the
tokenization as fast as a generic tokenizer can reasonably be, and since I
seem to write a lot of tokenizers, I'm trying to make the code as reusable
as I reasonably can.

When you build a system that compiles from one language to another, it's
convenient to handle numeric values using the native numeric types of the
compiler language. More so now that more or less every language agrees on
what floating point numbers look like. The tokenizer is logically the wrong
place to do that conversion (I'll say why below), but you're sitting there
looking at the digit characters for validation in any case, so you might as
well do the conversion while you are at it. There are edge cases with
integers that you can't check, arising from twos-complement representation
and unary minus, but you can check everything else. So mainly from force of
habit I was coding that up in the tokenizer.

It also turns out that modern languages all support 64-bit integer types,
and that a 64-bit integer has both more *binary* significant digits and
more *decimal* significant digits than an IEEE binary64 floating point
value (double, to you and me). In consequence, there is a fairly elegant
way to implement both conversions using a partially fused conversion
algorithm.

The "traditional" algorithm for converting a string to a floating point
value goes something like this:

   1. Convert the whole number part, if any, exactly as if it were an
   integer literal.
   2. Now convert the fractional part using the multiplicative technique to
   arrive at the same fraction in base 2 (or any other base) to the required
   number of significant digits.
   3. Finally convert the exponent separately and assemble the resulting
   pieces.
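
To make that concrete, here is roughly what that shape looks like in C++
(the names are made up for illustration, the input is assumed to be already
validated, and the fractional part uses the simple scale-and-accumulate
rendering you see in most hand-rolled tokenizers rather than a careful
digit-by-digit base conversion):

    #include <cctype>
    #include <cmath>
    #include <string>

    // Illustrative only: parses digits[.digits][(e|E)[+-]digits].
    double traditionalConvert(const std::string& s)
    {
      size_t i = 0;

      // Step 1: the whole-number part, converted as if it were an integer.
      double value = 0.0;
      while (i < s.size() && std::isdigit((unsigned char)s[i]))
        value = value * 10.0 + (s[i++] - '0');

      // Step 2: the fractional part, accumulated at successively smaller scales.
      if (i < s.size() && s[i] == '.') {
        ++i;
        double scale = 0.1;
        while (i < s.size() && std::isdigit((unsigned char)s[i])) {
          value += (s[i++] - '0') * scale;
          scale *= 0.1;
        }
      }

      // Step 3: the exponent, converted separately and applied at the end.
      if (i < s.size() && (s[i] == 'e' || s[i] == 'E')) {
        ++i;
        bool negExp = (i < s.size() && s[i] == '-');
        if (i < s.size() && (s[i] == '+' || s[i] == '-')) ++i;
        int exp = 0;
        while (i < s.size() && std::isdigit((unsigned char)s[i]))
          exp = exp * 10 + (s[i++] - '0');
        value *= std::pow(10.0, negExp ? -exp : exp);
      }

      return value;
    }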

But there is a simpler way that I have never seen documented:

   1. Convert the entire significand to a 64-bit unsigned integer, keeping
   track of where the decimal point was found but otherwise ignoring it, as if
   the significand was an integer-valued literal having no decimal point. Take
   care to stop if the integer result would overflow. In the event of
   overflow, the integer value *before* the overflowing computation
   contains all of the needed significant digits for the final floating point
   value.
   2. Convert the decimal exponent in the usual way for converting integer
   literals.
   3. Add or subtract from the resulting converted exponent to account for
   the position of the decimal point.
   4. Convert the resulting integer literal to a floating point value, and
   multiply that by a second floating point value that applies the appropriate
   exponentiation to achieve the final result.

This works because:

   - Changes to a floating exponent value during multiplication do not
   alter the significand, except when they drive the value to +/- zero, in
   which case the correct result is produced.
   - So long as you were careful about unsigned integer overflow, all of
   the significant digits you need are present.
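
Here's a rough C++ sketch of that fused loop (the names and the exact
interface are mine, not anything from the actual tokenizer, and validation
and error handling are elided since the surrounding loop is assumed to be
doing that anyway):

    #include <cctype>
    #include <cmath>
    #include <cstdint>
    #include <string>

    struct NumResult {
      uint64_t significand = 0; // significand digits, decimal point ignored
      bool overflowed = false;  // significand would not fit in 64 bits
      bool sawFraction = false; // a '.' was present
      bool sawExponent = false; // an 'e'/'E' was present
      int exponent = 0;         // decimal exponent, adjusted for the '.'
    };

    // Illustrative only: assumes s matches digits[.digits][(e|E)[+-]digits].
    NumResult fusedConvert(const std::string& s)
    {
      NumResult r;
      size_t i = 0;
      int digitsAfterPoint = 0; // all digits seen after the '.'
      int droppedDigits = 0;    // digits scanned but not accumulated

      // Step 1: accumulate the significand as a u64, ignoring the '.' but
      // remembering where it was.  Stop accumulating (but keep scanning)
      // once another digit would overflow; the value already accumulated
      // holds all the significant digits a binary64 result can use.
      for (; i < s.size(); ++i) {
        char c = s[i];
        if (c == '.') { r.sawFraction = true; continue; }
        if (!std::isdigit((unsigned char)c)) break;
        if (r.sawFraction) digitsAfterPoint++;
        uint64_t d = (uint64_t)(c - '0');
        if (!r.overflowed && r.significand <= (UINT64_MAX - d) / 10) {
          r.significand = r.significand * 10 + d;
        } else {
          r.overflowed = true; // keep the pre-overflow value
          droppedDigits++;     // account for this digit via the exponent
        }
      }

      // Step 2: convert the written exponent like any integer literal.
      int writtenExp = 0;
      if (i < s.size() && (s[i] == 'e' || s[i] == 'E')) {
        r.sawExponent = true;
        ++i;
        bool negExp = (i < s.size() && s[i] == '-');
        if (i < s.size() && (s[i] == '+' || s[i] == '-')) ++i;
        while (i < s.size() && std::isdigit((unsigned char)s[i]))
          writtenExp = writtenExp * 10 + (s[i++] - '0');
        if (negExp) writtenExp = -writtenExp;
      }

      // Step 3: adjust for the position of the decimal point and for any
      // digits that were dropped after overflow.
      r.exponent = writtenExp - digitsAfterPoint + droppedDigits;
      return r;
    }

    // Step 4: if a floating point result is wanted, convert the integer and
    // apply the exponent as a second floating point multiply.
    double toDouble(const NumResult& r)
    {
      return (double)r.significand * std::pow(10.0, (double)r.exponent);
    }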

And note that the unsigned integer you get in the process, provided it did
*not* overflow, is the correct result value for converting an integer-valued
string to a numeric value as well. This means that you can perform the
conversion steps and then choose the result type and the possible treatment
of the exponent according to what you discover when you arrive at the
fractional part or the exponent. Meanwhile, if the u64 conversion result
overflowed, you aren't going to get out a valid integer-typed conversion in
any case, so there's no point converting the digits past the point where
overflow might have occurred. You still need to process them for input
validation (which is being done in this same loop), but not for conversion.
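
Continuing the sketch above (still with made-up names), the result-type
decision at the end of the literal then reduces to something like:

    // Hypothetical token kinds, for illustration.
    enum class TokenKind { IntLit, FloatLit };

    TokenKind classify(const NumResult& r)
    {
      // No '.' and no exponent: r.significand already *is* the integer
      // conversion result, valid so long as r.overflowed is false (if it
      // overflowed, the literal couldn't be a legal u64 integer anyway).
      if (!r.sawFraction && !r.sawExponent)
        return TokenKind::IntLit;

      // Otherwise assemble a floating point value from the fused pieces,
      // e.g. via toDouble(r) above.
      return TokenKind::FloatLit;
    }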

There are several objections to this method:

   - You'll certainly process a few more significant digits than are
   required for floating point conversion. The overhead of this is lower than
   the overhead of iterating over the string-encoded digits twice.
   - The fused method breaks down if the number of effective binary
   floating point significand digits is greater than or equal to the number of
   bits in the integer type. This occurs with all extended precision and
   binary128 floating point values, and with most larger "decimal" values,
   notably including the one specified in C# (which, it turns out, is not a
   CLR type).
   - In the tokenizer, you don't have enough context to know whether the
   literal is preceded by a unary negation, so you can't range check
   twos-complement numbers completely. Notably, you can't tell whether 32768
   is a legal negative twos-complement 16 bit number or requires promotion to
   a positive 16 bit unsigned value because of its magnitude. It could be
   argued that if *some* of the validation has to be deferred to the
   parser, then maybe *all* of the validation should be done there.
   - The fusion is trickier if the language supports arbitrary-precision
   integer literals. One solution is to store a side copy of the interim
   conversion result at the point where you have collected enough digits to
   build the floating point value.
   - Since you can't range check here in any case, it might be better to
   adopt a more universal value format consisting of (signed bignum
   significand, signed exponent, type code), which represents all known
   floating point types and all integer-valued types with more or less equal
   facility for internal purposes.
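
For what it's worth, such a universal format might look something like this
(the bignum representation and all of the names are placeholders, not
anything BitC actually defines):

    #include <cstdint>
    #include <vector>

    // Placeholder arbitrary-precision integer: sign plus magnitude.
    // A real implementation would use an existing bignum library.
    struct BigInt {
      bool negative = false;
      std::vector<uint32_t> magnitude; // little-endian base-2^32 digits
    };

    // Hypothetical universal literal value: one representation that covers
    // every integer-valued and floating point literal, deferring range
    // checks and the choice of concrete representation to the parser or
    // type checker.
    struct LiteralValue {
      BigInt significand;   // signed, arbitrary-precision significand
      int64_t exponent = 0; // signed decimal exponent
      int typeCode = 0;     // which concrete type the literal claims to be
    };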

Taking all of these objections into account, the only one that seems
compelling is the last one. The "large floating point types" case is
inconvenient, but if you keep track of where you stopped considering
significand digits, and later discover when you fall off the end of the
significand string that you needed them, you can handle that as a special
case *after* the fused algorithm has run.

From a storage management perspective, there is a downside to the (signed
bignum significand, exponent, type code) approach, which is that (on its
face) it requires more allocation and triggers marginally more GC. This is
mitigated by the fact that most constants are small-valued integers, and
these can be preallocated.
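
A sketch of what that preallocation might look like, reusing the
hypothetical LiteralValue above (again, every name here is made up):

    #include <array>
    #include <cstdint>
    #include <memory>

    // Preallocated literal values for small non-negative integers, so that
    // the common case allocates nothing per occurrence of the constant.
    class SmallIntCache {
    public:
      SmallIntCache() {
        for (int64_t v = 0; v < (int64_t)cache_.size(); ++v)
          cache_[v] = makeLiteral(v);
      }

      std::shared_ptr<const LiteralValue> get(int64_t v) const {
        if (v >= 0 && v < (int64_t)cache_.size())
          return cache_[v];     // shared, preallocated value
        return makeLiteral(v);  // uncommon case: allocate as usual
      }

    private:
      static std::shared_ptr<LiteralValue> makeLiteral(int64_t v) {
        auto lit = std::make_shared<LiteralValue>();
        lit->significand.negative = (v < 0);
        uint64_t mag = (v < 0) ? (uint64_t)(-(v + 1)) + 1 : (uint64_t)v;
        lit->significand.magnitude.push_back((uint32_t)(mag & 0xffffffffu));
        if (mag >> 32)
          lit->significand.magnitude.push_back((uint32_t)(mag >> 32));
        return lit;
      }

      std::array<std::shared_ptr<const LiteralValue>, 256> cache_;
    };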


In any case, it's a cute piece of completely over-analyzed code fusion. I
thought some of you might find some relief in discovering that I am not
immune to apple polishing.