This really isn't BitC-related, except that it happens to be annoying me at the moment and it's a mildly interesting implementation issue.
Summary:

1. There are a lot of ways to be inefficient and/or unintentionally imprecise when converting back and forth between strings and numbers.
2. There are a lot of ways to invest too much energy in a given problem.

At the moment, I'm working on a tokenizer for a little language that happens to accept numbers. In the process, I'm trying to make the tokenization as fast as a generic tokenizer can reasonably be, and since I seem to write a lot of tokenizers, I'm trying to make the code as reusable as I reasonably can.

When you build a system that compiles from one language to another, it's convenient to handle numeric values using the native numeric types of the compiler language. More so now that more or less every language agrees on what floating point numbers look like. Doing that conversion in the tokenizer is logically the wrong place to do it (I'll say why below), but you're sitting there looking at the digit characters for validation in any case, so you might as well do the conversion while you are at it. There are edge cases with integers that you can't check, resulting from twos-complement and unary minus, but you can check everything else. So mainly from force of habit I was coding that up in the tokenizer.

It also turns out that modern languages all support 64-bit integer types, and that a 64-bit integer has both more *binary* significant digits and more *decimal* significant digits than an IEEE binary64 floating point value (double, to you and me). In consequence, there is a fairly elegant way to implement both conversions using a partially fused conversion algorithm.

The "traditional" algorithm for converting a string to a floating point value goes something like this:

1. Convert the whole number part, if any, exactly as if it were an integer literal.
2. Now convert the fractional part using the multiplicative technique to arrive at the same fraction in base 2 (or any other base) to the required number of significant digits.
3. Finally, convert the exponent separately and assemble the resulting pieces.

But there is a simpler way that I have never seen documented (sketched in code below):

1. Convert the entire significand to a 64-bit unsigned integer, keeping track of where the decimal point was found but otherwise ignoring it, as if the significand were an integer-valued literal having no decimal point. Take care to stop if the integer result would overflow. In the event of overflow, the integer value *before* the overflowing computation contains all of the needed significant digits for the final floating point value.
2. Convert the decimal exponent in the usual way for converting integer literals.
3. Add or subtract from the resulting converted exponent to account for the position of the decimal point.
4. Convert the resulting integer to a floating point value, and multiply it by a second floating point value that applies the appropriate exponentiation to achieve the final result.

This works because:

- Changes to a floating point value's exponent during multiplication do not alter the significand, except when they drive the value to +/- zero, in which case the correct result is produced.
- So long as you were careful about unsigned integer overflow, all of the significant digits you need are present.

And note that the unsigned integer you get in the process, provided it did *not* overflow, is the correct result value for converting an integer-valued string to a numeric value as well.
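To make the fusion concrete, here is a minimal sketch in C. The literal grammar (digits, optional '.', optional exponent), the NumValue struct, and the scan_number name are all invented for illustration; sign handling, diagnostics, and the bignum/large-float complications discussed below are omitted, and the final pow() multiplication can lose an ulp or so relative to a correctly rounded strtod.

    #include <ctype.h>
    #include <math.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical result record for one numeric literal. */
    struct NumValue {
      bool     isFloat;     /* saw a '.' or an exponent */
      bool     intOverflow; /* u64 accumulation stopped early */
      uint64_t intValue;    /* valid when !isFloat && !intOverflow */
      double   dblValue;    /* always filled in */
    };

    /* Fused conversion sketch: one left-to-right pass accumulates the
       significand into a u64 (stopping, but still validating, on
       overflow), remembers where the decimal point fell, then applies
       the adjusted decimal exponent with a single multiplication.
       No sign handling. */
    static bool scan_number(const char *s, struct NumValue *out) {
      uint64_t sig = 0;
      bool overflow = false, sawPoint = false, sawExp = false, anyDigit = false;
      long scale = 0;   /* power-of-ten correction for the decimal point
                           and for digits dropped after overflow */

      for (; isdigit((unsigned char)*s) || *s == '.'; s++) {
        if (*s == '.') {
          if (sawPoint) return false;        /* second decimal point */
          sawPoint = true;
          continue;
        }
        unsigned d = (unsigned)(*s - '0');
        anyDigit = true;
        if (!overflow && sig > (UINT64_MAX - d) / 10)
          overflow = true;                   /* keep validating, stop converting */
        if (!overflow) {
          sig = sig * 10 + d;
          if (sawPoint) scale--;             /* digit right of the point */
        } else if (!sawPoint) {
          scale++;                           /* dropped digit left of the point */
        }
      }
      if (!anyDigit) return false;

      long exp10 = 0;
      if (*s == 'e' || *s == 'E') {
        sawExp = true;
        s++;
        int expSign = (*s == '-') ? -1 : 1;
        if (*s == '+' || *s == '-') s++;
        if (!isdigit((unsigned char)*s)) return false;
        for (; isdigit((unsigned char)*s); s++)
          exp10 = exp10 * 10 + (*s - '0');
        exp10 *= expSign;
      }
      if (*s != '\0') return false;          /* trailing junk */

      out->isFloat     = sawPoint || sawExp;
      out->intOverflow = overflow;
      out->intValue    = sig;
      /* pow(10, n) is not exact for large |n|, so this can be off by an
         ulp or two; good enough for a sketch. */
      out->dblValue    = (double)sig * pow(10.0, (double)(exp10 + scale));
      return true;
    }

The point is simply that a single pass over the digits yields both the u64 result (when it doesn't overflow) and everything needed to assemble the double, so the tokenizer can defer the choice of result type until it sees whether a decimal point or exponent actually shows up.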
This means that you can perform the conversion steps and then choose the result type and the possible treatment of the exponent according to what you discover when you arrive at the fractional part or the exponent. Meanwhile, if the u64 conversion result overflowed, you aren't going to get a valid integer-typed conversion in any case, so there's no point converting the digits past the point where overflow might have occurred. You still need to process them for input validation (which is being done in this same loop), but not for conversion.

There are several objections to this method:

- You'll certainly process a few more significant digits than are required for floating point conversion. The overhead of this is lower than the overhead of iterating over the string-encoded digits twice.
- The fused method breaks down if the number of effective binary floating point significand digits is greater than or equal to the number of bits in the integer type. This occurs with all extended precision and binary128 floating point values, and with most larger "decimal" values, notably including the one specified in C# (which, it turns out, is not a CLR type).
- In the tokenizer, you don't have enough context to know whether the literal is preceded by a unary negation, so you can't range check twos-complement numbers completely. Notably, you can't tell whether 32768 is a legal negative twos-complement 16 bit number or requires promotion to a positive 16 bit unsigned value because of its magnitude. It could be argued that if *some* of the validation has to be deferred to the parser, then maybe *all* of the validation should be done there.
- The fusion is trickier if the language supports arbitrary-precision integer literals. One solution is to store a side copy of the interim conversion result at the point where you have collected enough digits to build the floating point value.
- Since you can't range check here in any case, it might be better to adopt a more universal value format consisting of (signed bignum significand, signed exponent, type code), which represents all known floating point types and all integer-valued types with more or less equal facility for internal purposes.

Taking all of these objections into account, the only one that seems compelling is the last one. The "large floating point types" case is inconvenient, but if you keep track of where you stopped considering significand digits, and later discover when you fall off the end of the significand string that you needed them, you can handle that as a special case *after* the fused algorithm has run.

From a storage management perspective, there is a downside to the (signed bignum significand, exponent, type code) approach, which is that (on its face) it requires more allocation and triggers marginally more GC. This is mitigated by the fact that most constants are small-valued integers, and these can be preallocated. A rough sketch of that representation appears at the end of this note.

In any case, it's a cute piece of completely over-analyzed code fusion. I thought some of you might find some relief in discovering that I am not immune to apple polishing.
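For concreteness, the (signed bignum significand, signed exponent, type code) representation mentioned above might be shaped roughly like this. The names are invented and the bignum is stubbed as a limb array, so take it as a shape rather than an implementation:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Source-level type code carried with the literal; fixed up by the
       parser once sign and context are known. */
    enum LitType {
      LT_U8, LT_U16, LT_U32, LT_U64,
      LT_S8, LT_S16, LT_S32, LT_S64,
      LT_F32, LT_F64, LT_BIGINT
    };

    /* Universal literal value: significand * 10^exponent plus a type
       code. The "bignum" here is just a little-endian array of 32-bit
       limbs; a real front end would use its own bignum or GMP. */
    struct LitValue {
      bool         negative;   /* sign of the significand */
      size_t       nLimbs;
      uint32_t    *limbs;      /* significand magnitude, base 2^32 */
      int32_t      exponent;   /* decimal exponent */
      enum LitType type;
    };

    /* Most constants in real programs are small non-negative integers,
       so those can be built once up front and shared, which keeps the
       extra allocation (and GC) cost of this representation down. */
    static struct LitValue smallInt[256];   /* preallocated 0..255 */
    static uint32_t smallLimb[256];         /* one limb each */

    static void initSmallInts(void) {
      for (unsigned i = 0; i < 256; i++) {
        smallLimb[i] = i;
        smallInt[i] = (struct LitValue){ false, 1, &smallLimb[i], 0, LT_S32 };
      }
    }

The parser can then do the sign-aware range check (the 32768 case above) against this record and only commit to a machine type once it knows whether a unary minus is present.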
