Re: [HACKERS] Bug in numeric multiplication
Dean Rasheed writes: > On 24 November 2015 at 16:20, Tom Lane wrote: >> ... None of these effects are going to let the final div[qi+1] value >> get to more than two or three times NBASE squared, which is still >> an order of magnitude less than INT_MAX. > Right. In fact I think div[qi+1] is even more constrained than that (see > below). Thanks for the additional analysis and testing. >> I think all this deserves some code comments, but I'm not sure whether >> we need to bother with casting to int64. Thoughts? > I think probably just a comment is OK. I'm inclined to agree. The only reason I'd worry at all is that the compiler guys are getting ever more aggressive about claiming that any integer overflow can result in arbitrarily-broken behavior. However, we already have a lot of other code that depends on -fwrapv or equivalent overflow behavior, so I doubt that this is the first place to worry about it. The underlying hardware is certainly going to behave as we wish, and it's pretty hard to see how a compiler could "optimize" the assembly code in a way that would break such a simple calculation. I'll go add some comments and call it good. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
On 24 November 2015 at 16:20, Tom Lane wrote: > After further thought I've pretty well convinced myself that there is > indeed no observable bug, at least as long as you assume that overflow > within the multiplication will behave as stated above. The proof goes > like this: > > We already know that the divisor-subtraction step cannot cause an overflow > (modulo the question of div[qi] possibly exceeding the maxdiv limit, > which I'll get back to). So what is at stake is just the possibility of > overflow in the final update that transfers leftover digits from div[qi] > to div[qi+1]. > > To analyze that, consider just the first two dividend and divisor terms in > the qdigit calculation, that is during each loop iteration approximate > qdigit as > qdigit ~= trunc((div[qi]*NBASE + div[qi+1]) / > (var2digits[0]*NBASE + var2digits[1])) > This holds whether or not we do a carry propagation step. Now write > what the divisor-subtraction step updates div[qi] to as > div[qi]' = div[qi] - qdigit * var2digits[0] > and the updated div[qi+1] as > div[qi+1]' = div[qi+1] - qdigit * var2digits[1] > So, if we temporarily disregard the possibility of overflow within the > final calculation > div[qi+1]'' = div[qi+1]' + div[qi]'*NBASE > we can see that div[qi+1]'' is going to be > = div[qi+1]' + div[qi]'*NBASE > = div[qi+1] - qdigit * var2digits[1] + > (div[qi] - qdigit * var2digits[0])*NBASE > = div[qi]*NBASE + div[qi+1] - > qdigit * (var2digits[0]*NBASE + var2digits[1]) > Comparing that to the approximate value of qdigit, we can see that > what we've got here is a modulo calculation, and the value of div[qi+1]'' > is going to cancel to zero except for a remainder modulo > var2digits[0]*NBASE + var2digits[1], which of course must fall between 0 > and NBASE*NBASE+NBASE (since the truncation is towards minus infinity). > Yes that makes sense. > Now of course this is only an approximation. The main source of error is > that we've omitted the lower-order dividend terms. Including div[qi+2] > could change the result by at most about INT_MAX/NBASE (which is the most > it could add to the numerator in the qdigit expression above, which will > propagate more or less linearly to div[qi+1]''). Similarly, adding > div[qi+3] could add at most INT_MAX/NBASE/NBASE. So we aren't anywhere > near the overflow threshold. Including the lower-order divisor terms > could change the value of qdigit by a multiplicative factor of no more > than about 1/NBASE. Lastly, roundoff error in the floating-point part > of the calculation could cause qdigit to be off by one count either > way. None of these effects are going to let the final div[qi+1] value > get to more than two or three times NBASE squared, which is still > an order of magnitude less than INT_MAX. > Right. In fact I think div[qi+1] is even more constrained than that (see below). > Therefore, the final div[qi+1] value cannot overflow an int, and even > though it might be larger than what maxdiv would suggest, it's not going > to be large enough to cause overflow in the next loop iteration either. > It obviously won't cause overflow during carry propagation, and as for > the next divisor-subtraction step, we can do an analysis similar to the > above but approximating qdigit with just these terms: > qdigit ~= trunc((div[qi] + div[qi+1]/NBASE) / var2digits[0]) > Plugging that into > div[qi]' = div[qi] - qdigit * var2digits[0] > shows that the updated div[qi] in any loop iteration is going to be just > about -div[qi+1]/NBASE, plus a truncation term that's between 0 and > var2digits[0], plus lower-order terms that aren't going to get you very > much past INT_MAX/NBASE. So div[qi]' is never an overflow hazard in any > loop iteration. > > In short, it's impossible for the end result of the div[qi+1] update > calculation to overflow, or even to get large enough to create a problem > in the next loop iteration. However, the intermediate result > div[qi]*NBASE can overflow as per the example I showed before. > Agreed. I tried analysing this in a different way, by considering the maximum possible error in the floating point computations. fdividend takes up to 4 digits from the dividend: fdividend = div[qi] * NBASE^3 + div[qi+1] * NBASE^2 + div[qi+2] * NBASE + div[qi+3] and the digits being discarded are limited to a little less than +/-INT_MAX, so the error in fdividend due to discarding digits is at most +/-INT_MAX/NBASE. fdivisor is computed in a similar way, except that the var2 digits are in the range [0,NBASE-1], so fdivisor is in the range [NBASE^3, NBASE^4-1]. Therefore the maximum error in the computation of fquotient is +/-INT_MAX/NBASE^4, which is less than 1. So the most this can do is to cause a single integer boundary to be crossed, and therefore qdigit will be within +/-1 of the correct result. That's consistent with the observation tha
Re: [HACKERS] Bug in numeric multiplication
I wrote: > Note that div[qi+1], and indeed all the remaining dividend digits, are > large negative values. This is what you'd expect if the carry propagation > step hasn't run for awhile, which is a precondition for div[qi] being > large enough to cause an issue. When we compute 218943 * 1, we will > indeed get an overflow, and the result will wrap around to some large > negative value (2^32 less than it should be). Then we will add that to > div[qi+1], and we'll get *another* overflow, wrapping what nominally > should have been a negative sum around to a positive value (2^32 more than > it should be). So the two overflows cancel and we get exactly the correct > new value of div[qi+1]. > I do not know whether it's possible to devise a test case where you don't > get offsetting overflows. It may be that there's no operational bug here. > Still, the code is surely not behaving per face value. After further thought I've pretty well convinced myself that there is indeed no observable bug, at least as long as you assume that overflow within the multiplication will behave as stated above. The proof goes like this: We already know that the divisor-subtraction step cannot cause an overflow (modulo the question of div[qi] possibly exceeding the maxdiv limit, which I'll get back to). So what is at stake is just the possibility of overflow in the final update that transfers leftover digits from div[qi] to div[qi+1]. To analyze that, consider just the first two dividend and divisor terms in the qdigit calculation, that is during each loop iteration approximate qdigit as qdigit ~= trunc((div[qi]*NBASE + div[qi+1]) / (var2digits[0]*NBASE + var2digits[1])) This holds whether or not we do a carry propagation step. Now write what the divisor-subtraction step updates div[qi] to as div[qi]' = div[qi] - qdigit * var2digits[0] and the updated div[qi+1] as div[qi+1]' = div[qi+1] - qdigit * var2digits[1] So, if we temporarily disregard the possibility of overflow within the final calculation div[qi+1]'' = div[qi+1]' + div[qi]'*NBASE we can see that div[qi+1]'' is going to be = div[qi+1]' + div[qi]'*NBASE = div[qi+1] - qdigit * var2digits[1] + (div[qi] - qdigit * var2digits[0])*NBASE = div[qi]*NBASE + div[qi+1] - qdigit * (var2digits[0]*NBASE + var2digits[1]) Comparing that to the approximate value of qdigit, we can see that what we've got here is a modulo calculation, and the value of div[qi+1]'' is going to cancel to zero except for a remainder modulo var2digits[0]*NBASE + var2digits[1], which of course must fall between 0 and NBASE*NBASE+NBASE (since the truncation is towards minus infinity). Now of course this is only an approximation. The main source of error is that we've omitted the lower-order dividend terms. Including div[qi+2] could change the result by at most about INT_MAX/NBASE (which is the most it could add to the numerator in the qdigit expression above, which will propagate more or less linearly to div[qi+1]''). Similarly, adding div[qi+3] could add at most INT_MAX/NBASE/NBASE. So we aren't anywhere near the overflow threshold. Including the lower-order divisor terms could change the value of qdigit by a multiplicative factor of no more than about 1/NBASE. Lastly, roundoff error in the floating-point part of the calculation could cause qdigit to be off by one count either way. None of these effects are going to let the final div[qi+1] value get to more than two or three times NBASE squared, which is still an order of magnitude less than INT_MAX. Therefore, the final div[qi+1] value cannot overflow an int, and even though it might be larger than what maxdiv would suggest, it's not going to be large enough to cause overflow in the next loop iteration either. It obviously won't cause overflow during carry propagation, and as for the next divisor-subtraction step, we can do an analysis similar to the above but approximating qdigit with just these terms: qdigit ~= trunc((div[qi] + div[qi+1]/NBASE) / var2digits[0]) Plugging that into div[qi]' = div[qi] - qdigit * var2digits[0] shows that the updated div[qi] in any loop iteration is going to be just about -div[qi+1]/NBASE, plus a truncation term that's between 0 and var2digits[0], plus lower-order terms that aren't going to get you very much past INT_MAX/NBASE. So div[qi]' is never an overflow hazard in any loop iteration. In short, it's impossible for the end result of the div[qi+1] update calculation to overflow, or even to get large enough to create a problem in the next loop iteration. However, the intermediate result div[qi]*NBASE can overflow as per the example I showed before. We could just ignore this on the grounds that it doesn't matter given sane behavior of integer arithmetic. Or, if we wanted to be more paranoid, we could do the multiplication-and-addition in int64 arithmetic; but that would probably slow things d
Re: [HACKERS] Bug in numeric multiplication
Dean Rasheed writes: > On 18 November 2015 at 22:19, Tom Lane wrote: >> Still, this proves that we are onto a real problem. > Agreed. I had a go at turning that example into something using > log(base, num) so that the result would be visible in a pure SQL test, > but I didn't have any luck. I experimented with that idea some, and found a test case that would trigger the Assert during log_var's final div_var_fast call: select log( exp(.5050012831598249352382434889825483159817) , exp(.0099)); However, the emitted answer was the same with or without my proposed patch. So I dug deeper by putting in some debug printf's, and found that the overflow occurs at this point: div[81] will overflow: div[qi] = 218943 qdigit = 7673 maxdiv = 210375 div[]: 0001 -001 5049 9871 6840 1750 6476 1756 5110 1745 1684 0183 0860 9567 3786 7660 2671 2843 5974 6515 4013 7886 0978 7230 5372 8162 7077 2013 6185 3746 3095 8528 1595 3071 7541 7920 7950 218943 -2103498897 -2103499629 -2103499629 -2103499629 -2103504578 Note that div[qi+1], and indeed all the remaining dividend digits, are large negative values. This is what you'd expect if the carry propagation step hasn't run for awhile, which is a precondition for div[qi] being large enough to cause an issue. When we compute 218943 * 1, we will indeed get an overflow, and the result will wrap around to some large negative value (2^32 less than it should be). Then we will add that to div[qi+1], and we'll get *another* overflow, wrapping what nominally should have been a negative sum around to a positive value (2^32 more than it should be). So the two overflows cancel and we get exactly the correct new value of div[qi+1]. I do not know whether it's possible to devise a test case where you don't get offsetting overflows. It may be that there's no operational bug here. Still, the code is surely not behaving per face value. > I had a go at trying to find a simpler approach and came up with the > attached, which computes the value intended for div[qi+1] near the top > of the loop (using doubles) so that it can detect if it will overflow, > and if so it enters the normalisation block. It still needs some work > to prove that the normalised value for fnextdigit won't overflow, but > it feels like a promising, easier-to-follow approach to me. What do > you think? I'll look at this later ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
On 18 November 2015 at 22:19, Tom Lane wrote: > A bit of progress on this: I've found a test case, namely > > select sqrt(99... > Wow. > Still, this proves that we are onto a real problem. > Agreed. I had a go at turning that example into something using log(base, num) so that the result would be visible in a pure SQL test, but I didn't have any luck. >> Another issue here is that with outercarry added into the qdigit >> computation, it's no longer clear what the bounds of qdigit itself are, > > I concluded that that particular issue is a red herring: qdigit should > always be a fairly accurate estimate of the next output digit, so it > cannot fall very far outside the range 0..NBASE-1. Testing confirms that > the range seen in the regression tests is exactly -1 to 1, which is > what you'd expect since there can be some roundoff error. > Right, I came to the same conclusion. > Also, after further thought I've been able to construct an argument why > outercarry stays bounded. See what you think of the comments in the > attached patch, which incorporates your ideas about postponing the div[qi] > calculation. I think the logic is sound, but I worry that that comment is going to be very difficult to follow. I had a go at trying to find a simpler approach and came up with the attached, which computes the value intended for div[qi+1] near the top of the loop (using doubles) so that it can detect if it will overflow, and if so it enters the normalisation block. It still needs some work to prove that the normalised value for fnextdigit won't overflow, but it feels like a promising, easier-to-follow approach to me. What do you think? Regards, Dean diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c new file mode 100644 index dcdc5cf..02146f7 *** a/src/backend/utils/adt/numeric.c --- b/src/backend/utils/adt/numeric.c *** div_var_fast(NumericVar *var1, NumericVa *** 6186,6192 double fdividend, fdivisor, fdivisorinverse, ! fquotient; int qi; int i; --- 6186,6193 double fdividend, fdivisor, fdivisorinverse, ! fquotient, ! fnextdigit; int qi; int i; *** div_var_fast(NumericVar *var1, NumericVa *** 6274,6279 --- 6275,6284 * To avoid overflow in maxdiv itself, it represents the max absolute * value divided by NBASE-1, ie, at the top of the loop it is known that * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1). + * + * Note that maxdiv only includes digit positions that are still part of + * the dividend, excluding the first dividend digit, ie, strictly speaking + * the above holds only for div[i] where i > qi, at the top of the loop. */ maxdiv = 1; *** div_var_fast(NumericVar *var1, NumericVa *** 6295,6371 qdigit = (fquotient >= 0.0) ? ((int) fquotient) : (((int) fquotient) - 1); /* truncate towards -infinity */ ! if (qdigit != 0) { ! /* Do we need to normalize now? */ ! maxdiv += Abs(qdigit); ! if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1)) { ! /* Yes, do it */ ! carry = 0; ! for (i = div_ndigits; i > qi; i--) { ! newdig = div[i] + carry; ! if (newdig < 0) ! { ! carry = -((-newdig - 1) / NBASE) - 1; ! newdig -= carry * NBASE; ! } ! else if (newdig >= NBASE) ! { ! carry = newdig / NBASE; ! newdig -= carry * NBASE; ! } ! else ! carry = 0; ! div[i] = newdig; } ! newdig = div[qi] + carry; ! div[qi] = newdig; ! ! /* ! * All the div[] digits except possibly div[qi] are now in the ! * range 0..NBASE-1. ! */ ! maxdiv = Abs(newdig) / (NBASE - 1); ! maxdiv = Max(maxdiv, 1); ! ! /* ! * Recompute the quotient digit since new info may have ! * propagated into the top four dividend digits ! */ ! fdividend = (double) div[qi]; ! for (i = 1; i < 4; i++) { ! fdividend *= NBASE; ! if (qi + i <= div_ndigits) ! fdividend += (double) div[qi + i]; } ! /* Compute the (approximate) quotient digit */ ! fquotient = fdividend * fdivisorinverse; ! qdigit = (fquotient >= 0.0) ? ((int) fquotient) : ! (((int) fquotient) - 1); /* truncate towards -infinity */ ! maxdiv += Abs(qdigit); } ! /* Subtract off the appropriate multiple of the divisor */ ! if (qdigit != 0) ! { ! int istop = Min(var2ndigits, div_ndigits - qi + 1); ! for (i = 0; i < istop; i++) ! div[qi + i] -= qdigit * var2digits[i]; } } /* ! * The dividend digit we are about to replace might still be nonzero. ! * Fold it into the next digit position. We don't need to worry about ! * overflow here since this should nearly cancel with the subtraction ! * of the divisor. */ ! div[qi + 1] += div[qi] * NBASE; div[qi] = qdigit; } --- 6300,6393
Re: [HACKERS] Bug in numeric multiplication
I wrote: > I'm kind of stuck on that too. I did some experimentation by tracking > maximum values of outercarry in the regression tests (including > numeric_big) and did not see it get larger than a couple hundred thousand, > ie more or less INT_MAX/NBASE. But I don't have an argument as to why > that would be the limit. A bit of progress on this: I've found a test case, namely select sqrt(.00); If one inserts no-overflow Asserts into div_var_fast, this case will trip them, specifically this: /* * The dividend digit we are about to replace might still be nonzero. * Fold it into the next digit position. We don't need to worry about * overflow here since this should nearly cancel with the subtraction * of the divisor. */ + Assert(Abs(div[qi]) <= INT_MAX/NBASE); div[qi + 1] += div[qi] * NBASE; Unfortunately, there isn't any SQL-visible misbehavior in this example, because the loop in sqrt_var is more or less self-correcting for minor errors (and this example produces bogus results only in very low-order digits). Most of the other calls of div_var_fast give it inputs that are even harder to control than sqrt_var's, so finding something that did produce visibly wrong results might take a whole lot of trial and error. Still, this proves that we are onto a real problem. > Another issue here is that with outercarry added into the qdigit > computation, it's no longer clear what the bounds of qdigit itself are, I concluded that that particular issue is a red herring: qdigit should always be a fairly accurate estimate of the next output digit, so it cannot fall very far outside the range 0..NBASE-1. Testing confirms that the range seen in the regression tests is exactly -1 to 1, which is what you'd expect since there can be some roundoff error. Also, after further thought I've been able to construct an argument why outercarry stays bounded. See what you think of the comments in the attached patch, which incorporates your ideas about postponing the div[qi] calculation. regards, tom lane diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index dcdc5cf..2c6a3f9 100644 *** a/src/backend/utils/adt/numeric.c --- b/src/backend/utils/adt/numeric.c *** div_var_fast(NumericVar *var1, NumericVa *** 6186,6192 double fdividend, fdivisor, fdivisorinverse, ! fquotient; int qi; int i; --- 6186,6193 double fdividend, fdivisor, fdivisorinverse, ! fquotient, ! outercarry; int qi; int i; *** div_var_fast(NumericVar *var1, NumericVa *** 6274,6289 * To avoid overflow in maxdiv itself, it represents the max absolute * value divided by NBASE-1, ie, at the top of the loop it is known that * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1). */ maxdiv = 1; /* ! * Outer loop computes next quotient digit, which will go into div[qi] */ for (qi = 0; qi < div_ndigits; qi++) { /* Approximate the current dividend value */ ! fdividend = (double) div[qi]; for (i = 1; i < 4; i++) { fdividend *= NBASE; --- 6275,6297 * To avoid overflow in maxdiv itself, it represents the max absolute * value divided by NBASE-1, ie, at the top of the loop it is known that * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1). + * + * Note that maxdiv only includes digit positions that are still part of + * the dividend, ie, strictly speaking the above holds only for div[i] + * where i >= qi, at the top of the loop. */ maxdiv = 1; /* ! * Outer loop computes next quotient digit, which will go into div[qi]. ! * "outercarry" represents high-order dividend digits carried across ! * iterations. */ + outercarry = 0; for (qi = 0; qi < div_ndigits; qi++) { /* Approximate the current dividend value */ ! fdividend = outercarry * NBASE + (double) div[qi]; for (i = 1; i < 4; i++) { fdividend *= NBASE; *** d
Re: [HACKERS] Bug in numeric multiplication
Dean Rasheed writes: > On 17 November 2015 at 23:57, Tom Lane wrote: >> I'm not sure that it's provably okay though. The loose ends are to show >> that div[qi] can't overflow an int during the divisor-subtraction step >> and that "outercarry" remains bounded. Testing suggests that outercarry >> can't exceed INT_MAX/NBASE, but I don't see how to prove that. > At the top of the loop we know that > Abs(div[qi]) <= maxdiv * (NBASE-1) ><= INT_MAX - INT_MAX/NBASE - 1 > Then we add Abs(qdigit) to maxdiv which potentially triggers a > normalisation step. That step adds carry to div[qi], and we know that > Abs(carry) <= INT_MAX/NBASE + 1 > so the result is that Abs(div[qi]) <= INT_MAX -- i.e., it can't > overflow at that point, but it could be on the verge of doing so. Right. > Then the divisor-subtraction step does > div[qi] -= qdigit * var2digits[0] > That looks as though it could overflow, although actually I would > expect qdigit to have the same sign as div[qi], so that this would > drive div[qi] towards zero. But if outercarry is nonzero, then qdigit is going to be primarily driven by outercarry not div[qi], so I'm afraid it's mistaken to rely on it having the right sign to cause cancellation. > If you didn't want to rely on that though, > you could take advantage of the fact that this new value of div[qi] is > only needed for outercarry, so you could modify the > divisor-subtraction step to skip div[qi]: > and fold the most significant digit of the divisor-subtraction step > into the computation of outercarry Cute idea. Another thought I'd had is that we could change the carry propagation loop limits so that div[qi] is normalized along with the other digits. Then we'd have a carry leftover at the end of the loop, but we could add it to outercarry. That makes the argument that div[qi] does not overflow the same as for the other digits. However, I kind of like your idea of postponing the div[qi] subtraction to the outercarry update, because then it's very direct to see that that update should result in near-total cancellation. > so outercarry ought to be driven towards zero, as the comment says. It > certainly doesn't look like it will grow without bounds, but I haven't > attempted to work out what its bounds actually are. I'm kind of stuck on that too. I did some experimentation by tracking maximum values of outercarry in the regression tests (including numeric_big) and did not see it get larger than a couple hundred thousand, ie more or less INT_MAX/NBASE. But I don't have an argument as to why that would be the limit. Another issue here is that with outercarry added into the qdigit computation, it's no longer clear what the bounds of qdigit itself are, so is it really safe to assume that "div[qi + i] -= qdigit * var2digits[i]" can't overflow? Stated in other terms, why are we sure that the new maxdiv value computed at the bottom of the carry propagation stanza is within the safe range? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
On 17 November 2015 at 23:57, Tom Lane wrote: > After thinking some more about what this is doing, it seems to me that > we could avoid changing div[qi + 1] at all here, and instead deal with > leftover dividend digits by shoving them into the floating-point part of > the calculation. All that we really need to have happen as a consequence > of the above is to inject an additional "div[qi] * NBASE" term into future > calculations of fdividend. So we can do that out-of-band and avoid > mucking up the maxdiv invariant. > That makes sense. > Also, I realized that this bit: > > /* > * All the div[] digits except possibly div[qi] are now in the > * range 0..NBASE-1. > */ > maxdiv = Abs(newdig) / (NBASE - 1); > maxdiv = Max(maxdiv, 1); > > is unduly conservative, because, while indeed div[qi] might be out of > range, there is no need to consider it in maxdiv anymore, because the new > maxdiv value only determines what will happen in future loop iterations > where the current div[qi] will be irrelevant. So we should just reset > maxdiv to 1 unconditionally here. > OK, makes sense too. > So that led me to the attached patch, which passes regression tests fine. I'd feel better about that if the regression tests actually did something that stressed this, but coming up with such a test is non-trivial. > I'm not sure that it's provably okay though. The loose ends are to show > that div[qi] can't overflow an int during the divisor-subtraction step > and that "outercarry" remains bounded. Testing suggests that outercarry > can't exceed INT_MAX/NBASE, but I don't see how to prove that. > At the top of the loop we know that Abs(div[qi]) <= maxdiv * (NBASE-1) <= INT_MAX - INT_MAX/NBASE - 1 Then we add Abs(qdigit) to maxdiv which potentially triggers a normalisation step. That step adds carry to div[qi], and we know that Abs(carry) <= INT_MAX/NBASE + 1 so the result is that Abs(div[qi]) <= INT_MAX -- i.e., it can't overflow at that point, but it could be on the verge of doing so. Then the divisor-subtraction step does div[qi] -= qdigit * var2digits[0] That looks as though it could overflow, although actually I would expect qdigit to have the same sign as div[qi], so that this would drive div[qi] towards zero. If you didn't want to rely on that though, you could take advantage of the fact that this new value of div[qi] is only needed for outercarry, so you could modify the divisor-subtraction step to skip div[qi]: for (i = 1; i < istop; i++) div[qi + i] -= qdigit * var2digits[i]; and fold the most significant digit of the divisor-subtraction step into the computation of outercarry so that there is definitely no possibility of integer overflow: outercarry = outercarry * NBASE + (double) div[qi] - (double) qdigit * var2digits[0]; It's still difficult to prove that outercarry remains bounded, but to a first approximation qdigit ~= ( outercarry * NBASE + (double) div[qi] ) / var2digits[0] so outercarry ought to be driven towards zero, as the comment says. It certainly doesn't look like it will grow without bounds, but I haven't attempted to work out what its bounds actually are. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
I wrote: > I pushed this, but while looking at it, my attention was drawn to this > bit down near the end of the loop: > /* > * The dividend digit we are about to replace might still be nonzero. > * Fold it into the next digit position. We don't need to worry about > * overflow here since this should nearly cancel with the subtraction > * of the divisor. > */ > div[qi + 1] += div[qi] * NBASE; > In the first place, I'm not feeling especially convinced by that handwave > about there being no risk of overflow. In the second place, this is > indisputably failing to consider whether maxdiv might need to increase. > If I add > Assert(Abs(div[qi + 1]) <= (maxdiv * (NBASE-1))); > right after this, the regression tests blow up. So it looks to me like > we've got some more work to do. After thinking some more about what this is doing, it seems to me that we could avoid changing div[qi + 1] at all here, and instead deal with leftover dividend digits by shoving them into the floating-point part of the calculation. All that we really need to have happen as a consequence of the above is to inject an additional "div[qi] * NBASE" term into future calculations of fdividend. So we can do that out-of-band and avoid mucking up the maxdiv invariant. Also, I realized that this bit: /* * All the div[] digits except possibly div[qi] are now in the * range 0..NBASE-1. */ maxdiv = Abs(newdig) / (NBASE - 1); maxdiv = Max(maxdiv, 1); is unduly conservative, because, while indeed div[qi] might be out of range, there is no need to consider it in maxdiv anymore, because the new maxdiv value only determines what will happen in future loop iterations where the current div[qi] will be irrelevant. So we should just reset maxdiv to 1 unconditionally here. So that led me to the attached patch, which passes regression tests fine. I'm not sure that it's provably okay though. The loose ends are to show that div[qi] can't overflow an int during the divisor-subtraction step and that "outercarry" remains bounded. Testing suggests that outercarry can't exceed INT_MAX/NBASE, but I don't see how to prove that. regards, tom lane diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index dcdc5cf..03a646b 100644 *** a/src/backend/utils/adt/numeric.c --- b/src/backend/utils/adt/numeric.c *** div_var_fast(NumericVar *var1, NumericVa *** 6186,6192 double fdividend, fdivisor, fdivisorinverse, ! fquotient; int qi; int i; --- 6186,6193 double fdividend, fdivisor, fdivisorinverse, ! fquotient, ! outercarry; int qi; int i; *** div_var_fast(NumericVar *var1, NumericVa *** 6274,6289 * To avoid overflow in maxdiv itself, it represents the max absolute * value divided by NBASE-1, ie, at the top of the loop it is known that * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1). */ maxdiv = 1; /* ! * Outer loop computes next quotient digit, which will go into div[qi] */ for (qi = 0; qi < div_ndigits; qi++) { /* Approximate the current dividend value */ ! fdividend = (double) div[qi]; for (i = 1; i < 4; i++) { fdividend *= NBASE; --- 6275,6297 * To avoid overflow in maxdiv itself, it represents the max absolute * value divided by NBASE-1, ie, at the top of the loop it is known that * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1). + * + * Note that maxdiv only includes digit positions that are still part of + * the dividend, ie, strictly speaking the above holds only for div[i] + * where i >= qi, at the top of the loop. */ maxdiv = 1; /* ! * Outer loop computes next quotient digit, which will go into div[qi]. ! * "outercarry" represents high-order dividend digits carried across ! * iterations. */ + outercarry = 0; for (qi = 0; qi < div_ndigits; qi++) { /* Approximate the current dividend value */ ! fdividend = outercarry * NBASE + (double) div[qi]; for (i = 1; i < 4; i++) { fdividend *= NBASE; *** div_var_fast(NumericVar *var1, NumericVa *** 6320,6340 carry = 0; div[i] = newdig; } ! newdig = div[qi] + carry; ! div[qi] = newdig; /* * All the div[] digits except possibly div[qi] are now in the ! * range 0..NBASE-1. */ ! maxdiv = Abs(newdig) / (NBASE - 1); ! maxdiv = Max(maxdiv, 1); /* * Recompute the quotient digit since new info may have * propagated into the top four dividend digits */ ! fdividend = (double) div[qi]; for (i = 1; i < 4; i++) { fdividend *= NBASE; --- 6328,6347 carry = 0; div[i] = newdig; } ! d
Re: [HACKERS] Bug in numeric multiplication
Dean Rasheed writes: > On 17 November 2015 at 14:43, Tom Lane wrote: >> Hm, good point. I don't feel a compulsion to have a test case that >> proves it's broken before we fix it. Do you want to send a patch? > OK, here it is. It's slightly different from mul_var, because maxdiv > is tracking absolute values and the carry may be positive or negative, > and it's absolute value may be as high as INT_MAX / NBASE + 1 (when > it's negative), but otherwise the principle is the same. I pushed this, but while looking at it, my attention was drawn to this bit down near the end of the loop: /* * The dividend digit we are about to replace might still be nonzero. * Fold it into the next digit position. We don't need to worry about * overflow here since this should nearly cancel with the subtraction * of the divisor. */ div[qi + 1] += div[qi] * NBASE; In the first place, I'm not feeling especially convinced by that handwave about there being no risk of overflow. In the second place, this is indisputably failing to consider whether maxdiv might need to increase. If I add Assert(Abs(div[qi + 1]) <= (maxdiv * (NBASE-1))); right after this, the regression tests blow up. So it looks to me like we've got some more work to do. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
On 17 November 2015 at 14:43, Tom Lane wrote: > Dean Rasheed writes: >> I just noticed that div_var_fast() has almost identical code, and so >> in principle it has the same vulnerability, although it obviously only >> affects the transcendental functions. >> I don't actually have a test case that triggers it, but it's basically >> the same algorithm, so logically it needs the same additional headroom >> to avoid a possible overflow. > > Hm, good point. I don't feel a compulsion to have a test case that > proves it's broken before we fix it. Do you want to send a patch? > OK, here it is. It's slightly different from mul_var, because maxdiv is tracking absolute values and the carry may be positive or negative, and it's absolute value may be as high as INT_MAX / NBASE + 1 (when it's negative), but otherwise the principle is the same. Regards, Dean diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c new file mode 100644 index fe9f3f7..eaafcd9 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -6266,8 +6266,15 @@ div_var_fast(NumericVar *var1, NumericVa /* * maxdiv tracks the maximum possible absolute value of any div[] entry; * when this threatens to exceed INT_MAX, we take the time to propagate - * carries. To avoid overflow in maxdiv itself, it actually represents - * the max possible abs. value divided by NBASE-1. + * carries. Furthermore, we need to ensure that overflow doesn't occur + * during the carry propagation passes either. The carry values may have + * an absolute value as high as INT_MAX/NBASE + 1, so really we must + * normalize when digits threaten to exceed INT_MAX - INT_MAX/NBASE - 1. + * + * To avoid overflow in maxdiv itself, it actually represents the max + * possible absolute value divided by NBASE-1, ie, at the top of the loop + * it is known that no dig[] entry has an absolute value exceeding + * maxdiv * (NBASE-1). */ maxdiv = 1; @@ -6293,7 +6300,7 @@ div_var_fast(NumericVar *var1, NumericVa { /* Do we need to normalize now? */ maxdiv += Abs(qdigit); - if (maxdiv > INT_MAX / (NBASE - 1)) + if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1)) { /* Yes, do it */ carry = 0; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
Dean Rasheed writes: > I just noticed that div_var_fast() has almost identical code, and so > in principle it has the same vulnerability, although it obviously only > affects the transcendental functions. > I don't actually have a test case that triggers it, but it's basically > the same algorithm, so logically it needs the same additional headroom > to avoid a possible overflow. Hm, good point. I don't feel a compulsion to have a test case that proves it's broken before we fix it. Do you want to send a patch? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
On 21 September 2015 at 17:14, Tom Lane wrote: > Dean Rasheed writes: >> On 21 September 2015 at 16:09, Tom Lane wrote: >>> After trying to rework the comment to explain what maxdig really meant >>> after your changes, I came to the conclusion that it'd be better to do >>> it as per attached. Does this look sane to you? > >> Yes that looks better. It's still the same amount of extra headroom >> (21), but I think it's clearer your way. > > OK, pushed (after further hacking on the comment ...) > > regards, tom lane I just noticed that div_var_fast() has almost identical code, and so in principle it has the same vulnerability, although it obviously only affects the transcendental functions. I don't actually have a test case that triggers it, but it's basically the same algorithm, so logically it needs the same additional headroom to avoid a possible overflow. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
Dean Rasheed writes: > On 21 September 2015 at 16:09, Tom Lane wrote: >> After trying to rework the comment to explain what maxdig really meant >> after your changes, I came to the conclusion that it'd be better to do >> it as per attached. Does this look sane to you? > Yes that looks better. It's still the same amount of extra headroom > (21), but I think it's clearer your way. OK, pushed (after further hacking on the comment ...) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
On 21 September 2015 at 16:09, Tom Lane wrote: > I wrote: >> Dean Rasheed writes: >>> The problem then arises in the final carry propagation pass. During >>> this phase of the computation, the carry from one digit (which can be >>> a shade under INT_MAX / NBASE) is added to the next digit, and that's >>> where the overflow happens. > >> Nice catch! I think the comment could use a little more work, but I'll >> adjust it and push. > > After trying to rework the comment to explain what maxdig really meant > after your changes, I came to the conclusion that it'd be better to do > it as per attached. Does this look sane to you? > Yes that looks better. It's still the same amount of extra headroom (21), but I think it's clearer your way. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in numeric multiplication
I wrote: > Dean Rasheed writes: >> The problem then arises in the final carry propagation pass. During >> this phase of the computation, the carry from one digit (which can be >> a shade under INT_MAX / NBASE) is added to the next digit, and that's >> where the overflow happens. > Nice catch! I think the comment could use a little more work, but I'll > adjust it and push. After trying to rework the comment to explain what maxdig really meant after your changes, I came to the conclusion that it'd be better to do it as per attached. Does this look sane to you? regards, tom lane diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index 1bfa29e..d403554 100644 *** a/src/backend/utils/adt/numeric.c --- b/src/backend/utils/adt/numeric.c *** mul_var(NumericVar *var1, NumericVar *va *** 5789,5796 * to avoid normalizing carries immediately. * * maxdig tracks the maximum possible value of any dig[] entry; when this ! * threatens to exceed INT_MAX, we take the time to propagate carries. To ! * avoid overflow in maxdig itself, it actually represents the max * possible value divided by NBASE-1. */ dig = (int *) palloc0(res_ndigits * sizeof(int)); --- 5789,5801 * to avoid normalizing carries immediately. * * maxdig tracks the maximum possible value of any dig[] entry; when this ! * threatens to exceed INT_MAX, we take the time to propagate carries. ! * Furthermore, we need to ensure that overflow doesn't occur during the ! * carry propagation pass below either. The carry value could be as much ! * as INT_MAX/NBASE, so really we should normalize when digits threaten to ! * exceed INT_MAX - INT_MAX/NBASE. ! * ! * To avoid overflow in maxdig itself, it actually represents the max * possible value divided by NBASE-1. */ dig = (int *) palloc0(res_ndigits * sizeof(int)); *** mul_var(NumericVar *var1, NumericVar *va *** 5806,5812 /* Time to normalize? */ maxdig += var1digit; ! if (maxdig > INT_MAX / (NBASE - 1)) { /* Yes, do it */ carry = 0; --- 5811,5817 /* Time to normalize? */ maxdig += var1digit; ! if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1)) { /* Yes, do it */ carry = 0; diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out index e6ee548..c1886fd 100644 *** a/src/test/regress/expected/numeric.out --- b/src/test/regress/expected/numeric.out *** SELECT * FROM num_input_test; *** 1334,1339 --- 1334,1366 (7 rows) -- + -- Test some corner cases for multiplication + -- + select 4790 * ; + ?column? + -- + 4790999852090001 + (1 row) + + select 4789 * ; + ?column? + -- + 47885211 + (1 row) + + select 4770 * ; + ?column? + -- + 4770999852290001 + (1 row) + + select 476
Re: [HACKERS] Bug in numeric multiplication
Dean Rasheed writes: > The problem then arises in the final carry propagation pass. During > this phase of the computation, the carry from one digit (which can be > a shade under INT_MAX / NBASE) is added to the next digit, and that's > where the overflow happens. Nice catch! I think the comment could use a little more work, but I'll adjust it and push. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Bug in numeric multiplication
Hi, By chance, while testing the nearby numeric log/exp/pow patch, I came across the following case which generates an initially puzzling looking error on HEAD -- (5.6-1e-500) ^ (3.2-1e-200). This computation actually works OK with that other patch, but only by blind luck. It turns out that the underlying problem is a bug in the low-level numeric multiplication function mul_var(). It is possible to trigger it directly with the following carefully crafted inputs: select 4790 * ; Result: 47909978523049530001 That answer is actually incorrect. Tweaking the input a little, it is possible to generate a much more obviously nonsensical result: select 4789 * ; Result: 478999785231+0,*0001 Notice those garbage digits in the middle of the number returned. The problem is that these examples trigger an overflow of the digits in the accumulator array in mul_var(). The number on the left in the first example consists of 21 copies of , preceded by 4790. Those are chosen so that when added together they lead to a value for maxdig in mul_var() of 21* + 4790 = 214769, which is exactly equal to INT_MAX / (NBASE - 1). So this doesn't quite trigger a normalisation of the accumulator array, and leaves several of the digits in that array a little under INT_MAX at the end of the main multiplication loop. The problem then arises in the final carry propagation pass. During this phase of the computation, the carry from one digit (which can be a shade under INT_MAX / NBASE) is added to the next digit, and that's where the overflow happens. To fix that, the initial value for maxdig needs to be made larger to leave headroom for the carry. The largest possible carry is INT_MAX / NBASE, and maxdig is the maximum possible dig value divided by NBASE-1, so maxdig needs to be initialised to (INT_MAX / NBASE) / (NBASE - 1) which is 21 for NBASE = 1. A new corner-case input that doesn't quite trigger an accumulator normalisation is then 4769... The worst case inputs are now values like this for which the sum of a sequence of input digits is INT_MAX / (NBASE - 1) - 21 = 214769 - 21 = 214748. So in the worst case, the accumulator's digits can be up to 214748 * = 2147265252 in the main multiplication loop. Then, during the carry propagation phase (or any of the normalisation phases), the carry can be anything up to INT_MAX / NBASE = 214748. So the maximum value that can be assigned to any individual digit is now 2147265252 + 214748 = 214748, which is now less than INT_MAX. Patch attached. Regards, Dean diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c new file mode 100644 index 1bfa29e..4b39c7a --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -5792,9 +5792,13 @@ mul_var(NumericVar *var1, NumericVar *va * threatens to exceed INT_MAX, we take the time to propagate carries. To * avoid overflow in maxdig itself, it actually represents the max * possible value divided by NBASE-1. + * + * Note that the carry propagation steps may carry as much as INT_MAX/NBASE + * from one digit to the next, so we have to leave headroom for that as + * well. */ dig = (int *) palloc0(res_ndigits * sizeof(int)); - maxdig = 0; + maxdig = (INT_MAX / NBASE) / (NBASE - 1); ri = res_ndigits - 1; for (i1 = var1ndigits - 1; i1 >= 0; ri--, i1--) @@ -5824,7 +5828,7 @@ mul_var(NumericVar *var1, NumericVar *va } Assert(carry == 0); /* Reset maxdig to indicate new worst-case */ - maxdig = 1 + var1digit; + maxdig = 1 + var1digit + (INT_MAX / NBASE) / (NBASE - 1); } /* Add appropriate multiple of var2 into the accumulator */ diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out new file mode 100644 index e6ee548..c1886fd --- a/src/test/regress/expected/numeric.out +++ b/src/test/regress/expected/numeric.out @@ -1334,6 +1334,33 @@ SELECT * FROM num_input_test; (7 rows) -- +-- Test some corner cases for multiplication +-- +select 4790 * ; + ?column?