sepavloff added inline comments.
================ Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:99 + // 0 < x0(b) < 1 + // abs(x0(b) - 1/b) <= 3/4 - 1/sqrt(2) + ---------------- atrosinenko wrote: > sepavloff wrote: > > This estimation is absent from the original comment. Do you have reference > > where it came from? Also the original comment states `This is accurate to > > about 3.5 binary digits`. Is still true? If yes, it could be worth copying > > here. > This approximation was deduced by writing down the derivative of `f ` "in > infinite precision" and finding its root. Then the values of `f` applied to > its root, 1.0 and 2.0 were calculated -- as far as I remember, **all of > them** were `3/4 - 1/sqrt(2)` or negated - //this is what "minimax > polynomial" probably means, that term was just copied from the original > implementation :)//. IIUC, you don't want to put this statement here because you are us not sure it is true? Sounds reasonable. ================ Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:109 + +#if NUMBER_OF_HALF_ITERATIONS > 0 + // Starting with (n-1) half-width iterations ---------------- atrosinenko wrote: > sepavloff wrote: > > It is good optimization. Could you please put a comment shortly describing > > the idea of using half-sized temporaries? > The idea is just "I guess this takes less CPU time and I have managed to > prove error bounds for it". :) Specifically, for float128, the rep_t * rep_t > multiplication will be emulated with lots of CPU instructions while the lower > half contain some noise at that point. This particular optimization did exist > in the original implementation for float64 and float128. For float32 it had > not much sense, I guess. Still, estimations were calculated for the case of > float32 with half-size iterations as it may be useful for MSP430 and other > 16-bit targets. The idea is clear but it require some study of the sources. I would propose to add a comment saying: ``` At the first iterations number of significant digits is small, so we may use shorter type values. Operations on them are usually faster. ``` or something like that. ================ Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:130 + // Cannot overflow by construction and is larger than 2 - b*x by at most 1*Ulp. + half_rep_t corr_UQ1_hw = 0 /* = 2 */ - ((rep_t)x_UQ0_hw * b_UQ1_hw >> HW); + // Would be at most [1.]00000 after overflow if rounding (0 - x_UQ0_hw * b_UQ1_hw) down. ---------------- atrosinenko wrote: > sepavloff wrote: > > It would be better to put short comment to explain using 0 instead of 2. > Agree, it was expected to be something like `/* = 2.0 in UQ1.(HW-1) */`. > Naming things is especially painful here... 2.0 cannot be represented in UQ1.X. I would add comment line like: ``` Due to wrapping 2.0 in UQ1.X is equivalent to 0. ``` or something similar. ================ Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:184 + // x_UQ0 * b_UQ1 = (x_UQ0_hw * 2^HW) * (b_UQ1_hw * 2^HW + blo) - b_UQ1 + rep_t corr_UQ1 = 0 - ( (rep_t)x_UQ0_hw * b_UQ1_hw + + ((rep_t)x_UQ0_hw * blo >> HW) ---------------- atrosinenko wrote: > sepavloff wrote: > > `x_UQ0_hw` and `b_UQ1_hw` are declared inside the conditional block `#if > > NUMBER_OF_HALF_ITERATIONS > 0`. Does `NUMBER_OF_FULL_ITERATIONS != 1` > > always imply `NUMBER_OF_HALF_ITERATIONS > 0 `? > > Does NUMBER_OF_FULL_ITERATIONS != 1 always imply NUMBER_OF_HALF_ITERATIONS > > > 0 ? > > Hmm... It should imply `== 0`, at first glance... Generally, total number of > iterations should be 3 for f32, 4 for f64 and 5 for f128. Then, error bounds > are calculated. There are generally only two modes: n-1 half-size iteration + > 1 full-size iteration OR n full-size iteration (as one generally has no > performance gains from using 16x16-bit multiplications on one hand, and that > particular case turned out to require extra rounding, on the other). I have concern that `x_UQ0_hw` and `x_UQ1_hw` are declared in the block with condition `NUMBER_OF_HALF_ITERATIONS > 0` but uses in the other block with condition `!USE_NATIVE_FULL_ITERATIONS && NUMBER_OF_HALF_ITERATIONS > 0`, so probably there may be a combination of the macros when the variables are used but not declared. Maybe it is impossible due to some reason, in this case a proper check may be put into the block `#ifdef USE_NATIVE_FULL_ITERATIONS` which asserts that `NUMBER_OF_HALF_ITERATIONS > 0`. Otherwise `x_UQ0_hw` and `x_UQ1_hw` need to be moved out of the conditional block. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D85031/new/ https://reviews.llvm.org/D85031 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits