atrosinenko added a comment.

//Some general context:// The final goal was to have an explanation of why this
particular number of iterations (3, 4 or 5, depending on the type) is enough for
any `a` and `b` passed as input arguments, taking into account the errors due to
the particular finite-precision computations. Initially, I was trying to just
"mechanically" unify the three implementations and their comments, like this
<https://github.com/access-softek/llvm-project/pull/24>. After trying for a
while, it turned out I could not simply pick up the original proof and add the
subnormal case: some of the statements were too vague, others did not seem
explained well enough, etc. Then an attempt was made to re-prove it from
scratch, with the intention of making the resulting explanation as
self-contained as possible. The **implementation**, on the other hand, gathers
various features of the three original functions plus some hacks that make it
easier to prove.



================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:99
+  //   0 < x0(b) < 1
+  //   abs(x0(b) - 1/b) <= 3/4 - 1/sqrt(2)
+
----------------
sepavloff wrote:
> This estimation is absent from the original comment. Do you have reference 
> where it came from? Also the original comment states `This is accurate to 
> about 3.5 binary digits`. Is still true? If yes, it could be worth copying 
> here.
This approximation was deduced by writing down the derivative of `f` "in
infinite precision" and finding its root. Then the values of `f` at that root
and at the endpoints 1.0 and 2.0 were calculated -- as far as I remember, **all
of them** were `3/4 - 1/sqrt(2)` or its negation. //This is probably what
"minimax polynomial" means; that term was just copied from the original
implementation :)//.
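
For reference, here is a minimal standalone check of that bound (not part of the
patch), assuming the initial estimate has the linear form
`x0(b) = 3/4 + 1/sqrt(2) - b/2` for `b` in `[1, 2)`, i.e. `C - b/2` with the
constant `C` mentioned further down in this review:

```
#include <math.h>
#include <stdio.h>

// f(b) = x0(b) - 1/b; f'(b) = -1/2 + 1/b^2 vanishes at b = sqrt(2), so |f| on
// [1, 2] is maximal at b = 1, sqrt(2) or 2 -- and all three values come out as
// +/-(3/4 - 1/sqrt(2)) ~= 0.0429.
static double f(double b) { return 0.75 + 1.0 / sqrt(2.0) - b / 2.0 - 1.0 / b; }

int main(void) {
  printf("f(1)       = % .17f\n", f(1.0));
  printf("f(sqrt(2)) = % .17f\n", f(sqrt(2.0)));
  printf("f(2)       = % .17f\n", f(2.0));
  printf("bound      = % .17f\n", 0.75 - 1.0 / sqrt(2.0));
  return 0;
}
```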


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:101-102
+
+  // Then, refine the reciprocal estimate using a Newton-Raphson iteration:
+  //     x_{n+1} = x_n * (2 - x_n * b)
+  // Let e_n := x_n - 1/b_hw
----------------
sepavloff wrote:
> The original comment states:
> ```
>   // This doubles the number of correct binary digits in the approximation
>   // with each iteration.
> ```
> It is true in this implementation? If yes, it could be worth copying here.
To me this looks too vague. It is probably //approximately true//, but I don't
know how exactly it should be interpreted once the fixed-point rounding errors
are taken into account.
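
In exact arithmetic, at least, the claim can be made precise: from
`x_{n+1} = x_n * (2 - x_n * b)` and `e_n := x_n - 1/b` one gets
`e_{n+1} = -b * e_n^2`, so the error is squared (up to the factor `b < 2`) and
the number of correct binary digits roughly doubles per step. A small
standalone demo of this in plain `double` arithmetic (not the fixed-point code
itself):

```
#include <math.h>
#include <stdio.h>

int main(void) {
  const double b = 1.7;                         // any b in [1, 2)
  double x = 0.75 + 1.0 / sqrt(2.0) - b / 2.0;  // initial estimate, as above
  for (int n = 0; n < 4; ++n) {
    double e = x - 1.0 / b;                     // e_n
    printf("n = %d: e_n = % .3e (~%4.1f correct bits)\n", n, e, -log2(fabs(e)));
    x = x * (2.0 - x * b);                      // Newton-Raphson step
  }
  return 0;
}
```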


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:109
+
+#if NUMBER_OF_HALF_ITERATIONS > 0
+  // Starting with (n-1) half-width iterations
----------------
sepavloff wrote:
> It is good optimization. Could you please put a comment shortly describing 
> the idea of using half-sized temporaries?
The idea is just "I guess this takes less CPU time, and I have managed to prove
error bounds for it". :) Specifically, for float128 the rep_t * rep_t
multiplication is emulated with lots of CPU instructions, while the lower half
of the operands contains only noise at that point. This particular optimization
did exist in the original implementation for float64 and float128; for float32
it did not make much sense, I guess. Still, the estimations were calculated for
the case of float32 with half-size iterations, as it may be useful for MSP430
and other 16-bit targets.
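
A rough sketch of what a half-width step buys for float128 (illustrative only;
the types and shift amounts mirror the snippet quoted further down, assuming
`HW == 64` and that `__uint128_t` is available):

```
#include <stdint.h>
#include <stdio.h>

typedef __uint128_t rep_t;    // significand container for float128
typedef uint64_t half_rep_t;  // its upper half, i.e. HW == 64

// One half-width Newton-Raphson step: two 64x64->128 widening multiplies,
// which 64-bit targets have in hardware. A full-width step would need
// 128x128-bit products instead, which the compiler must emulate with several
// multiplies and carry handling, even though the low 64 bits of x hold no
// accurate digits during the early iterations.
static half_rep_t half_step(half_rep_t x_UQ0_hw, half_rep_t b_UQ1_hw) {
  half_rep_t corr_UQ1_hw =
      0 /* = 2 */ - (half_rep_t)((rep_t)x_UQ0_hw * b_UQ1_hw >> 64);
  return (half_rep_t)((rep_t)x_UQ0_hw * corr_UQ1_hw >> 63);
}

int main(void) {
  const double b = 1.7;
  half_rep_t b_UQ1_hw = (half_rep_t)(b * 0x1p63);    // b in UQ1.63
  half_rep_t x_UQ0_hw = (half_rep_t)(0.5 * 0x1p64);  // crude guess, demo only
  for (int i = 0; i < 5; ++i)
    x_UQ0_hw = half_step(x_UQ0_hw, b_UQ1_hw);
  printf("x = %.17f, 1/b = %.17f\n", (double)x_UQ0_hw * 0x1p-64, 1.0 / b);
  return 0;
}
```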


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:114
+  // C is (3/4 + 1/sqrt(2)) - 1 truncated to W0 fractional bits as UQ0.HW
+#if defined(SINGLE_PRECISION)
+  const half_rep_t C_hw = HALF_REP_C(0x7504) << (HW - 16);
----------------
sepavloff wrote:
> In what cases 16-bit temporaries are used?  `NUMBER_OF_HALF_ITERATIONS` is 
> set to zero in `divsf3.c`.
Agree, this needs to be re-evaluated, and at least a comment should be added.
This may be dead code for now; it was //expected// to speed things up on 16-bit
targets, which sometimes even lack hardware multiplication (such as MSP430).


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:130
+    // Cannot overflow by construction and is larger than 2 - b*x by at most 1*Ulp.
+    half_rep_t corr_UQ1_hw = 0 /* = 2 */ - ((rep_t)x_UQ0_hw * b_UQ1_hw >> HW);
+    // Would be at most [1.]00000 after overflow if rounding (0 - x_UQ0_hw * b_UQ1_hw) down.
----------------
sepavloff wrote:
> It would be better to put short comment to explain using 0 instead of 2.
Agree, it was expected to be something like `/* = 2.0 in UQ1.(HW-1) */`: the
value 2.0 does not fit into UQ1.(HW-1) -- it is 2^HW, which wraps around to 0 in
an HW-bit register -- so subtracting from 0 still yields the correct result
modulo 2^HW. Naming things is especially painful here...
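
A tiny standalone illustration of that wrap-around (using `uint16_t` in place
of `half_rep_t`, i.e. `HW == 16`; the particular value is made up):

```
#include <stdint.h>
#include <stdio.h>

int main(void) {
  // 2.0 in UQ1.15 would be 2 * 2^15 == 2^16, which does not fit into 16 bits
  // and wraps around to 0, so "0 - t" computes (2.0 - t) modulo 2^16 for free.
  uint16_t t = 0x8123;                // some x*b product, slightly above 1.0 in UQ1.15
  uint16_t corr = (uint16_t)(0u - t); // 0x7edd, i.e. 2.0 - t in UQ1.15
  printf("corr = %#x\n", corr);
  return 0;
}
```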


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:184
+  // x_UQ0 * b_UQ1 = (x_UQ0_hw * 2^HW) * (b_UQ1_hw * 2^HW + blo) - b_UQ1
+  rep_t corr_UQ1 = 0 - (   (rep_t)x_UQ0_hw * b_UQ1_hw
+                        + ((rep_t)x_UQ0_hw * blo >> HW)
----------------
sepavloff wrote:
> `x_UQ0_hw` and `b_UQ1_hw` are declared inside the conditional block `#if 
> NUMBER_OF_HALF_ITERATIONS > 0`. Does `NUMBER_OF_FULL_ITERATIONS != 1` always 
> imply `NUMBER_OF_HALF_ITERATIONS > 0 `?

Hmm... It should imply `NUMBER_OF_HALF_ITERATIONS == 0`, at first glance...
Generally, the total number of iterations should be 3 for f32, 4 for f64 and 5
for f128, and the error bounds are calculated for those counts. There are
essentially only two modes: n-1 half-width iterations followed by one
full-width iteration, OR n full-width iterations (one generally gains no
performance from using 16x16-bit multiplications on the one hand, and that
particular case turned out to require extra rounding on the other).
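
For illustration, the two modes correspond to macro configurations along these
lines in the per-format `divXf3.c` wrappers (a sketch only -- the exact settings
in the actual wrappers may differ):

```
// Mode 1: (n-1) half-width iterations + 1 full-width one, e.g. a float64-style
// configuration with 4 iterations in total:
#define NUMBER_OF_HALF_ITERATIONS 3
#define NUMBER_OF_FULL_ITERATIONS 1

// Mode 2: n full-width iterations and no half-width ones, e.g. a float32-style
// configuration with 3 iterations in total (divsf3.c sets
// NUMBER_OF_HALF_ITERATIONS to 0, as noted in the question above):
// #define NUMBER_OF_HALF_ITERATIONS 0
// #define NUMBER_OF_FULL_ITERATIONS 3
```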


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85031/new/

https://reviews.llvm.org/D85031
