[PATCH] D85031: [builtins] Unify the softfloat division implementation

Serge Pavlov via Phabricator via cfe-commits Thu, 20 Aug 2020 22:10:38 -0700

sepavloff added inline comments.


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:99
+  //   0 < x0(b) < 1
+  //   abs(x0(b) - 1/b) <= 3/4 - 1/sqrt(2)
+
----------------
atrosinenko wrote:
> sepavloff wrote:
> > This estimation is absent from the original comment. Do you have reference 
> > where it came from? Also the original comment states `This is accurate to 
> > about 3.5 binary digits`. Is still true? If yes, it could be worth copying 
> > here.
> This approximation was deduced by writing down the derivative of `f ` "in 
> infinite precision" and finding its root. Then the values of `f` applied to 
> its root, 1.0 and 2.0 were calculated -- as far as I remember, **all of 
> them** were `3/4 - 1/sqrt(2)` or negated - //this is what "minimax 
> polynomial" probably means, that term was just copied from the original 
> implementation :)//.
IIUC, you don't want to put this statement here because you are us not sure it 
is true? Sounds reasonable.


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:109
+
+#if NUMBER_OF_HALF_ITERATIONS > 0
+  // Starting with (n-1) half-width iterations
----------------
atrosinenko wrote:
> sepavloff wrote:
> > It is good optimization. Could you please put a comment shortly describing 
> > the idea of using half-sized temporaries?
> The idea is just "I guess this takes less CPU time and I have managed to 
> prove error bounds for it". :) Specifically, for float128, the rep_t * rep_t 
> multiplication will be emulated with lots of CPU instructions while the lower 
> half contain some noise at that point. This particular optimization did exist 
> in the original implementation for float64 and float128. For float32 it had 
> not much sense, I guess. Still, estimations were calculated for the case of 
> float32 with half-size iterations as it may be useful for MSP430 and other 
> 16-bit targets.
The idea is clear but it require some study of the sources. I would propose to 
add a comment saying:
```
At the first iterations number of significant digits is small, so we may use 
shorter type values. Operations on them are usually faster.
```
or something like that.


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:130
+    // Cannot overflow by construction and is larger than 2 - b*x by at most 
1*Ulp.
+    half_rep_t corr_UQ1_hw = 0 /* = 2 */ - ((rep_t)x_UQ0_hw * b_UQ1_hw >> HW);
+    // Would be at most [1.]00000 after overflow if rounding (0 - x_UQ0_hw * 
b_UQ1_hw) down.
----------------
atrosinenko wrote:
> sepavloff wrote:
> > It would be better to put short comment to explain using 0 instead of 2.
> Agree, it was expected to be something like `/* = 2.0 in UQ1.(HW-1) */`. 
> Naming things is especially painful here...
2.0 cannot be represented in UQ1.X. I would add comment line like:
```
Due to wrapping 2.0 in UQ1.X is equivalent to 0.
```
or something similar.


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:184
+  // x_UQ0 * b_UQ1 = (x_UQ0_hw * 2^HW) * (b_UQ1_hw * 2^HW + blo) - b_UQ1
+  rep_t corr_UQ1 = 0 - (   (rep_t)x_UQ0_hw * b_UQ1_hw
+                        + ((rep_t)x_UQ0_hw * blo >> HW)
----------------
atrosinenko wrote:
> sepavloff wrote:
> > `x_UQ0_hw` and `b_UQ1_hw` are declared inside the conditional block `#if 
> > NUMBER_OF_HALF_ITERATIONS > 0`. Does `NUMBER_OF_FULL_ITERATIONS != 1` 
> > always imply `NUMBER_OF_HALF_ITERATIONS > 0 `?
> > Does NUMBER_OF_FULL_ITERATIONS != 1 always imply NUMBER_OF_HALF_ITERATIONS 
> > > 0 ?
> 
> Hmm... It should imply `== 0`, at first glance... Generally, total number of 
> iterations should be 3 for f32, 4 for f64 and 5 for f128. Then, error bounds 
> are calculated. There are generally only two modes: n-1 half-size iteration + 
> 1 full-size iteration OR n full-size iteration (as one generally has no 
> performance gains from using 16x16-bit multiplications on one hand, and that 
> particular case turned out to require extra rounding, on the other).
I have concern that `x_UQ0_hw` and `x_UQ1_hw` are declared in the block with 
condition `NUMBER_OF_HALF_ITERATIONS > 0` but uses in the other block with 
condition `!USE_NATIVE_FULL_ITERATIONS && NUMBER_OF_HALF_ITERATIONS > 0`, so 
probably there may be a combination of the macros when the variables are used 
but not declared. Maybe it is impossible due to some reason, in this case a 
proper check may be put into the block `#ifdef USE_NATIVE_FULL_ITERATIONS` 
which asserts that `NUMBER_OF_HALF_ITERATIONS > 0`. Otherwise `x_UQ0_hw` and 
`x_UQ1_hw` need to be moved out of the conditional block.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85031/new/

https://reviews.llvm.org/D85031

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D85031: [builtins] Unify the softfloat division implementation

Reply via email to