[clang] [llvm] [AMDGPU] Use 1-iteration Newton-Raphson refinement for FP32 reciprocal. (PR #194716)

Mikolaj Adam Kowalski via cfe-commits Wed, 17 Jun 2026 01:53:35 -0700

Mikolaj-A-Kowalski wrote:

Hello!

First I would like to apologise for barging in into a PR completely unrelated
to me but recently I have been doing some work on the optimisation of the
reciprocal and division on the Nvidia hardware and I couldn’t help but to be
fascinated by this work :-)

I have played a bit with the final version of the `rcp` developed here and I
think we can improve it slightly. Basically at the cost of one extra `fma` we
can make it correctly rounded for all inputs. Hopefully it will not be too
painful for the performance since we can remove some instructions related to
scaling. Perhaps it will also help with the 1ULP annotation problem since with
the changes the implementation appears to correctly rounded.

I am totally unfamiliar with LLVM (sadly... yet...) so I wasn't really able to
use the tooling here and had to rely on julia with which I am more familiar to.
I have attached the code:
[code.zip](https://github.com/user-attachments/files/29038343/code.zip)

So the first optimisation we can do is to change how we scale the input. Since
the 'normal' FP values also go through scaling, there is no reason to use
factor of 1.0 for them. Any other power of 2 would work. Hence we don't need to
split the range into three parts, but two. In the attached code I basically
shifted all values with the absolute value below 1.0 by 2^30 and above 2^(-30)
(With the power being quite arbitrary, it just need to be sufficiently large
to pull all subnormals into the 'normal' binades and the numbers from the top
of the range away from the underflow). This should save us one comparison and 1
conditional move.

To remove the 1ULP error we need to do two things. First we need to change when
we apply the correction iteration to get the correct rounding. We can do it
after scaling the number back. If we do that we can reduce the number of +ve
floating-point numbers with 1ULP error to 5.

The second thing to do in order to remove the error is to which from the
quadratic (Newton) iteration to the cubic iteration. This costs additional
`fma` but in my tests it seemed to correctly round the remaining edge-cases.

At the end, we end up with the assembly (produced by julia with AMDGPU package)
along the lines of:
```
v_mov_b32_e32 v0, 0x4e800000
v_mov_b32_e32 v1, 0x30800000
v_mov_b32_e32 v2, 0
s_waitcnt lgkmcnt(0)
s_load_dword s2, s[0:1], 0x0
s_waitcnt lgkmcnt(0)
v_cmp_nle_f32_e64 vcc, |s2|, 1.0
s_nop 1
v_cndmask_b32_e32 v0, v0, v1, vcc
v_mul_f32_e32 v1, s2, v0
v_rcp_f32_e32 v1, v1
s_nop 0
v_mul_f32_e32 v0, v0, v1
v_fma_f32 v1, v0, -s2, 1.0
v_fmac_f32_e32 v1, v1, v1
v_fmac_f32_e32 v0, v1, v0
v_div_fixup_f32 v0, v0, s2, 1.0
```

I don't have any benchmarks on hand so I cannot say if it is any faster than
standard division route. I just hope that this version may be of interest ;-)

It is a bit of the interesting question why the quadratic iteration fails to
correctly round the 1ULP error in the subnormal range... I guess it may have to
do something with the error in the estimation of the `e` term. My guess is that
we can tolerate this error with cubic iteration since we are 'pushing down' the
error before the final approximation much more (O(e^3) vs O(e^2)). As a side
note, a similar cubic correction step is done in the slow-path of the Nvidia's
rcp implementation (https://godbolt.org/z/bqne9vY5G)

<details>
<summary> Appendix: cubic rcp iteration</summary>

Following the [ P. Markstein
book](https://www.semanticscholar.org/paper/IA-64-and-elementary-functions-speed-and-precision-Markstein/66cb19bd3d894c114d3743ffaa5b5cff179032fa)

we can define an error of an approximation `r` to `1/a` as:
```math
e = r * 1/a -1
```
Also, by definition and binominal expansion we can write:
```math
1/a = \frac{r} {e + 1} = r ( 1 + e + e^2 + \cdots)
```
We can use that to derive diffrent order iteration schemes, by taking first two
terms, we obtain quadratic Newton iteration, by taking three terms, we get a
cubic iteration.

</details>

https://github.com/llvm/llvm-project/pull/194716
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[clang] [llvm] [AMDGPU] Use 1-iteration Newton-Raphson refinement for FP32 reciprocal. (PR #194716)

Reply via email to