Mikolaj-A-Kowalski wrote:

Hello! 

First I would like to apologise for barging in into a PR completely unrelated 
to me but recently I have been doing some work on the optimisation of the 
reciprocal and division on the Nvidia hardware and I couldn’t help but to be 
fascinated by this work :-) 

I have played a bit with the final version of the `rcp` developed here and I 
think we can improve it slightly. Basically at the cost of one extra `fma` we 
can make it correctly rounded for all inputs. Hopefully it will not be too 
painful for the performance since we can remove some instructions related to 
scaling.  Perhaps it will also help with the 1ULP annotation problem since with 
the changes the implementation appears to correctly rounded.

I am totally unfamiliar with LLVM (sadly... yet...) so I wasn't really able to 
use the tooling here and had to rely on julia with which I am more familiar to. 
I have attached the code:  
[code.zip](https://github.com/user-attachments/files/29038343/code.zip)

So the first optimisation we can do is to change how we scale the input. Since 
the 'normal' FP values also go through scaling, there is no reason to use 
factor of 1.0 for them. Any other power of 2 would work. Hence we don't need to 
split the range into three parts, but two. In the attached code I basically 
shifted all values with the absolute value below 1.0 by 2^30 and above 2^(-30) 
(With the power being quite arbitrary, it just need to be sufficiently large  
to pull all subnormals into the 'normal' binades and the numbers from the top 
of the range away from the underflow). This should save us one comparison and 1 
conditional move. 

To remove the 1ULP error we need to do two things. First we need to change when 
we apply the correction iteration to get the correct rounding. We can do it 
after scaling the number back. If we do that we can reduce the number of +ve 
floating-point numbers with 1ULP error to 5. 

The second thing to do in order to remove the error is to which from the 
quadratic (Newton) iteration to the cubic iteration. This costs additional 
`fma` but in my tests it seemed to correctly round the remaining edge-cases.  

At the end, we end up with the assembly (produced by julia with AMDGPU package) 
along the lines of: 
```
        v_mov_b32_e32 v0, 0x4e800000
        v_mov_b32_e32 v1, 0x30800000
        v_mov_b32_e32 v2, 0
        s_waitcnt lgkmcnt(0)
        s_load_dword s2, s[0:1], 0x0
        s_waitcnt lgkmcnt(0)
        v_cmp_nle_f32_e64 vcc, |s2|, 1.0
        s_nop 1
        v_cndmask_b32_e32 v0, v0, v1, vcc
        v_mul_f32_e32 v1, s2, v0
        v_rcp_f32_e32 v1, v1
        s_nop 0
        v_mul_f32_e32 v0, v0, v1
        v_fma_f32 v1, v0, -s2, 1.0
        v_fmac_f32_e32 v1, v1, v1
        v_fmac_f32_e32 v0, v1, v0
        v_div_fixup_f32 v0, v0, s2, 1.0
```

I don't have any benchmarks on hand so I cannot say if it is any faster than 
standard division route. I just hope that this version may be of interest ;-) 

It is a bit of the interesting question why the quadratic iteration fails to 
correctly round the 1ULP error in the subnormal range... I guess it may have to 
do something with the error in the estimation of the `e` term. My guess is that 
we can tolerate this error with cubic iteration since we are 'pushing down' the 
error before the final approximation much more (O(e^3) vs O(e^2)).  As a side 
note, a similar cubic correction step is done in the slow-path of the Nvidia's 
rcp implementation (https://godbolt.org/z/bqne9vY5G) 

<details>
<summary> Appendix: cubic rcp iteration</summary>

Following the [ P. Markstein 
book](https://www.semanticscholar.org/paper/IA-64-and-elementary-functions-speed-and-precision-Markstein/66cb19bd3d894c114d3743ffaa5b5cff179032fa)
 

we can define an error of an approximation `r` to `1/a` as: 
```math 
e = r * 1/a -1 
```
Also, by definition and binominal expansion we can write: 
```math
1/a = \frac{r} {e + 1} = r ( 1 + e + e^2 + \cdots)
```
We can use that to derive diffrent order iteration schemes, by taking first two 
terms, we obtain quadratic Newton iteration, by taking three terms, we get a 
cubic iteration. 



</details>

https://github.com/llvm/llvm-project/pull/194716
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to