carlobertolli wrote:

> > That's right: +-0 and +-inf are still handled by full division.
> 
> v_rcp itself handles these cases correctly. Only the NR fixup messes them up. 
> So how about a structure like:
> 
> ```
> y = rcp(x);
> if (x != 0 && !isinf(x)) {
>   // do the NR fixup
> }
> ```
> 
> This avoids having the complete 12-instruction fdiv expansion bloating the 
> code.
> 
> > I fell for the "let's remove the if-then-else and handle everything with 
> > selects!" idea a couple of times already. When I do that, I see a 
> > performance regression (~30%) in an arithmetic-intensive kernel: full 
> > division is more efficient than Newton-Raphson plus a bunch of selects.
> 
> What architecture are you benchmarking on? In the codegen with selects 
> (gfx1010) I see:
> 
> ```
>       v_cmp_lt_f32_e64 s1, 0x7e800000, |s0|
>       v_cmp_ngt_f32_e64 vcc_lo, 0x800000, |s0|
>       v_cndmask_b32_e64 v0, 1.0, 0x2f800000, s1
>       v_cndmask_b32_e32 v0, 0x4f800000, v0, vcc_lo
> ```
> 
> This is suspicious because GFX10+ has a fast path for v_cmp immediately 
> followed by v_cndmask, so I would expect this code to be ordered more like:
> 
> ```
>       v_cmp_lt_f32_e64 vcc_lo, 0x7e800000, |s0|
>       v_cndmask_b32_e64 v0, 1.0, 0x2f800000, vcc_lo
>       v_cmp_ngt_f32_e64 vcc_lo, 0x800000, |s0|
>       v_cndmask_b32_e32 v0, 0x4f800000, v0, vcc_lo
> ```
> 
> > 1 ULP for normals, 0 for everything else.
> 
> Isn't there some existing metadata or similar that specifies the max ULP 
> error the user will tolerate? You could piggyback on that instead of 
> inventing a new command-line option to control this.
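
For reference, a minimal C sketch of the guarded structure suggested in the quoted review, written to run on a host CPU rather than the GPU. `approx_rcp` is a hypothetical stand-in for v_rcp_f32 (it is exact here, unlike the hardware), and `nr_step` is one textbook Newton-Raphson step, not necessarily the fixup sequence used in the patch. The unguarded variant shows why the fixup breaks the special cases: for +-0 and +-inf the product x*y is 0*inf, which is NaN.

```c
#include <math.h>

/* Hypothetical stand-in for v_rcp_f32. The hardware result is only
   approximate; 1.0f / x is used so the sketch runs on a host CPU.
   Like v_rcp, it maps +-0 to +-inf and +-inf to +-0. */
static float approx_rcp(float x) {
    return 1.0f / x;
}

/* One textbook Newton-Raphson step for 1/x: y' = y * (2 - x*y).
   The sequence in the patch may differ (e.g. it may use FMAs). */
static float nr_step(float x, float y) {
    return y * (2.0f - x * y);
}

/* Unconditional fixup: for x = +-0, x*y is 0*inf = NaN, and for
   x = +-inf, x*y is inf*0 = NaN, so the correct rcp result is lost. */
float rcp_unguarded(float x) {
    return nr_step(x, approx_rcp(x));
}

/* Guarded fixup: keep the rcp result for +-0 and +-inf and run the
   Newton-Raphson step only for the remaining inputs. */
float rcp_guarded(float x) {
    float y = approx_rcp(x);
    if (x != 0.0f && !isinf(x))
        y = nr_step(x, y);
    return y;
}
```

With the guard, `rcp_guarded(0.0f)` stays +inf and `rcp_guarded(INFINITY)` stays +0, while `rcp_unguarded` returns NaN for both.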

I am running the experiments on an MI350.

https://github.com/llvm/llvm-project/pull/194716
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
