carlobertolli wrote:
> > That's right: +-0 and +-inf are still handled by the full division.
>
> v_rcp itself handles these cases correctly. Only the NR fixup messes them up.
> So how about a structure like:
>
> ```
> y = rcp(x);
> if (x not 0 or inf) {
> do the NR fixup
> }
> ```
>
> This avoids having the complete 12-instruction fdiv expansion bloating the
> code.
>
> > I fell for the "let's remove the if-then-else and handle everything with
> > selects!" idea a couple of times already. When I do that I see a
> > performance regression in an arithmetic-intensive kernel (~30%): full
> > division is more efficient than Newton-Raphson + a bunch of selects.
>
> What architecture are you benchmarking on? In the codegen with selects
> (gfx1010) I see:
>
> ```
> v_cmp_lt_f32_e64 s1, 0x7e800000, |s0|
> v_cmp_ngt_f32_e64 vcc_lo, 0x800000, |s0|
> v_cndmask_b32_e64 v0, 1.0, 0x2f800000, s1
> v_cndmask_b32_e32 v0, 0x4f800000, v0, vcc_lo
> ```
>
> This is suspicious because GFX10+ has a fast path for v_cmp immediately
> followed by v_cndmask, so I would expect this code to be ordered more like:
>
> ```
> v_cmp_lt_f32_e64 vcc_lo, 0x7e800000, |s0|
> v_cndmask_b32_e64 v0, 1.0, 0x2f800000, vcc_lo
> v_cmp_ngt_f32_e64 vcc_lo, 0x800000, |s0|
> v_cndmask_b32_e32 v0, 0x4f800000, v0, vcc_lo
> ```
>
> > 1ULP for normals, 0 for everything else.
>
> Isn't there some existing metadata or similar that specifies the max ULP
> error the user will tolerate? You could piggyback on that instead of
> inventing a new command line option to control this.
I am running the experiments on an MI350.
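
For reference, the guarded structure suggested above (rcp first, NR fixup
only when x is not 0 or inf) could be sketched in C roughly as below. This is
only an illustration, not the code in the PR: approx_rcp and
rcp_with_guarded_nr are hypothetical names, approx_rcp merely stands in for
v_rcp_f32, and the denormal scaling from the codegen quoted above is left
out, so inputs whose reciprocal overflows are not covered.

```c
#include <math.h>

/* Hypothetical stand-in for the hardware v_rcp_f32 approximation.
   Real hardware returns an approximate result; plain division is used
   here only so that the sketch is self-contained. */
static float approx_rcp(float x) {
    return 1.0f / x;
}

/* Reciprocal with one Newton-Raphson refinement step.
   The step is skipped for +-0, +-inf and NaN: the raw rcp result is
   already correct there, and the fixup would turn it into NaN
   (for example, 1 - 0 * inf is NaN). */
static float rcp_with_guarded_nr(float x) {
    float y = approx_rcp(x);
    if (x != 0.0f && isfinite(x)) {
        float e = 1.0f - x * y; /* residual; an FMA would be used on hardware */
        y = y + y * e;          /* refined estimate */
    }
    return y;
}
```

With this shape the only extra cost over the unguarded version is the
compare and branch (or select) around the two fixup operations.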
https://github.com/llvm/llvm-project/pull/194716
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits