[Bug middle-end/120378] Support narrowing clip idiom

2025-06-24 Thread Dusan.Stojkovic--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378

Dusan Stojkovic changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dusan.stojko...@rt-rk.com

--- Comment #6 from Dusan Stojkovic ---
Here are some examples from inter_pred_filter functions in x265:
https://godbolt.org/z/7asjsqcj8

They are generalized to show different variations to consider. The examples
show how introducing a temporary variable before storing the result of the
clipping produces:
```
vmsge.vi v0,v1,0
vsetvli zero,zero,e8,mf2,ta,mu
vnsrl.wi v6,v1,0,v0.t
vsetvli zero,zero,e16,m1,ta,ma
vmsle.vv v0,v1,v5
vsetvli zero,zero,e8,mf2,ta,ma
vmerge.vvm v1,v3,v6,v0
```

Shouldn't both cases generate the vmax/vmin/truncate pattern at least?
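
For readers without the godbolt link open, the two shapes being contrasted can
be sketched in scalar C. This is only an illustration, not the code from the
linked examples: the function names are hypothetical, and the int16_t to
uint8_t types and the 0..255 bounds are assumptions read off the e16/e8
element widths in the listing above.

```c
#include <stdint.h>

/* Compare-and-select shape: two comparisons choose between the bounds
   and the narrowed value (roughly what the mask/merge sequence does). */
uint8_t clip_select(int16_t v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* max/min/truncate shape: clamp in the wide type, then narrow once. */
uint8_t clip_minmax(int16_t v)
{
    int16_t lo = v < 0 ? 0 : v;        /* max(v, 0)    */
    int16_t hi = lo > 255 ? 255 : lo;  /* min(lo, 255) */
    return (uint8_t)hi;                /* truncate     */
}
```

Both functions return the same value for every int16_t input, which is why
one would hope for the same vector code in both cases.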

Curiously, when doing a signed clip, GCC takes two different approaches; this
time the choice depends not on the type of store, but on the size difference
between the type being clipped and the resulting type.

At the end there is a case with two functions which could be optimized with:
```
...
csrwi   vxrm,0
...
vnclipu.wi  v1,v1,6
...
```
Here GCC chooses vmax/vmin/truncate regardless of whether a temporary
variable is introduced.
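
To see why vnclipu.wi with a shift of 6 fits here, a scalar model may help.
The sketch below assumes the source loop has the usual "add half, shift, clip"
form with an offset of 32; the function names are hypothetical and the
16-bit to 8-bit widths are an assumption. With vxrm set to 0
(round-to-nearest-up), the rounding increment is the highest bit shifted out,
which matches adding 32 before shifting right by 6.

```c
#include <stdint.h>

/* Scalar model of an unsigned narrowing clip, 16 -> 8 bits, under the
   round-to-nearest-up mode (vxrm = 0). Requires shift < 16. */
uint8_t nclipu_rnu_16_to_8(uint16_t v, unsigned shift)
{
    /* Rounding increment: the most significant bit that gets shifted out. */
    uint32_t r = shift ? (((uint32_t)v >> (shift - 1)) & 1u) : 0u;
    uint32_t res = ((uint32_t)v >> shift) + r;
    /* Saturate to the narrow unsigned range. */
    return res > UINT8_MAX ? UINT8_MAX : (uint8_t)res;
}

/* Source-level form (assumed): add half, shift, clip to 8 bits. */
uint8_t round_shift_clip(uint16_t v)
{
    uint32_t t = ((uint32_t)v + 32u) >> 6;
    return t > 255u ? 255u : (uint8_t)t;
}
```

The two functions agree for every uint16_t input when the model is called
with a shift of 6.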

[Bug middle-end/120378] Support narrowing clip idiom

2025-05-23 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378

--- Comment #5 from Richard Biener ---
(In reply to Robin Dapp from comment #4)
> Does it make sense to have the vmax/vmin/truncate pattern as a fallback for
> other targets?  On riscv it would save one predicated instruction.

I think targets that want this could implement the optab in this way
themselves, so I'd not do this initially.

[Bug middle-end/120378] Support narrowing clip idiom

2025-05-23 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378

--- Comment #3 from Robin Dapp ---
vnclipu is basically a scaling (narrowing), rounding shift with subsequent
"clip" i.e. saturation.  Its input and output are unsigned, though, so for the
function above we first need to "clip" the negative values to 0 and then shift
twice from unsigned int to unsigned char.

So far I'm only aware of the vector insn but I think there are discussions
about a scalar one for a future extension.

There's also vnclip (signed -> signed).

An alternative to vnclipu would be vmax (vmin (...)) but then we'd still need
to truncate the result 2x.  Truncations are narrowing shifts as well, meaning
we'd need 4 instructions instead of 3.
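
The instruction counts above can be followed in a scalar sketch. This is an
illustration only: the function names are hypothetical, a shift amount of 0 is
assumed for each vnclipu step, and the int to unsigned char types are taken
from the description of the function under discussion.

```c
#include <stdint.h>

/* One vnclipu step with shift 0: saturating unsigned narrowing. */
uint16_t sat_narrow_u32_to_u16(uint32_t v)
{
    return v > UINT16_MAX ? UINT16_MAX : (uint16_t)v;
}

uint8_t sat_narrow_u16_to_u8(uint16_t v)
{
    return v > UINT8_MAX ? UINT8_MAX : (uint8_t)v;
}

/* Three steps: clip negatives to 0, then two saturating narrowings. */
uint8_t clip_via_nclipu(int32_t x)
{
    uint32_t nonneg = x < 0 ? 0u : (uint32_t)x;        /* vmax      */
    uint16_t half = sat_narrow_u32_to_u16(nonneg);     /* vnclipu   */
    return sat_narrow_u16_to_u8(half);                 /* vnclipu   */
}

/* Four steps: max, min, then two plain truncating narrowings. */
uint8_t clip_via_minmax(int32_t x)
{
    int32_t lo = x < 0 ? 0 : x;        /* vmax              */
    int32_t hi = lo > 255 ? 255 : lo;  /* vmin              */
    uint16_t half = (uint16_t)hi;      /* truncating narrow */
    return (uint8_t)half;              /* truncating narrow */
}
```

Both variants compute the same clamp to 0..255; the saturation built into the
narrowing step is what makes the separate vmin unnecessary.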

[Bug middle-end/120378] Support narrowing clip idiom

2025-05-23 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378

--- Comment #4 from Robin Dapp ---
Does it make sense to have the vmax/vmin/truncate pattern as a fallback for
other targets?  On riscv it would save one predicated instruction.

[Bug middle-end/120378] Support narrowing clip idiom

2025-05-23 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378

--- Comment #2 from Richard Biener ---
So what does vnclipu do?  But yes, the way to fix this is to add an optab for it,
a vectorizer pattern and/or a match rule (in case that insn is a thing for
non-vector as well).

[Bug middle-end/120378] Support narrowing clip idiom

2025-05-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378

--- Comment #1 from Hongtao Liu ---

> The ifcvt'ed code before vect is:
> 
>   _4 = *_3;
>   x.0_12 = (unsigned int) _4;
>   _38 = -x.0_12;
>   _15 = (int) _38;
>   _16 = _15 >> 31;
>   _29 = x.0_12 > 255;
>   _17 = _29 ? _16 : _4;
>   _18 = (unsigned char) _17;
> 

For the testcase in the PR, I think x.0_12 > 255 must be false since it's a
zero_extend from unsigned char. So the comparison can be optimized away?
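
For reference, the quoted GIMPLE can be rendered statement by statement in
scalar C. This is a sketch, not the testcase from the PR: the function name is
hypothetical, and `_4` is assumed to have type int, since `_17` selects
between `_16` (an int) and `_4`. It also relies on GCC's behaviour for two
implementation-defined operations, the out-of-range unsigned to int conversion
and the arithmetic right shift of a negative int.

```c
#include <stdint.h>

/* Hypothetical scalar rendering of the ifcvt'ed GIMPLE. */
unsigned char clip_gimple_model(int v4)
{
    unsigned int x = (unsigned int)v4;  /* x.0_12 = (unsigned int) _4 */
    unsigned int neg = -x;              /* _38 = -x.0_12              */
    int as_int = (int)neg;              /* _15 = (int) _38            */
    int sign = as_int >> 31;            /* _16 = _15 >> 31            */
    int sel = x > 255 ? sign : v4;      /* _29 and _17                */
    return (unsigned char)sel;          /* _18 = (unsigned char) _17  */
}
```

Under these assumptions, inputs in 0..255 pass through unchanged, small
negative inputs give 0, and inputs above 255 give 255.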