[Bug middle-end/120378] Support narrowing clip idiom
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378 Dusan Stojkovic changed: What|Removed |Added CC||dusan.stojko...@rt-rk.com --- Comment #6 from Dusan Stojkovic --- Here are some examples from inter_pred_filter functions in x265: https://godbolt.org/z/7asjsqcj8 They are generalized to show different variations to consider. The examples show how introducing a temporary variable before storing the result of the clipping produces: ``` vmsge.viv0,v1,0 vsetvli zero,zero,e8,mf2,ta,mu vnsrl.wiv6,v1,0,v0.t vsetvli zero,zero,e16,m1,ta,ma vmsle.vvv0,v1,v5 vsetvli zero,zero,e8,mf2,ta,ma vmerge.vvm v1,v3,v6,v0 ``` Shouldn't both cases generate the vmax/vmin/truncate pattern at least? Curiously, when doing a signed clip, there are two different approaches taken by GCC; this time the choice doesn't involve the type of store, but rather the size difference between the type being clipped and the resulting type. There is a case where at the end with two functions which could be optimized with: ``` ... csrwi vxrm,0 ... vnclipu.wi v1,v1,6 ... ``` Here GCC chooses vmax/vmin/truncate regardless of introducing a temporary variable or not.
[Bug middle-end/120378] Support narrowing clip idiom
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378 --- Comment #5 from Richard Biener --- (In reply to Robin Dapp from comment #4) > Does it make sense to have the vmax/vmin/truncate pattern as a fallback for > other targets? On riscv it would save one predicated instruction. I think if targets decide that they could implement the optab in this way, so I'd not do this initially.
[Bug middle-end/120378] Support narrowing clip idiom
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378 --- Comment #3 from Robin Dapp --- vnclipu is basically a scaling (narrowing), rounding shift with subsequent "clip" i.e. saturation. Its input and output is unsigned, though, so for the function above we first need to "clip" the negative values to 0 and then shift twice from unsigned int to unsigned char. So far I'm only aware of the vector insn but I think there are discussions about a scalar one for a future extension. There's also vnclip (signed -> signed). An alternative to vnclipu would be vmax (vmin (...)) but then we'd still need to truncate the result 2x. Truncations are narrowing shifts as well, meaning we'd need 4 instructions instead of 3.
[Bug middle-end/120378] Support narrowing clip idiom
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378 --- Comment #4 from Robin Dapp --- Does it make sense to have the vmax/vmin/truncate pattern as a fallback for other targets? On riscv it would save one predicated instruction.
[Bug middle-end/120378] Support narrowing clip idiom
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378 --- Comment #2 from Richard Biener --- So what does vnclipu do? But yes, the way to fix is to add an optab for this, a vectorizer pattern and/or a match rule (in case that insn is a thing for non-vector as well).
[Bug middle-end/120378] Support narrowing clip idiom
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120378 --- Comment #1 from Hongtao Liu --- > The ifcvt'ed code before vect is: > > _4 = *_3; > x.0_12 = (unsigned int) _4; > _38 = -x.0_12; > _15 = (int) _38; > _16 = _15 >> 31; > _29 = x.0_12 > 255; > _17 = _29 ? _16 : _4; > _18 = (unsigned char) _17; > For the testcase in PR, I think x.0_12 > 255 must be false since it's zero_extend from unsigned char. So the comparison can be optimized off?