| Field | Value |
|---|---|
| Issue | 169722 |
| Summary | manual implementation of `_mm_mask_cvtusepi16_epi8` does not optimize well |
| Labels | new issue |
| Assignees | |
| Reporter | folkertdev |

In `rustc` it is convenient to define vectorized functions in terms of the cross-platform primitives that LLVM provides, rather than using platform-specific intrinsics. One reason is that Miri (the Rust interpreter) and the non-LLVM backends can then reuse this code (though they generally generate less efficient assembly). That often works, but we're struggling with `_mm_mask_cvtusepi16_epi8`:
https://godbolt.org/z/4axG5bEf5
```asm
_mm256_mask_cvtusepi16_epi8_intrinsic: # @_mm256_mask_cvtusepi16_epi8_intrinsic
kmovd k1, edi
vpmovuswb xmm0 {k1}, ymm1
vzeroupper
ret
_mm256_mask_cvtusepi16_epi8_manual: # @_mm256_mask_cvtusepi16_epi8_manual
kmovd k1, edi
vpmovuswb xmm0 {k1}, ymm1
vzeroupper
ret
_mm_mask_cvtusepi16_epi8_intrinsic: # @_mm_mask_cvtusepi16_epi8_intrinsic
kmovd k1, edi
vpmovuswb xmm0 {k1}, xmm1
ret
_mm_mask_cvtusepi16_epi8_manual: # @_mm_mask_cvtusepi16_epi8_manual
vpmovuswb xmm1, xmm1
or edi, 65280
kmovd k1, edi
vmovdqu8 xmm0 {k1}, xmm1
ret
```
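For orientation, here is a scalar sketch of what the 128-bit operation is expected to compute. It is not the code from the godbolt link. The function name `mask_cvtusepi16_epi8_model` is made up for illustration, and zeroing the upper eight bytes reflects my reading of Intel's pseudocode for the intrinsic, so treat that part as an assumption.

```rust
/// Scalar model (hypothetical name) of the 128-bit masked, unsigned-saturating
/// u16 -> u8 conversion. Lanes whose mask bit is clear are copied from `src`.
/// Assumption: the upper eight result bytes are zero.
fn mask_cvtusepi16_epi8_model(src: [u8; 16], k: u8, a: [u16; 8]) -> [u8; 16] {
    let mut dst = [0u8; 16];
    for j in 0..8 {
        dst[j] = if (k >> j) & 1 == 1 {
            // Unsigned saturation: values above 255 clamp to 255.
            a[j].min(u8::MAX as u16) as u8
        } else {
            // Mask bit clear: keep the byte from `src`.
            src[j]
        };
    }
    dst
}
```

Any manual implementation built from generic primitives should agree with this model lane by lane, which makes it a handy oracle when comparing the two lowerings.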
Clearly some thought has gone into this, and with LLVM 21 the 256-bit version optimizes nicely. The 128-bit version does not: instead of a single masked `vpmovuswb`, LLVM emits an unmasked `vpmovuswb` followed by a separate masked `vmovdqu8` blend. I believe that the culprit is a widening truncation:
```
Optimized legalized selection DAG: %bb.0 'cvtusepi16_epi8_128:bb4'
SelectionDAG has 17 nodes:
t0: ch,glue = EntryToken
t4: i32,ch = CopyFromReg t0, Register:i32 %1
t6: i32 = AssertZext t4, ValueType:ch:i8
t24: i16 = truncate t6
t18: v16i1 = bitcast t24
t9: v8i16,ch = CopyFromReg t0, Register:v8i16 %2
t26: v16i8 = X86ISD::VTRUNCUS t9
t2: v16i8,ch = CopyFromReg t0, Register:v16i8 %0
t19: v16i8 = vselect t18, t26, t2
t22: ch,glue = CopyToReg t0, Register:v16i8 $xmm0, t19
t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:v16i8 $xmm0, t22:1
===== Instruction selection begins: %bb.0 'bb4'
```
If you look carefully, the `X86ISD::VTRUNCUS` node turns a `v8i16` into a `v16i8`: it truncates the elements, but also creates 8 extra elements out of thin air. This type change seems to confuse the patterns that would otherwise match a `vselect` on a `VTRUNCUS` and fold it into the more efficient masked form.
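To make the shape of the unoptimized sequence concrete, here is a rough scalar sketch of the three steps visible in the `_mm_mask_cvtusepi16_epi8_manual` assembly: an unmasked widening truncate, a mask widened with `0xFF00`, and a 16-lane blend. All function names are hypothetical, and the assumption that the eight extra lanes of the truncate are zero is mine, not something the DAG dump states.

```rust
/// Hypothetical model of the widening saturating truncate: eight u16 lanes in,
/// sixteen u8 lanes out. Assumption: the eight extra lanes are zero.
fn vtruncus_v8i16_to_v16i8(a: [u16; 8]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for j in 0..8 {
        out[j] = a[j].min(255) as u8; // clamp to the u8 range
    }
    out
}

/// Hypothetical model of a 16-lane byte select driven by a 16-bit mask.
fn vselect_v16i8(mask: u16, on_true: [u8; 16], on_false: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for j in 0..16 {
        out[j] = if (mask >> j) & 1 == 1 { on_true[j] } else { on_false[j] };
    }
    out
}

/// Mirrors the emitted sequence: truncate unmasked, set the upper eight mask
/// bits (the `or edi, 65280` in the assembly), then blend with `src`.
fn manual_lowering(src: [u8; 16], k: u8, a: [u16; 8]) -> [u8; 16] {
    let wide_mask = (k as u16) | 0xFF00;
    vselect_v16i8(wide_mask, vtruncus_v8i16_to_v16i8(a), src)
}
```

Under these assumptions the blend always takes the upper eight lanes from the truncate result, so the whole sequence computes the same bytes as a single masked truncate would. The missed fold therefore looks like a pattern-matching gap rather than a semantic difference.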
So, is there a way to fix this, either by tweaking the input IR so that the relevant rules fire, or by adding or adjusting rewrites in LLVM itself?