[llvm-bugs] [Bug 169722] manual implementation of `_mm_mask_cvtusepi16_epi8` does not optimize well

LLVM Bugs via llvm-bugs Fri, 28 Nov 2025 22:15:29 -0800

Issue	169722
Summary	manual implementation of `_mm_mask_cvtusepi16_epi8` does not optimize well
Labels	new issue
Assignees
Reporter	folkertdev

In `rustc` it is convenient to define vectorized functions in terms of the cross-platform primitives that LLVM provides, rather than using platform-specific intrinsics. One reason is that Miri (the Rust interpreter) and the non-LLVM backends can then reuse this code (though they generally generate less efficient assembly). Often we can get that to work, but we're struggling with `_mm_mask_cvtusepi16_epi8`:


https://godbolt.org/z/4axG5bEf5

```asm
_mm256_mask_cvtusepi16_epi8_intrinsic: # @_mm256_mask_cvtusepi16_epi8_intrinsic
        kmovd   k1, edi
        vpmovuswb       xmm0 {k1}, ymm1
        vzeroupper
        ret
_mm256_mask_cvtusepi16_epi8_manual:     # @_mm256_mask_cvtusepi16_epi8_manual
        kmovd   k1, edi
        vpmovuswb       xmm0 {k1}, ymm1
        vzeroupper
        ret
_mm_mask_cvtusepi16_epi8_intrinsic:     # @_mm_mask_cvtusepi16_epi8_intrinsic
        kmovd   k1, edi
        vpmovuswb       xmm0 {k1}, xmm1
        ret
_mm_mask_cvtusepi16_epi8_manual:        # @_mm_mask_cvtusepi16_epi8_manual
        vpmovuswb       xmm1, xmm1
        or      edi, 65280
        kmovd   k1, edi
        vmovdqu8        xmm0 {k1}, xmm1
        ret
```

Clearly some thought has gone into this, and with LLVM 21 the 256-bit version optimizes nicely. The 128-bit version does not, however: LLVM gets confused somewhere along the way. I believe that the culprit is a widening truncation:

```
Optimized legalized selection DAG: %bb.0 'cvtusepi16_epi8_128:bb4'
SelectionDAG has 17 nodes:
  t0: ch,glue = EntryToken
            t4: i32,ch = CopyFromReg t0, Register:i32 %1
          t6: i32 = AssertZext t4, ValueType:ch:i8
        t24: i16 = truncate t6
      t18: v16i1 = bitcast t24
        t9: v8i16,ch = CopyFromReg t0, Register:v8i16 %2
      t26: v16i8 = X86ISD::VTRUNCUS t9
      t2: v16i8,ch = CopyFromReg t0, Register:v16i8 %0
    t19: v16i8 = vselect t18, t26, t2
  t22: ch,glue = CopyToReg t0, Register:v16i8 $xmm0, t19
  t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:v16i8 $xmm0, t22:1


===== Instruction selection begins: %bb.0 'bb4'
```

If you look carefully, the `X86ISD::VTRUNCUS` turns a `v8i16` into a `v16i8`, so it truncates the elements, but creates 8 extra elements out of thin air. This type change seems to confuse the patterns that would otherwise catch the `vselect` on a `VTRUNCUS` and convert it to the more efficient form.
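
To make the two shapes concrete, here is a minimal scalar sketch in plain Rust. The function names `narrow_masked` and `narrow_then_select` are hypothetical and exist only for illustration. The first models the single masked instruction (8 lanes, with the upper 8 result bytes zeroed, which is my reading of the intrinsic's documented behavior). The second models what the unoptimized assembly does: an unmasked widened truncation, then a 16-lane select whose upper 8 mask bits are forced on (`65280 == 0xFF00`). It is a model of the semantics under those assumptions, not the code from the godbolt link.

```rust
/// Scalar model of the 128-bit masked saturating narrow (hypothetical helper).
/// Lanes 0..8 are selected by `k`; lanes 8..16 of the result are zero.
fn narrow_masked(src: [u8; 16], k: u8, a: [u16; 8]) -> [u8; 16] {
    let mut dst = [0u8; 16];
    for j in 0..8 {
        dst[j] = if (k >> j) & 1 == 1 {
            a[j].min(u8::MAX as u16) as u8 // unsigned saturation to 0..=255
        } else {
            src[j] // mask bit clear: keep the byte from `src`
        };
    }
    dst
}

/// The same result, computed the way the unoptimized assembly does it:
/// an unmasked widened truncation followed by a 16-lane select.
fn narrow_then_select(src: [u8; 16], k: u8, a: [u16; 8]) -> [u8; 16] {
    // Widened truncation: 8 saturated bytes followed by 8 zero bytes.
    let mut wide = [0u8; 16];
    for j in 0..8 {
        wide[j] = a[j].min(u8::MAX as u16) as u8;
    }
    // Force the upper 8 mask bits on, so the padding lanes always come
    // from `wide` (and are therefore zero).
    let k16 = u16::from(k) | 0xFF00;
    let mut dst = [0u8; 16];
    for l in 0..16 {
        dst[l] = if (k16 >> l) & 1 == 1 { wide[l] } else { src[l] };
    }
    dst
}
```

In this model the two functions agree for every mask value, so folding the 16-lane select back into a single masked truncation would not change the result; the widened form only hides that from a pattern written against the 8-lane shape.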

So, is there a way to fix this, either by tweaking the input IR so that the relevant rules fire, or by adding or adjusting rewrites in LLVM itself?

