Issue 176879
Summary [X86] Match PCLMULQDQ codegen with llvm.clmul intrinsic implementation
Labels backend:X86, missed-optimization
Assignees
Reporter RKSimon
    https://rust.godbolt.org/z/Gznh7aMrY

Attempting to recreate the PCLMULQDQ instruction using the generic llvm.clmul intrinsic results in less-than-ideal codegen:
```ll
define <2 x i64> @pclmul(<2 x i64> %v0, <2 x i64> %v1) {
  %i0 = zext i1 0 to i64 ; constant time lo/hi select
  %i1 = zext i1 1 to i64 ; constant time lo/hi select
  %a0 = extractelement <2 x i64> %v0, i64 %i0
  %a1 = extractelement <2 x i64> %v1, i64 %i1
  %x0 = zext i64 %a0 to i128
  %x1 = zext i64 %a1 to i128
  %cl = call i128 @llvm.clmul.i128(i128 %x0, i128 %x1)
  %r = bitcast i128 %cl to <2 x i64>
  ret <2 x i64> %r
}
```
```asm
pclmul:                                 # @pclmul
        vpshufd $238, %xmm1, %xmm1              # xmm1 = xmm1[2,3,2,3]
        xorl    %eax, %eax
        vmovq   %rax, %xmm2
        vpclmulqdq      $0, %xmm2, %xmm1, %xmm3
        vmovq   %xmm3, %rax
        vpclmulqdq      $0, %xmm2, %xmm0, %xmm2
        vmovq   %xmm2, %rcx
        xorq    %rax, %rcx
        vpclmulqdq      $0, %xmm1, %xmm0, %xmm0
        vpextrq $1, %xmm0, %rax
        xorq    %rcx, %rax
        vmovq   %rax, %xmm1
        vpunpcklqdq     %xmm1, %xmm0, %xmm0     # xmm0 = xmm0[0],xmm1[0]
        retq
```

- Failing to fold clmul(x,0) -> 0
- Failing to fold shuffles into pclmulqdq masks - pclmulqdq(shuffle(x),y,c0) -> pclmulqdq(x,y,c1)
- Failing to avoid fpu <-> gpr traffic (the vmovq/vpextrq round trips through %rax/%rcx)
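
As a rough illustration of the first bullet, here is a portable C reference model. It is only a sketch, not LLVM's expansion code: the names `u128_pair`, `clmul64`, `clmul128_expanded` and `zext_fold_holds` are made up for this example, and it assumes the i128 clmul is split into three 64-bit carry-less multiplies, which is what the asm above appears to do.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_pair;

/* Reference 64x64 -> 128 carry-less multiply: for every set bit i of b,
   XOR in a copy of a shifted left by i (no carries between bit positions). */
static u128_pair clmul64(uint64_t a, uint64_t b) {
    u128_pair r = {0, 0};
    for (unsigned i = 0; i < 64; ++i) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0)
                r.hi ^= a >> (64 - i); /* bits shifted out of the low half */
        }
    }
    return r;
}

/* Truncated 128x128 -> 128 carry-less multiply built from 64-bit pieces.
   Assumption: this mirrors the three-multiply sequence seen in the asm,
   where the low halves of the two cross products are XORed into the
   high half of the lo*lo product. */
static u128_pair clmul128_expanded(u128_pair a, u128_pair b) {
    u128_pair r = clmul64(a.lo, b.lo);
    r.hi ^= clmul64(a.hi, b.lo).lo ^ clmul64(a.lo, b.hi).lo;
    return r;
}

/* With zero-extended operands both cross products are clmul(x, 0) == 0,
   so the expanded form must equal a single 64x64 multiply. Returns 1 if so. */
static int zext_fold_holds(uint64_t a, uint64_t b) {
    u128_pair x0 = {a, 0}, x1 = {b, 0}; /* zext i64 -> i128 */
    u128_pair wide = clmul128_expanded(x0, x1);
    u128_pair narrow = clmul64(a, b);
    return wide.lo == narrow.lo && wide.hi == narrow.hi;
}
```

For the reproducer, `a` would be `%v0[0]` and `b` would be `%v1[1]`. If the clmul(x,0) fold and the shuffle fold both applied, a single PCLMULQDQ selecting those two lanes through its immediate (presumably 0x10, with `%v0` as the first source and `%v1` as the second) should be enough, with no GPR round trips.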

This ticket isn't about removing the PCLMULQDQ intrinsics - just ensuring that llvm.clmul lowering is reasonably efficient so we can safely use it for other bit-twiddling tricks.