jyknight marked 7 inline comments as done.
jyknight added a comment.
Herald added a subscriber: pengfei.

I've finally got back to moving this patch forward -- PTAL, thanks!

To start with, I wrote a simple test-suite to verify the functionality of these 
changes. I've included it under mmx-tests/ in this version of the patch, 
although it doesn't actually belong there; I'm not sure where it _does_ belong, 
however.

The test-suite runs a number of combinations of inputs against two different 
compilers' implementations of these intrinsics and checks that they produce 
identical results. I used it to ensure that there are no behavior changes 
between clang before and after this change, and also to compare clang against 
GCC. With that in hand, I've fixed and verified all the bugs you already 
noticed in code review, as well as additional bugs the test-suite found (in 
_mm_maddubs_pi16 and _mm_shuffle_pi8). I'm now reasonably confident that this 
change does not alter the behavior of these functions. The tests also 
discovered two bugs in GCC: https://gcc.gnu.org/PR98495 and 
https://gcc.gnu.org/PR98522.
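
For context, a minimal sketch of the comparison approach (the actual mmx-tests/ 
harness differs; the wrapper names and build setup below are hypothetical -- each 
compiler under test would build the same wrappers into its own object file, and 
the driver would feed both identical inputs):

```c
/* Sketch only: compare two compilers' builds of the same intrinsic wrapper,
 * bit-for-bit, over many input combinations. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t (*binop_fn)(uint64_t, uint64_t);

/* Hypothetical symbols: one built by the reference compiler (e.g. GCC or
 * pre-patch clang), one built by the patched clang. */
extern uint64_t ref_maddubs_pi16(uint64_t, uint64_t);
extern uint64_t new_maddubs_pi16(uint64_t, uint64_t);

static int check_binop(const char *name, binop_fn ref, binop_fn new_,
                       uint64_t a, uint64_t b) {
  uint64_t r = ref(a, b), n = new_(a, b);
  if (r != n) {
    printf("MISMATCH %s(%016llx, %016llx): ref=%016llx new=%016llx\n", name,
           (unsigned long long)a, (unsigned long long)b,
           (unsigned long long)r, (unsigned long long)n);
    return 1;
  }
  return 0;
}
```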

Some other changes in this update:

I switched _mm_extract_pi16 and _mm_insert_pi16 back to using a clang 
intrinsic, for consistency with the other extract/insert macros, which use an 
intrinsic simply to force the element number to be a compile-time constant and 
to produce an error when it isn't. However, the intrinsic now lowers to generic 
IR like all the other `__builtin_ia32_vec_{ext,set}_*` builtins, rather than to 
an LLVM intrinsic that forces MMX.

I also modified the "composite" functions in xmmintrin.h to use 128-bit 
operations directly, instead of composites of multiple 64-bit operations, where 
possible.
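
For illustration only (not the exact patch code), this is the kind of rewrite 
meant here, using a float-to-4x16-bit conversion as a stand-in example:

```c
/* The old "composite" style converts each 64-bit half separately and packs
 * the two __m64 halves; with SSE2 available, one 128-bit convert+pack does
 * the whole job, and we just keep the low 64 bits of the result. */
#include <emmintrin.h>

typedef short __sketch_v4hi __attribute__((__vector_size__(8)));

static inline __sketch_v4hi cvtps_to_4xi16_128bit(__m128 __a) {
  __m128i __i32 = _mm_cvtps_epi32(__a);                        /* 4 x i32 */
  __m128i __i16 = _mm_packs_epi32(__i32, _mm_setzero_si128()); /* 8 x i16 */
  /* Take the low 64 bits, in the same spirit as the patch's __trunc64. */
  return (__sketch_v4hi)__builtin_shufflevector((__v2di)__i16,
                                                (__v2di)__i16, 0);
}
```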

Finally, the clang tests have been updated, so that all tests pass again.



================
Comment at: clang/lib/Headers/mmintrin.h:367
 {
-    return (__m64)__builtin_ia32_paddb((__v8qi)__m1, (__v8qi)__m2);
+    return (__m64)(((__v8qi)__m1) + ((__v8qi)__m2));
 }
----------------
craig.topper wrote:
> I think we should use __v8qu to match what we do in emmintrin.h. We don't 
> currently set nsw on signed vector arithmetic, but we should be careful in 
> case that changes in the future.
Done, here and everywhere else I was using signed math (except the comparisons).
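
For reference, a standalone sketch of the unsigned form; the typedef spelling 
here is an assumption (mirroring emmintrin.h's __v16qu), with local names so 
the sketch compiles on its own:

```c
/* Unsigned wraparound is well-defined, so the generated IR add could never
 * legitimately carry an nsw flag, even if clang later starts emitting nsw
 * for signed vector arithmetic. */
typedef unsigned char __v8qu_sketch __attribute__((__vector_size__(8)));
typedef long long     __m64_sketch  __attribute__((__vector_size__(8)));

static inline __m64_sketch add_pi8_sketch(__m64_sketch __m1, __m64_sketch __m2) {
  return (__m64_sketch)(((__v8qu_sketch)__m1) + ((__v8qu_sketch)__m2));
}
```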


================
Comment at: clang/lib/Headers/mmintrin.h:1097
 {
-    return __builtin_ia32_pand((__v1di)__m1, (__v1di)__m2);
+    return (__m64)(((__v1di)__m1) & ((__v1di)__m2));
 }
----------------
craig.topper wrote:
> I think we probably want to use a v2su or v2si here. Using v1di scalarizes 
> and splits on 32-bit targets. On 64-bit targets it emits GPR code.
AFAICT, this doesn't matter: it seems to emit GPR or XMM code depending only on 
whether the result values are needed in XMM registers, independent of whether 
the type is specified as v2su or v1du.
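
Roughly, the experiment behind that observation (a standalone sketch with 
hypothetical local type names, not the header code):

```c
/* Both variants compute the same 64-bit AND; only the element type of the
 * vector operands differs. */
typedef int       __v2si_sketch __attribute__((__vector_size__(8)));
typedef long long __v1di_sketch __attribute__((__vector_size__(8)));
typedef long long __m64_sketch2 __attribute__((__vector_size__(8)));

static __m64_sketch2 and_v1di(__m64_sketch2 a, __m64_sketch2 b) {
  return (__m64_sketch2)(((__v1di_sketch)a) & ((__v1di_sketch)b));
}
static __m64_sketch2 and_v2si(__m64_sketch2 a, __m64_sketch2 b) {
  return (__m64_sketch2)(((__v2si_sketch)a) & ((__v2si_sketch)b));
}
/* Per the observation above, whether these compile to a GPR 'and' or an XMM
 * 'pand' depends on how the surrounding code uses the value, not on which
 * element type is written. */
```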



================
Comment at: clang/lib/Headers/mmintrin.h:1242
 {
-    return (__m64)__builtin_ia32_pcmpgtb((__v8qi)__m1, (__v8qi)__m2);
+    return (__m64)((__v8qi)__m1 > (__v8qi)__m2);
 }
----------------
craig.topper wrote:
> Need to use __v8qs here to force "signed char" elements. __v8qi uses "char" 
> which has platform dependent signedness or can be changed with a command line.
Done.
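
For reference, a standalone sketch of the signed-char form; the typedef 
spelling is an assumption (mirroring emmintrin.h's __v16qs), with local names 
so it compiles on its own:

```c
/* Using a 'signed char' element type forces a signed compare regardless of
 * whether plain 'char' is signed on the target or of -funsigned-char. */
typedef signed char __v8qs_sketch __attribute__((__vector_size__(8)));
typedef long long   __m64_sketch3 __attribute__((__vector_size__(8)));

static inline __m64_sketch3 cmpgt_pi8_sketch(__m64_sketch3 __m1,
                                             __m64_sketch3 __m2) {
  /* Elementwise signed >, yielding 0 / -1 per element. */
  return (__m64_sketch3)((__v8qs_sketch)__m1 > (__v8qs_sketch)__m2);
}
```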


================
Comment at: clang/lib/Headers/mmintrin.h:1264
 {
-    return (__m64)__builtin_ia32_pcmpgtw((__v4hi)__m1, (__v4hi)__m2);
+    return (__m64)((__v4hi)__m1 > (__v4hi)__m2);
 }
----------------
craig.topper wrote:
> Same here
This is a short, which is always signed, so it should be ok as written.


================
Comment at: clang/lib/Headers/mmintrin.h:1394
 {
-    return _mm_set_pi32(__i, __i);
+    return __extension__ (__m64)(__v2si){__i, __i};
 }
----------------
craig.topper wrote:
> Is this needed?
No, reverted this change and the others like it.


================
Comment at: clang/lib/Headers/mmintrin.h:1452
 {
-    return _mm_set_pi32(__i1, __i0);
+    return __extension__ (__m64)(__v2si){__i1, __i0};
 }
----------------
craig.topper wrote:
> I don't think this change is needed. And I think the operands are in the 
> wrong order.
Change was unnecessary, so reverted. (But operands are supposed to be backwards 
here.)


================
Comment at: clang/lib/Headers/tmmintrin.h:17
#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("ssse3"), __min_vector_width__(64)))
-#define __DEFAULT_FN_ATTRS_MMX __attribute__((__always_inline__, __nodebug__, __target__("mmx,ssse3"), __min_vector_width__(64)))
+#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
+#define __anyext128(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, -1)
----------------
craig.topper wrote:
> I'm worried that using v1di with the shuffles will lead to scalarization in 
> the type legalizer. Should we use v2si instead?
Converting `__trunc64` to take a v4si (and thus return a v2si) seems to make 
codegen _worse_ in some cases, and I don't see any case where it gets better. 

For example,
```
#define __trunc64_1(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
#define __trunc64_2(x) (__m64)__builtin_shufflevector((__v4si)(x), __extension__ (__v4si){}, 0, 1)
__m64 trunc1(__m128i a, int i) { return __trunc64_1(__builtin_ia32_psllqi128(a, i)); }
__m64 trunc2(__m128i a, int i) { return __trunc64_2(__builtin_ia32_psllqi128(a, i)); }
```

In trunc2, you get two extraneous moves at the end:
```
        movd    %edi, %xmm1
        psllq   %xmm1, %xmm0
        movq    %xmm0, %rax // extra
        movq    %rax, %xmm0 // extra
```

I guess that's related to the calling-convention lowering, which turns __m64 
into "double", confusing the various IR simplifications?

Similarly, there are also sometimes extraneous moves to/from a GPR for argument 
passing, but I don't see an easy way around that. Both variants do that here, 
instead of just `movq    %xmm0, %xmm0`:

```
#define __anyext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, -1)
#define __anyext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
#define __zext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, 1)
#define __zext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, 2, 3)

__m128 ext1(__m64 a) { return __builtin_convertvector((__v4si)__zext128_1(a), __v4sf); }
__m128 ext2(__m64 a) { return __builtin_convertvector((__v4si)__zext128_2(a), __v4sf); }
```

Both produce:
```
        movq    %xmm0, %rax
        movq    %rax, %xmm0
        cvtdq2ps        %xmm0, %xmm0
        retq
```

However, switching to variant 2 of `anyext128` and `zext128` does seem to 
improve things in other cases, avoiding _some_ of those sorts of extraneous 
moves to a scalar register and back again. So I've made that change.



================
Comment at: clang/lib/Headers/xmmintrin.h:2316
 {
-  return __builtin_ia32_pmovmskb((__v8qi)__a);
+  return __builtin_ia32_pmovmskb128((__v16qi)__anyext128(__a));
 }
----------------
craig.topper wrote:
> This doesn't guarantee zeroes in bits 15:8 does it?
It does not. Switched to zext128.


================
Comment at: clang/lib/Headers/xmmintrin.h:2404
+  __m128i __n128  = __zext128(__n);
+  if (((__SIZE_TYPE__)__p & 0xfff) > 4096-15 && ((__SIZE_TYPE__)__p & 0xfff) < 4096-8) {
+    __p -= 8;
----------------
craig.topper wrote:
> Does this work with large pages?
Yes -- this only needs to be a boundary at which a trap _might_ occur if we 
crossed it. Whether it's in fact the end of a page or not is irrelevant; it 
only matters that it _could_ be.
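
For clarity, a standalone sketch of the workaround being discussed (the helper 
name, the use of the public SSE2 intrinsics, and the exact boundary test are 
assumptions rather than the patch's code; the 4 KiB constant only matters as a 
boundary at which a trap might occur):

```c
/* Emulate the 8-byte maskmovq with the 16-byte maskmovdqu without faulting
 * on a page that the 8-byte store would never have touched. */
#include <emmintrin.h>
#include <stdint.h>

static inline void maskmove64_via_128(__m128i __d128, __m128i __n128, char *__p)
{
  /* __n128 is assumed to have its upper eight mask lanes zero (the patch's
   * __zext128 guarantees this), so no bytes outside the original 8-byte
   * window are ever written. */
  uintptr_t __off = (uintptr_t)__p & 0xfff;
  /* The 16-byte store would spill past the 4 KiB boundary while the 8-byte
   * store would not: back the pointer up by 8 and shift data and mask up by
   * 8 byte lanes, so the selected bytes land at the same addresses. */
  if (__off + 16 > 4096 && __off + 8 <= 4096) {
    __p -= 8;
    __d128 = _mm_slli_si128(__d128, 8);
    __n128 = _mm_slli_si128(__n128, 8);
  }
  _mm_maskmoveu_si128(__d128, __n128, __p);
}
```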


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86855/new/

https://reviews.llvm.org/D86855
