[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-26 Thread James Y Knight via cfe-commits

jyknight wrote:

> Really, the question is whether we plan to completely drop support for the 
> x86_mmx type (including inline asm operands/results)

Yes, I do think it would be good to eliminate the type.

For inline-asm, we could switch to using a standard IR vector type for "y" 
constraint operands/results, and teach IR lowering to copy to/from MMX 
registers at the border. This is basically what Clang does already, at the IR 
level; we'd be pushing that down into IR lowering. It would have some minor 
performance impact on anything passing MMX values directly between two 
inline-asm statements (redundant movdq2q/movq2dq), but, so far as I can tell, 
almost nobody ever uses the "y" constraint anyways -- mmx registers are more 
often loaded/stored to memory inside of an inline-asm, instead. Also, clearly 
nothing still using MMX can be _that_ performance sensitive, or it would've 
been migrated to SSE/AVX sometime in the last 20 years.
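(For illustration, a minimal sketch of the kind of code this affects; the function name is hypothetical. Today Clang already represents `__m64` as an ordinary 64-bit vector and only converts to/from the MMX register class at the asm boundary, which is the copy-at-the-border behavior that would move down into IR lowering:)

```c
#include <mmintrin.h>

/* Hypothetical user code: add packed bytes via MMX inline asm, using the "y"
 * constraint so the operands live in MMX registers. */
static inline __m64 add_pi8_asm(__m64 a, __m64 b) {
  __m64 result;
  __asm__("paddb %2, %0"
          : "=y"(result)    /* output in an MMX register */
          : "0"(a), "y"(b)  /* inputs in MMX registers */);
  /* A real caller would still need _mm_empty() before any x87 code. */
  return result;
}
```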

One more option which is made trivial by eliminating the x86_mmx IR type would 
be to insert an "emms" after the return-value extraction for all inline-asm 
statements which are marked with either mmx clobbers or a "y" constraint. It 
would be trivial at that point -- there'd no longer be any need for any special 
logic to track where to insert it, since we can be sure there is not any live 
MMX state to worry about. That comes with more potential for performance 
impact, of course. (but, again, maybe that doesn't actually matter).
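(Concretely, a sketch of what the proposed automatic `emms` insertion would amount to at the source level; the function below is hypothetical:)

```c
#include <mmintrin.h>

/* An asm statement with a "y" output leaves the FPU in MMX mode. Under the
 * proposal, the compiler would insert an emms right after extracting the
 * output -- as if the user had written the _mm_empty() call themselves. */
static inline __m64 copy_through_mmx(__m64 v) {
  __m64 out;
  __asm__("movq %1, %0" : "=y"(out) : "y"(v));
  _mm_empty(); /* what the automatic insertion would be equivalent to */
  return out;
}
```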

https://github.com/llvm/llvm-project/pull/96540
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -0,0 +1,29 @@
+USE_XMM=

phoebewang wrote:

How about we move these old implementations into a separate file (or leave 
them where they are, if you like) and rename them to `_dept`, so that we don't 
rely on old compilers? We can then remove them, as well as the tests, in the next 
release.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), 
__aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));
+
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS 
\
-  __attribute__((__always_inline__, __nodebug__, __target__("mmx,no-evex512"), 
\
- __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, 
__target__("sse2,no-evex512"), __min_vector_width__(64)))

phoebewang wrote:

They are used to identify the maximum vector register width a function will use. 
The backend doesn't currently distinguish between scalar and 128-bit usage, but 
we should still make this comply with the rule here.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Eli Friedman via cfe-commits

efriedma-quic wrote:

> Or, if we do need to preserve bitcode compat, how to best achieve it? Perhaps 
> we convert them into inline-asm in the bitcode upgrader?

Really, the question is whether we plan to completely drop support for the 
x86_mmx type (including inline asm operands/results).  If we don't, then 
there's not much reason to touch the LLVM IR builtins; there isn't any actual 
"code" outside of TableGen, so there's basically zero maintenance work.  If we 
do drop it, then autoupgrade becomes very complicated and not really useful.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits

jyknight wrote:

> I guess the clang calling convention code never uses MMX types for 
> passing/returning values?

Correct, Clang never uses MMX types in its calling convention. This is actually 
_wrong_ for the 32-bit x86 psABI. You're supposed to pass the first 3 MMX args 
in mm0-2, and return the first MMX value in mm0. Yet...conflicting with those 
statements, it also states that all functions MUST be entered in x87 mode, and 
that you must call emms before returning. _shrug_.

We did attempt to implement the arg/return-passing rules for MMX in 
llvm/lib/Target/X86/X86CallingConv.td, but it doesn't actually apply to the IR 
Clang emits, since Clang never uses the `x86_mmx` type, except as needed around 
the MMX LLVM builtins and inline asm.

Anyhow, I propose that we _do not_ attempt to fix Clang's ABI to conform with 
the 32-bit psABI. We've gotten it wrong for a decade, and at this point, 
"fixing" it to use MMX registers would be worse than not doing so.

> Have you looked at the code quality? #41665 mentions potential issues with 
> widening vectors.

I've glanced at it. In optimized code, the codegen looks pretty good. 
Unoptimized code looks pretty bad _before_ my changes, and looks about the same 
after. I have not attempted to measure performance of any MMX-intrinsics-using 
code.

> This doesn't touch inline asm or _mm_empty; I guess you're leaving that for a 
> followup?

Correct. That needs additional work, which I have not done.

I do plan to add to this PR another commit that deletes all the `__builtin_*` 
MMX functions, which are no longer used, after the header changes here.

However, that will leave all those MMX intrinsics existing still on the LLVM 
side, and I'm not sure how to go about removing those. Should we just do it, 
and break bitcode backwards-compatibility for those files? Or, if we do need to 
preserve bitcode compat, how to best achieve it? Perhaps we convert them into 
inline-asm in the bitcode upgrader?

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits

https://github.com/jyknight edited 
https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 

jyknight wrote:

IIUC, these files intentionally don't have any dependency on subtarget 
preprocessor defines, in order that they can be used from functions with target 
attributes. So I'm not sure if adding an `#ifndef __SSE2__` would be acceptable 
here?

The current error is:
```
echo $'#include <mmintrin.h>\n__m64 f() { return _mm_cvtsi32_si64(5); }' | build/bin/clang -march=pentium3 -m32 -S -o - -xc -
<stdin>:2:20: error: always_inline function '_mm_cvtsi32_si64' requires target feature 'sse2', but would be inlined into function 'f' that is compiled without support for 'sse2'
    2 | __m64 f() { return _mm_cvtsi32_si64(5); }
      |                    ^
1 error generated.
=> exit status: 1
```
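(For context, a sketch of why the headers avoid `#ifdef __SSE2__`: a translation unit compiled without SSE2 at the command line can still opt individual functions into SSE2 via the `target` attribute, and those functions must still be able to see and inline the intrinsics. The function name is hypothetical:)

```c
#include <mmintrin.h>

/* Even when the TU is built with e.g. -march=pentium3 (so __SSE2__ is not
 * predefined), this function enables SSE2 locally and can legally inline the
 * always_inline, sse2-targeted intrinsic. */
__attribute__((target("sse2")))
__m64 add_bytes_sse2(__m64 a, __m64 b) {
  return _mm_add_pi8(a, b);
}
```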

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -177,7 +175,10 @@ _mm_abs_epi32(__m128i __a)
 /// \returns A 64-bit integer vector containing the concatenated right-shifted
 ///value.
 #define _mm_alignr_pi8(a, b, n) \
-  ((__m64)__builtin_ia32_palignr((__v8qi)(__m64)(a), (__v8qi)(__m64)(b), (n)))
+  ((__m64)__builtin_shufflevector(   \

jyknight wrote:

We cannot use it in a macro intended for external use, because `__trunc64` is 
#undef'ed at the end of the header.
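(A standalone sketch of the constraint being described, with the typedefs inlined for illustration:)

```c
typedef long long __m64  __attribute__((__vector_size__(8)));
typedef long long __v2di __attribute__((__vector_size__(16)));

/* Internal helper, as in the header under review: */
#define __trunc64(x) \
  (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)

/* ...but the header ends with: */
#undef __trunc64
/* so a user-facing macro such as _mm_alignr_pi8, which expands in user code
 * long after this point, cannot reference __trunc64 and must spell out the
 * __builtin_shufflevector call directly. */
```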

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -614,12 +623,15 @@ _mm_shuffle_epi8(__m128i __a, __m128i __b)
 ///1: Clear the corresponding byte in the destination. \n
 ///0: Copy the selected source byte to the corresponding byte in the
 ///destination. \n
-///Bits [3:0] select the source byte to be copied.
+///Bits [2:0] select the source byte to be copied.
 /// \returns A 64-bit integer vector containing the copied or cleared values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_shuffle_pi8(__m64 __a, __m64 __b)
 {
-return (__m64)__builtin_ia32_pshufb((__v8qi)__a, (__v8qi)__b);
+return __trunc64(__builtin_ia32_pshufb128(
+(__v16qi)__builtin_shufflevector(
+(__v2si)(__a), __extension__ (__v2si){}, 0, 1, 0, 1),

jyknight wrote:

The behavior is supposed to be that only the bottom 3 bits of `__b` affect the 
result for the 64-bit operation. For the 128-bit operation, however, the bottom 
4 bits are used.

By duplicating the input vector, we automatically get the proper behavior 
without having to mask out bit 3 of `__b`.
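(To make the trick concrete, a small test sketch comparing `_mm_shuffle_pi8` against a scalar reference that only honors bits [2:0] of each control byte; compile with SSSE3 enabled. The constants and helper are illustrative only:)

```c
#include <stdio.h>
#include <tmmintrin.h>

/* Scalar model of the 64-bit pshufb: bit 7 of a control byte clears the
 * destination byte, bits [2:0] select the source byte. */
static unsigned long long ref_shuffle_pi8(unsigned long long a,
                                          unsigned long long b) {
  unsigned long long r = 0;
  for (int i = 0; i < 8; i++) {
    unsigned char ctl = (b >> (8 * i)) & 0xff;
    unsigned char byte = (ctl & 0x80) ? 0 : (a >> (8 * (ctl & 0x7))) & 0xff;
    r |= (unsigned long long)byte << (8 * i);
  }
  return r;
}

int main(void) {
  unsigned long long a = 0x0706050403020100ULL;
  /* Control byte 0x0d has bit 3 set; because the 64-bit input is duplicated
   * into both halves of the xmm register, index 13 selects the same byte as
   * index 5, matching the MMX semantics without masking __b. */
  unsigned long long b = 0x800d0c0b0a010203ULL;
  unsigned long long got =
      (unsigned long long)_mm_shuffle_pi8((__m64)a, (__m64)b);
  printf("got=%016llx ref=%016llx\n", got, ref_shuffle_pi8(a, b));
  return got != ref_shuffle_pi8(a, b);
}
```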

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -242,10 +243,11 @@ _mm_hadd_epi32(__m128i __a, __m128i __b)
 ///destination.
 /// \returns A 64-bit vector of [4 x i16] containing the horizontal sums of 
both
 ///operands.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_hadd_pi16(__m64 __a, __m64 __b)
 {
-return (__m64)__builtin_ia32_phaddw((__v4hi)__a, (__v4hi)__b);
+return __extract2_32(__builtin_ia32_phaddw128((__v8hi)__anyext128(__a),

jyknight wrote:

Done.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -32,12 +32,13 @@ typedef unsigned int __v4su 
__attribute__((__vector_size__(16)));
 #endif
 
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS 
\
-  __attribute__((__always_inline__, __nodebug__, __target__("sse,no-evex512"), 
\
- __min_vector_width__(128)))
-#define __DEFAULT_FN_ATTRS_MMX 
\
-  __attribute__((__always_inline__, __nodebug__,   
\
- __target__("mmx,sse,no-evex512"), __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, 
__target__("sse,no-evex512"), __min_vector_width__(128)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, 
__target__("sse2,no-evex512"), __min_vector_width__(64)))
+
+#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ 
(__v2di){}, 0)
+#define __zext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), 
__extension__ (__v2si){}, 0, 1, 2, 3)

jyknight wrote:

Vectors follow the same init-list rules as array/aggregate, where unmentioned 
elements are zeroed.
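(A tiny standalone example of that rule, with the typedefs inlined; indices 2 and 3 pull the zero-initialized lanes of the compound literal, giving a true zero-extension:)

```c
typedef int __v2si __attribute__((__vector_size__(8)));
typedef int __v4si __attribute__((__vector_size__(16)));

int main(void) {
  __v2si lo = {1, 2};
  /* (__v2si){} zero-initializes every lane, so the result is {1, 2, 0, 0}. */
  __v4si wide = __builtin_shufflevector(lo, (__v2si){}, 0, 1, 2, 3);
  return wide[2] | wide[3]; /* 0 */
}
```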

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -0,0 +1,29 @@
+USE_XMM=
+#USE_XMM=--use-xmm

jyknight wrote:

It works if you first delete/ifdef-out test_stores and test_maskmove. I didn't 
make those generic.

(The "USE_XMM" version is to validate that the unused bits of the input xmm 
registers being set to arbitrary values won't cause trouble, since some of the 
instructions trigger side-effects, like floating-point exception flags being 
set.)

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -17,13 +17,11 @@
 #include 
 
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS 
\
-  __attribute__((__always_inline__, __nodebug__,   
\
- __target__("ssse3,no-evex512"), __min_vector_width__(64)))
-#define __DEFAULT_FN_ATTRS_MMX 
\
-  __attribute__((__always_inline__, __nodebug__,   
\
- __target__("mmx,ssse3,no-evex512"),   
\
- __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, 
__target__("ssse3,no-evex512"), __min_vector_width__(64)))

jyknight wrote:

Done.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -1558,10 +1559,10 @@ _mm_cvttss_si64(__m128 __a)
 /// \param __a
 ///A 128-bit vector of [4 x float].
 /// \returns A 64-bit integer vector containing the converted values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_cvttps_pi32(__m128 __a)
 {
-  return (__m64)__builtin_ia32_cvttps2pi((__v4sf)__a);
+  return __trunc64(__builtin_ia32_cvttps2dq((__v4sf)__zeroupper64(__a)));

jyknight wrote:

I'm not sure: is `__builtin_convertvector` from float->int guaranteed to have 
the same semantics as this requires?

Even if feasible, I'd prefer to leave that change to some future work that 
eliminates the `__builtin_ia32_cvttps2dq` (and similar functions), since the 
same should be done to `_mm_cvttps_epi32`, `_mm256_cvttps_epi32`, 
`_mm_cvtpd_epi32`, `_mm_cvtpd_pi32`, and `_mm256_cvtpd_epi32`, at least.
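(A hedged sketch of why the generic conversion isn't an obvious drop-in replacement here: `__builtin_convertvector` follows C's float-to-integer conversion rules, which leave NaN and out-of-range inputs undefined, whereas `cvttps2dq` is architecturally defined to return the "integer indefinite" value 0x80000000 for them. The helper name is illustrative:)

```c
typedef float __v4sf __attribute__((__vector_size__(16)));
typedef int   __v4si __attribute__((__vector_size__(16)));

/* Truncates toward zero like cvttps2dq for in-range inputs, but NaN and
 * out-of-range values are undefined behavior at the C level rather than
 * producing 0x80000000. */
static __v4si cvtt_generic(__v4sf x) {
  return __builtin_convertvector(x, __v4si);
}
```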

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), 
__aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));
+
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS 
\
-  __attribute__((__always_inline__, __nodebug__, __target__("mmx,no-evex512"), 
\
- __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, 
__target__("sse2,no-evex512"), __min_vector_width__(64)))

jyknight wrote:

Done; both here and in xmmintrin.h/tmmintrin.h.

I don't actually know what effect setting min_vector_width has versus only 
setting the target to sse2, but I also note that tmmintrin.h appears to have 
been incorrect before my changes.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), 
__aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));

jyknight wrote:

Users can include mmintrin.h by itself, though, so I had to copy them here.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -0,0 +1,29 @@
+USE_XMM=

jyknight wrote:

These are the tests I used to validate that the re-implementation behaves 
correctly. Having that validation was extremely important, because I certainly 
messed up a few times.

I did upload it intentionally, but I don't propose to commit as-is -- 
especially not in this location.

I definitely would like to commit something along these lines _somewhere_, 
however, so that future maintainers have the benefit of being able to test this 
functionality. Probably it belongs in the llvm-test-suite repository?

I'd love help on getting this into a submittable form.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -124,10 +143,11 @@ _mm_cvtm64_si64(__m64 __m)
 ///written to the upper 32 bits of the result.
 /// \returns A 64-bit integer vector of [8 x i8] containing the converted
 ///values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_packs_pi16(__m64 __m1, __m64 __m2)
 {
-return (__m64)__builtin_ia32_packsswb((__v4hi)__m1, (__v4hi)__m2);
+return __extract2_32(__builtin_ia32_packsswb128((__v8hi)__anyext128(__m1),

jyknight wrote:

So, the current version assembles to:
```
   0: 66 0f 63 c1          packsswb    %xmm1, %xmm0
   4: 66 0f 70 c0 e8       pshufd      $0xe8, %xmm0, %xmm0   # xmm0 = xmm0[0,2,2,3]
   9: c3                   retq
```

You're suggesting to instead shuffle the inputs, like:
```
return __trunc64(__builtin_ia32_packsswb128(
    __builtin_shufflevector(__m1, __m2, 0, 1), (__v8hi){}));
```

I agree, that's better. Saves 1 byte of code, and also less register pressure.
```
   0: 66 0f 6c c1          punpcklqdq  %xmm1, %xmm0          # xmm0 = xmm0[0],xmm1[0]
   4: 66 0f 63 c0          packsswb    %xmm0, %xmm0
   8: c3                   retq
```

Done -- eliminated all uses of `__extract2_32`.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -2502,10 +2509,25 @@ _mm_mulhi_pu16(__m64 __a, __m64 __b)
 ///A pointer to a 64-bit memory location that will receive the 
conditionally
 ///copied integer values. The address of the memory location does not have
 ///to be aligned.
-static __inline__ void __DEFAULT_FN_ATTRS_MMX
+static __inline__ void __DEFAULT_FN_ATTRS_SSE2
 _mm_maskmove_si64(__m64 __d, __m64 __n, char *__p)
 {
-  __builtin_ia32_maskmovq((__v8qi)__d, (__v8qi)__n, __p);
+  // This is complex, because we need to support the case where __p is pointing
+  // within the last 15 to 8 bytes of a page. In that case, using a 128-bit
+  // write might cause a trap where a 64-bit maskmovq would not. (Memory
+  // locations not selected by the mask bits might still cause traps.)
+  __m128i __d128  = __anyext128(__d);
+  __m128i __n128  = __zext128(__n);
+  if (((__SIZE_TYPE__)__p & 0xfff) >= 4096-15 &&
+  ((__SIZE_TYPE__)__p & 0xfff) <= 4096-8) {

jyknight wrote:

I believe it's correct as written: we need to ensure that we cross a potential 
page-protection boundary in exactly the same situations as we would've 
originally.

Since we're now executing a 16-byte write instead of the specified 8-byte 
write, that means we need to back up by 8 bytes when we're at offsets 15, 14, 
13, 12, 11, 10, 9, and 8 before the end of the page.

At 16 bytes before, we're guaranteed to be within the page for both writes, and 
at 7 bytes before, we're guaranteed to cross a potential boundary for both.
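(A small worked check of that arithmetic, not part of the patch: it prints exactly the page offsets where a 16-byte store starting at `__p` would cross a page boundary but the original 8-byte store would not:)

```c
#include <stdio.h>

int main(void) {
  for (unsigned off = 4096 - 20; off < 4096; off++) {
    int crosses8  = (off + 8)  > 4096;  /* original maskmovq footprint */
    int crosses16 = (off + 16) > 4096;  /* widened maskmovdqu footprint */
    if (crosses8 != crosses16)
      printf("offset 4096-%u: only the 16-byte store crosses the page\n",
             4096 - off);
  }
  /* Prints 4096-15 through 4096-8, i.e. the range checked in the header. */
  return 0;
}
```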

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -2108,9 +2106,8 @@ static __inline__ __m128i __DEFAULT_FN_ATTRS 
_mm_add_epi32(__m128i __a,
 /// \param __b
 ///A 64-bit integer.
 /// \returns A 64-bit integer containing the sum of both parameters.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX _mm_add_si64(__m64 __a,
-__m64 __b) {
-  return (__m64)__builtin_ia32_paddq((__v1di)__a, (__v1di)__b);
+static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_add_si64(__m64 __a, __m64 __b) {
+  return (__m64)(((unsigned long long)__a) + ((unsigned long long)__b));

jyknight wrote:

This must use an unsigned add, not a signed add, because wraparound must be 
defined behavior.

The same rationale applies to the other addition/subtraction cases you've noted 
in various places below, so I'm going to just mark those conversations as 
resolved instead of responding to each separately.
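(For illustration, a minimal sketch of the distinction, with the typedefs inlined; the function name is hypothetical:)

```c
typedef signed char   __v8qs __attribute__((__vector_size__(8)));
typedef unsigned char __v8qu __attribute__((__vector_size__(8)));

/* Element-wise addition that wraps mod 256 in every lane, like paddb does.
 * Doing the same arithmetic on the signed __v8qs type would rely on signed
 * overflow, which is not guaranteed to wrap, so the header casts to the
 * unsigned element types first. */
static __v8qu add_bytes_wrapping(__v8qu a, __v8qu b) {
  return a + b;
}
```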

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -1035,10 +1077,11 @@ _mm_srli_pi32(__m64 __m, int __count)
 /// \param __count
 ///A 64-bit integer vector interpreted as a single 64-bit integer.
 /// \returns A 64-bit integer vector containing the right-shifted value.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_srl_si64(__m64 __m, __m64 __count)
 {
-return (__m64)__builtin_ia32_psrlq((__v1di)__m, __count);
+return __trunc64(__builtin_ia32_psrlq128((__v2di)__anyext128(__m),
+ __anyext128(__count)));

jyknight wrote:

Done.

https://github.com/llvm/llvm-project/pull/96540
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits

https://github.com/jyknight commented:

Thanks for the detailed review comments!

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits


@@ -150,8 +150,8 @@ TARGET_BUILTIN(__builtin_ia32_pmovmskb, "iV8c", "ncV:64:", 
"mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pmulhuw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_psadbw, "V4sV8cV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pshufw, "V4sV4sIc", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "iV4sIi", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4siIi", "ncV:64:", "mmx,sse")
+TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "sV4sIi", "ncV:64:", "sse")
+TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4ssIi", "ncV:64:", "sse")

jyknight wrote:

The only purpose served by the `__builtin_ia32_vec_ext/set` builtins is to emit 
a diagnostic if the immediate value is out of range. They are otherwise generic 
extract/insert operations, which could just as well be spelled `((__v4hi)a)[n]` 
and `((__v4hi)a)[n] = d;` -- they do not lower to a target-specific LLVM 
intrinsic.

This change makes the MMX ones consistent with the rest -- previously the MMX 
ones _did_ lower to an LLVM intrinsic.
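(For reference, the generic spellings mentioned above look like this; a sketch with the typedef inlined and hypothetical function names:)

```c
typedef short __v4hi __attribute__((__vector_size__(8)));

/* Plain vector subscripting: the same operation the builtins perform, minus
 * the compile-time range check on the index. */
static short extract_v4hi(__v4hi a, int n) { return a[n]; }

static __v4hi insert_v4hi(__v4hi a, short d, int n) {
  a[n] = d;
  return a;
}
```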

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread James Y Knight via cfe-commits

https://github.com/jyknight edited 
https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Simon Pilgrim via cfe-commits


@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 

RKSimon wrote:

Should we add a deprecation message if SSE2 is not defined, or just rely on the 
__DEFAULT_FN_ATTRS_SSE2 function attribute error message?

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Simon Pilgrim via cfe-commits


@@ -614,12 +623,15 @@ _mm_shuffle_epi8(__m128i __a, __m128i __b)
 ///1: Clear the corresponding byte in the destination. \n
 ///0: Copy the selected source byte to the corresponding byte in the
 ///destination. \n
-///Bits [3:0] select the source byte to be copied.
+///Bits [2:0] select the source byte to be copied.
 /// \returns A 64-bit integer vector containing the copied or cleared values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_shuffle_pi8(__m64 __a, __m64 __b)
 {
-return (__m64)__builtin_ia32_pshufb((__v8qi)__a, (__v8qi)__b);
+return __trunc64(__builtin_ia32_pshufb128(
+(__v16qi)__builtin_shufflevector(
+(__v2si)(__a), __extension__ (__v2si){}, 0, 1, 0, 1),

RKSimon wrote:

MMX pshufb only uses the low 3 bits of each control byte for the index, while 
SSE uses 4 bits; by splatting the subvector we avoid having to AND the mask 
input, since the index can reference either the lo or hi subvector.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Eli Friedman via cfe-commits

efriedma-quic wrote:

I guess the clang calling convention code never uses MMX types for 
passing/returning values?

Have you looked at the code quality?  #41665 mentions potential issues with 
widening vectors.

This doesn't touch inline asm or _mm_empty; I guess you're leaving that for a 
followup?

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -2502,10 +2509,25 @@ _mm_mulhi_pu16(__m64 __a, __m64 __b)
 ///A pointer to a 64-bit memory location that will receive the 
conditionally
 ///copied integer values. The address of the memory location does not have
 ///to be aligned.
-static __inline__ void __DEFAULT_FN_ATTRS_MMX
+static __inline__ void __DEFAULT_FN_ATTRS_SSE2
 _mm_maskmove_si64(__m64 __d, __m64 __n, char *__p)
 {
-  __builtin_ia32_maskmovq((__v8qi)__d, (__v8qi)__n, __p);
+  // This is complex, because we need to support the case where __p is pointing
+  // within the last 15 to 8 bytes of a page. In that case, using a 128-bit
+  // write might cause a trap where a 64-bit maskmovq would not. (Memory
+  // locations not selected by the mask bits might still cause traps.)
+  __m128i __d128  = __anyext128(__d);
+  __m128i __n128  = __zext128(__n);
+  if (((__SIZE_TYPE__)__p & 0xfff) >= 4096-15 &&
+  ((__SIZE_TYPE__)__p & 0xfff) <= 4096-8) {

phoebewang wrote:

`<` ?

https://github.com/llvm/llvm-project/pull/96540
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -494,10 +520,10 @@ _mm_adds_pu16(__m64 __m1, __m64 __m2)
 ///A 64-bit integer vector of [8 x i8] containing the subtrahends.
 /// \returns A 64-bit integer vector of [8 x i8] containing the differences of
 ///both parameters.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_sub_pi8(__m64 __m1, __m64 __m2)
 {
-return (__m64)__builtin_ia32_psubb((__v8qi)__m1, (__v8qi)__m2);
+return (__m64)(((__v8qu)__m1) - ((__v8qu)__m2));

phoebewang wrote:

It seems you intentionally used the unsigned version. What's the reason for 
doing so?

https://github.com/llvm/llvm-project/pull/96540
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -0,0 +1,29 @@
+USE_XMM=

phoebewang wrote:

What are these tests used for? Did you upload your local tool unintentionally, 
or do you want it reviewed as well?

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -124,10 +143,11 @@ _mm_cvtm64_si64(__m64 __m)
 ///written to the upper 32 bits of the result.
 /// \returns A 64-bit integer vector of [8 x i8] containing the converted
 ///values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_packs_pi16(__m64 __m1, __m64 __m2)
 {
-return (__m64)__builtin_ia32_packsswb((__v4hi)__m1, (__v4hi)__m2);
+return __extract2_32(__builtin_ia32_packsswb128((__v8hi)__anyext128(__m1),

phoebewang wrote:

Would it be better to shuffle `__m1` and `__m2` first and then `__trunc64`? 
The same for below.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -177,7 +175,10 @@ _mm_abs_epi32(__m128i __a)
 /// \returns A 64-bit integer vector containing the concatenated right-shifted
 ///value.
 #define _mm_alignr_pi8(a, b, n) \
-  ((__m64)__builtin_ia32_palignr((__v8qi)(__m64)(a), (__v8qi)(__m64)(b), (n)))
+  ((__m64)__builtin_shufflevector(   \

phoebewang wrote:

__trunc64?

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), 
__aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));
+
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS 
\
-  __attribute__((__always_inline__, __nodebug__, __target__("mmx,no-evex512"), 
\
- __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, 
__target__("sse2,no-evex512"), __min_vector_width__(64)))

phoebewang wrote:

Since we use SSE vector instructions, we need `__min_vector_width__(128)` here.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -811,10 +843,11 @@ _mm_slli_pi32(__m64 __m, int __count)
 ///A 64-bit integer vector interpreted as a single 64-bit integer.
 /// \returns A 64-bit integer vector containing the left-shifted value. If
 /// \a __count is greater or equal to 64, the result is set to 0.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_sll_si64(__m64 __m, __m64 __count)
 {
-return (__m64)__builtin_ia32_psllq((__v1di)__m, __count);
+return __trunc64(__builtin_ia32_psllq128((__v2di)__anyext128(__m),
+ __anyext128(__count)));

phoebewang wrote:

Missing `(__v2di)`?

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -614,12 +623,15 @@ _mm_shuffle_epi8(__m128i __a, __m128i __b)
 ///1: Clear the corresponding byte in the destination. \n
 ///0: Copy the selected source byte to the corresponding byte in the
 ///destination. \n
-///Bits [3:0] select the source byte to be copied.
+///Bits [2:0] select the source byte to be copied.
 /// \returns A 64-bit integer vector containing the copied or cleared values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_shuffle_pi8(__m64 __a, __m64 __b)
 {
-return (__m64)__builtin_ia32_pshufb((__v8qi)__a, (__v8qi)__b);
+return __trunc64(__builtin_ia32_pshufb128(
+(__v16qi)__builtin_shufflevector(
+(__v2si)(__a), __extension__ (__v2si){}, 0, 1, 0, 1),

phoebewang wrote:

Why not use `__anyext128`?

https://github.com/llvm/llvm-project/pull/96540
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -150,8 +150,8 @@ TARGET_BUILTIN(__builtin_ia32_pmovmskb, "iV8c", "ncV:64:", 
"mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pmulhuw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_psadbw, "V4sV8cV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pshufw, "V4sV4sIc", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "iV4sIi", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4siIi", "ncV:64:", "mmx,sse")
+TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "sV4sIi", "ncV:64:", "sse")
+TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4ssIi", "ncV:64:", "sse")

phoebewang wrote:

I see there's only one use of them; why not use __trunc64/__anyext128 for 
them too?

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -1035,10 +1077,11 @@ _mm_srli_pi32(__m64 __m, int __count)
 /// \param __count
 ///A 64-bit integer vector interpreted as a single 64-bit integer.
 /// \returns A 64-bit integer vector containing the right-shifted value.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_srl_si64(__m64 __m, __m64 __count)
 {
-return (__m64)__builtin_ia32_psrlq((__v1di)__m, __count);
+return __trunc64(__builtin_ia32_psrlq128((__v2di)__anyext128(__m),
+ __anyext128(__count)));

phoebewang wrote:

ditto.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -242,10 +243,11 @@ _mm_hadd_epi32(__m128i __a, __m128i __b)
 ///destination.
 /// \returns A 64-bit vector of [4 x i16] containing the horizontal sums of 
both
 ///operands.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_hadd_pi16(__m64 __a, __m64 __b)
 {
-return (__m64)__builtin_ia32_phaddw((__v4hi)__a, (__v4hi)__b);
+return __extract2_32(__builtin_ia32_phaddw128((__v8hi)__anyext128(__a),

phoebewang wrote:

Shuffle first and then __trunc64? The same below.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -337,10 +363,10 @@ _mm_unpacklo_pi32(__m64 __m1, __m64 __m2)
 ///A 64-bit integer vector of [8 x i8].
 /// \returns A 64-bit integer vector of [8 x i8] containing the sums of both
 ///parameters.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_add_pi8(__m64 __m1, __m64 __m2)
 {
-return (__m64)__builtin_ia32_paddb((__v8qi)__m1, (__v8qi)__m2);
+return (__m64)(((__v8qu)__m1) + ((__v8qu)__m2));

phoebewang wrote:

`__v8qi`? Same for below.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), 
__aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));

phoebewang wrote:

These are already defined in emmintrin.h.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -2539,9 +2536,8 @@ static __inline__ __m128i __DEFAULT_FN_ATTRS 
_mm_sub_epi32(__m128i __a,
 ///A 64-bit integer vector containing the subtrahend.
 /// \returns A 64-bit integer vector containing the difference of the values in
 ///the operands.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX _mm_sub_si64(__m64 __a,
-__m64 __b) {
-  return (__m64)__builtin_ia32_psubq((__v1di)__a, (__v1di)__b);
+static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_sub_si64(__m64 __a, __m64 __b) {
+  return (__m64)((unsigned long long)__a - (unsigned long long)__b);

phoebewang wrote:

ditto.

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-25 Thread Phoebe Wang via cfe-commits


@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), 
__aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));
+
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS 
\
-  __attribute__((__always_inline__, __nodebug__, __target__("mmx,no-evex512"), 
\
- __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, 
__target__("sse2,no-evex512"), __min_vector_width__(64)))

phoebewang wrote:

Is the explicit `_SSE2` suffix needed here?

https://github.com/llvm/llvm-project/pull/96540


[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-24 Thread via cfe-commits

github-actions[bot] wrote:




:warning: C/C++ code formatter, clang-format found issues in your code. :warning:



You can test this locally with the following command:


```bash
git-clang-format --diff af6acd7442646fde56de919964bd52d7bb7922b2 
a17a0df1c3551693283dd806b901d3020f33e67f --extensions 'c,h,cpp' -- 
mmx-tests/test.c clang/lib/CodeGen/CGBuiltin.cpp clang/lib/Headers/emmintrin.h 
clang/lib/Headers/mmintrin.h clang/lib/Headers/tmmintrin.h 
clang/lib/Headers/xmmintrin.h clang/test/CodeGen/X86/mmx-builtins.c 
clang/test/CodeGen/X86/mmx-inline-asm.c 
clang/test/CodeGen/X86/mmx-shift-with-immediate.c 
clang/test/CodeGen/attr-target-x86-mmx.c clang/test/Headers/xmmintrin.c 
clang/test/Sema/x86-builtin-palignr.c
```





View the diff from clang-format here.


```diff
diff --git a/clang/lib/Headers/emmintrin.h b/clang/lib/Headers/emmintrin.h
index 02160285d5..a3176570a4 100644
--- a/clang/lib/Headers/emmintrin.h
+++ b/clang/lib/Headers/emmintrin.h
@@ -49,10 +49,15 @@ typedef __bf16 __m128bh __attribute__((__vector_size__(16), 
__aligned__(16)));
 #endif
 
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, 
__target__("sse2,no-evex512"), __min_vector_width__(128)))
-
-#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ 
(__v2di){}, 0)
-#define __anyext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), 
__extension__ (__v2si){}, 0, 1, -1, -1)
+#define __DEFAULT_FN_ATTRS 
\
+  __attribute__((__always_inline__, __nodebug__,   
\
+ __target__("sse2,no-evex512"), __min_vector_width__(128)))
+
+#define __trunc64(x)   
\
+  (__m64) __builtin_shufflevector((__v2di)(x), __extension__(__v2di){}, 0)
+#define __anyext128(x) 
\
+  (__m128i) __builtin_shufflevector((__v2si)(x), __extension__(__v2si){}, 0,   
\
+1, -1, -1)
 
 /// Adds lower double-precision values in both operands and returns the
 ///sum in the lower 64 bits of the result. The upper 64 bits of the result
diff --git a/clang/lib/Headers/mmintrin.h b/clang/lib/Headers/mmintrin.h
index 71d7487673..d4ccb3c92f 100644
--- a/clang/lib/Headers/mmintrin.h
+++ b/clang/lib/Headers/mmintrin.h
@@ -22,8 +22,8 @@ typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
 /* Unsigned types */
-typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
-typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned long long __v1du __attribute__((__vector_size__(8)));
+typedef unsigned int __v2su __attribute__((__vector_size__(8)));
 typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
 typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
 
@@ -33,17 +33,23 @@ typedef signed char __v8qs 
__attribute__((__vector_size__(8)));
 
 /* SSE/SSE2 types */
 typedef long long __m128i __attribute__((__vector_size__(16), 
__aligned__(16)));
-typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef long long __v2di __attribute__((__vector_size__(16)));
 typedef int __v4si __attribute__((__vector_size__(16)));
 typedef short __v8hi __attribute__((__vector_size__(16)));
 typedef char __v16qi __attribute__((__vector_size__(16)));
 
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, 
__target__("sse2,no-evex512"), __min_vector_width__(64)))
-
-#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ 
(__v2di){}, 0)
-#define __anyext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), 
__extension__ (__v2si){}, 0, 1, -1, -1)
-#define __extract2_32(a) (__m64)__builtin_shufflevector((__v4si)(a), 
__extension__ (__v4si){}, 0, 2);
+#define __DEFAULT_FN_ATTRS_SSE2
\
+  __attribute__((__always_inline__, __nodebug__,   
\
+ __target__("sse2,no-evex512"), __min_vector_width__(64)))
+
+#define __trunc64(x)   
\
+  (__m64) __builtin_shufflevector((__v2di)(x), __extension__(__v2di){}, 0)
+#define __anyext128(x) 
\
+  (__m128i) __builtin_shufflevector((__v2si)(x), __extension__(__v2si){}, 0,   
\
+1, -1, -1)
+#define __extract2_32(a)   
\
+  (__m64) __builtin_shufflevector((__v4si)(a), __extension__(__v4si){}, 0, 2);
 
 /// Clears the MMX state by setting the state of the x87 stack registers
 ///to empty.
@@ -69,10 +75,8 @@ _mm_empty(void) {
 ///A 32-bit integer 

[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-24 Thread via cfe-commits

github-actions[bot] wrote:




:warning: Python code formatter, darker found issues in your code. :warning:



You can test this locally with the following command:


```bash
darker --check --diff -r 
af6acd7442646fde56de919964bd52d7bb7922b2...a17a0df1c3551693283dd806b901d3020f33e67f
 mmx-tests/mmx-tests.py
```





View the diff from darker here.


```diff
--- mmx-tests.py2024-06-21 21:25:54.00 +
+++ mmx-tests.py2024-06-24 19:29:45.332608 +
@@ -1,301 +1,1048 @@
 #!/usr/bin/python3
 
 import argparse
 import sys
+
 
 # This is a list of all intel functions and macros which take or
 # return an __m64.
 def do_mmx(fn):
-  # mmintrin.h
-  fn("_mm_cvtsi32_si64", "__m64", ("int", ))
-  fn("_mm_cvtsi64_si32", "int", ("__m64", ))
-  fn("_mm_cvtsi64_m64", "__m64", ("long long", ), 
condition='defined(__X86_64__) || defined(__clang__)')
-  fn("_mm_cvtm64_si64", "long long", ("__m64", ), 
condition='defined(__X86_64__) || defined(__clang__)')
-  fn("_mm_packs_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_packs_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_packs_pu16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpackhi_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpackhi_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpackhi_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpacklo_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpacklo_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpacklo_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_add_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_add_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_add_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_adds_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_adds_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_adds_pu8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_adds_pu16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_sub_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_sub_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_sub_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_subs_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_subs_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_subs_pu8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_subs_pu16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_madd_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_mulhi_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_mullo_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_sll_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_slli_pi16", "__m64", ("__m64", "int", ))
-  fn("_mm_sll_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_slli_pi32", "__m64", ("__m64", "int", ))
-  fn("_mm_sll_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_slli_si64", "__m64", ("__m64", "int", ))
-  fn("_mm_sra_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srai_pi16", "__m64", ("__m64", "int", ))
-  fn("_mm_sra_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srai_pi32", "__m64", ("__m64", "int", ))
-  fn("_mm_srl_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srli_pi16", "__m64", ("__m64", "int", ))
-  fn("_mm_srl_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srli_pi32", "__m64", ("__m64", "int", ))
-  fn("_mm_srl_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srli_si64", "__m64", ("__m64", "int", ))
-  fn("_mm_and_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_andnot_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_or_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_xor_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpeq_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpeq_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpeq_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpgt_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpgt_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpgt_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_setzero_si64", "__m64", ())
-  fn("_mm_set_pi32", "__m64", ("int", "int", ))
-  fn("_mm_set_pi16", "__m64", ("short", "short", "short", "short", ))
-  fn("_mm_set_pi8", "__m64", ("char", "char", "char", "char", "char", "char", 
"char", "char", ))
-  fn("_mm_set1_pi32", "__m64", ("int", ))
-  fn("_mm_set1_pi16", "__m64", ("short", ))
-  fn("_mm_set1_pi8", "__m64", ("char", ))
-  fn("_mm_setr_pi32", "__m64", ("int", "int", ))
-  fn("_mm_setr_pi16", "__m64", ("short", "short", "short", "short", ))
-  fn("_mm_setr_pi8", "__m64", ("char", "char", "char", "char", "char", "char", 
"char", "char", ))
-
-  # xmmintrin.h
-  fn("_mm_cvtps_pi32", "__m64", ("__m128", ))
-  fn("_mm_cvt_ps2pi", "__m64", ("__m128", ))
-  fn("_mm_cvttps_pi32", "__m64", ("__m128", ))
-  fn("_mm_cvtt_ps2pi", "__m64", ("__m128", ))
-  fn("_mm_cvtpi32_ps", "__m128", ("__m128", "__m64", ))
-  fn("_mm_cvt_pi2ps", "__m128", ("__m128", "__m64", ))
-  fn("_mm_loadh_pi", "__m128", ("__m128", "const __m64 *", ))
-  fn("_mm_loadl_pi", "__m128", ("__m128", "const __m64 *", ))
-  fn("_mm_storeh_pi", "void", ("__m64 *", "__m128", ))
-  fn("_mm_storel_pi", "void", ("__m64 *", "__m128", ))
-  fn("_mm_stream_pi", "void", 

[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-24 Thread via cfe-commits

llvmbot wrote:




@llvm/pr-subscribers-llvm-ir

Author: James Y Knight (jyknight)


Changes

The MMX instruction set is legacy, and the SSE2 variants are in every way 
superior, when they are available -- and they have been available since the 
Pentium 4 was released, 20 years ago.

Therefore, we are switching the "MMX" intrinsics to depend on SSE2, 
unconditionally. This change entirely drops the ability to generate vectorized 
code using compiler intrinsics for chips with MMX but without SSE2: the Intel 
Pentium MMX, Pentium II, and Pentium III (released 1997-1999), as well as AMD 
K6 and K7 series chips of around the same timeframe. (Note that targeting these 
older CPUs remains supported, simply without the ability to use MMX compiler 
intrinsics.)

Migrating away from the use of MMX also fixes a rather non-obvious requirement 
for users of the intrinsics API. The long-standing programming model for MMX 
requires that the programmer be aware of the x87/MMX mode-switching semantics, 
and manually call _mm_empty() between using any MMX instruction and any x87 FPU 
instruction. If you neglect to, then every future x87 operation will return a 
NaN result. This requirement is not at all obvious to users of these 
intrinsics, and causes very difficult-to-detect bugs.
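
To make that failure mode concrete, here is a minimal sketch of the legacy 
programming model (illustrative only; the function and values are made up). Under 
the old MMX-backed implementation, omitting the `_mm_empty()` call could leave the 
FPU in MMX mode, so the x87 arithmetic that follows (on targets that evaluate 
`double` with x87) could yield NaN; under the SSE2-backed implementation there is 
no MMX state to clear in the first place.

```c
#include <mmintrin.h>

double sum_lo_and_scale(__m64 a, __m64 b, double scale) {
  __m64 s = _mm_add_pi16(a, b);  /* packed 16-bit add */
  int lo = _mm_cvtsi64_si32(s);  /* extract the low 32 bits */
  _mm_empty();                   /* legacy MMX model: clear MMX state before
                                    any x87 floating-point use */
  return lo + scale * 2.0;       /* without _mm_empty(), this could be NaN
                                    under the old model on x87 targets */
}
```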

Additionally, in some circumstances, LLVM may reorder x87 and MMX operations 
around each other, unaware of this mode-switching issue. So, even inserting 
_mm_empty() calls appropriately will not always guarantee correct operation.

Eliminating the use of MMX fixes both these latter issues.

(Originally uploaded at https://reviews.llvm.org/D86855)

Works towards issue #41665.

---

Patch is 125.33 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/96540.diff


16 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsX86.def (+2-2) 
- (modified) clang/lib/CodeGen/CGBuiltin.cpp (+2) 
- (modified) clang/lib/Headers/emmintrin.h (+20-22) 
- (modified) clang/lib/Headers/mmintrin.h (+177-128) 
- (modified) clang/lib/Headers/tmmintrin.h (+57-40) 
- (modified) clang/lib/Headers/xmmintrin.h (+89-99) 
- (modified) clang/test/CodeGen/X86/mmx-builtins.c (+109-100) 
- (modified) clang/test/CodeGen/X86/mmx-inline-asm.c (+1-1) 
- (modified) clang/test/CodeGen/X86/mmx-shift-with-immediate.c (+9-9) 
- (modified) clang/test/CodeGen/attr-target-x86-mmx.c (+3-4) 
- (modified) clang/test/Headers/xmmintrin.c (+1-1) 
- (modified) clang/test/Sema/x86-builtin-palignr.c (+1-1) 
- (modified) llvm/include/llvm/IR/IntrinsicsX86.td (+2-2) 
- (added) mmx-tests/Makefile (+29) 
- (added) mmx-tests/mmx-tests.py (+301) 
- (added) mmx-tests/test.c (+237) 


```diff
diff --git a/clang/include/clang/Basic/BuiltinsX86.def 
b/clang/include/clang/Basic/BuiltinsX86.def
index 7074479786b97..612673127a376 100644
--- a/clang/include/clang/Basic/BuiltinsX86.def
+++ b/clang/include/clang/Basic/BuiltinsX86.def
@@ -150,8 +150,8 @@ TARGET_BUILTIN(__builtin_ia32_pmovmskb, "iV8c", "ncV:64:", 
"mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pmulhuw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_psadbw, "V4sV8cV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pshufw, "V4sV4sIc", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "iV4sIi", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4siIi", "ncV:64:", "mmx,sse")
+TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "sV4sIi", "ncV:64:", "sse")
+TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4ssIi", "ncV:64:", "sse")
 
 // MMX+SSE2
 TARGET_BUILTIN(__builtin_ia32_cvtpd2pi, "V2iV2d", "ncV:64:", "mmx,sse2")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 931726a78dae9..4ccf0b1ac69b3 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -14355,6 +14355,7 @@ Value *CodeGenFunction::EmitX86BuiltinExpr(unsigned 
BuiltinID,
   case X86::BI__builtin_ia32_vec_init_v2si:
 return Builder.CreateBitCast(BuildVector(Ops),
  llvm::Type::getX86_MMXTy(getLLVMContext()));
+  case X86::BI__builtin_ia32_vec_ext_v4hi:
   case X86::BI__builtin_ia32_vec_ext_v2si:
   case X86::BI__builtin_ia32_vec_ext_v16qi:
   case X86::BI__builtin_ia32_vec_ext_v8hi:
@@ -14373,6 +14374,7 @@ Value *CodeGenFunction::EmitX86BuiltinExpr(unsigned 
BuiltinID,
 // Otherwise we could just do this in the header file.
 return Builder.CreateExtractElement(Ops[0], Index);
   }
+  case X86::BI__builtin_ia32_vec_set_v4hi:
   case X86::BI__builtin_ia32_vec_set_v16qi:
   case X86::BI__builtin_ia32_vec_set_v8hi:
   case X86::BI__builtin_ia32_vec_set_v4si:
diff --git a/clang/lib/Headers/emmintrin.h b/clang/lib/Headers/emmintrin.h
index e85bfc47aa5cc..02160285d58d6 100644
--- a/clang/lib/Headers/emmintrin.h
+++ b/clang/lib/Headers/emmintrin.h
@@ -49,12 +49,10 @@ typedef __bf16 __m128bh __attribute__((__vector_size__(16), 
__aligned__(16)));
 #endif
 
 /* Define the default 

[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)

2024-06-24 Thread via cfe-commits

llvmbot wrote:



@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-clang

Author: James Y Knight (jyknight)


Changes

The MMX instruction set is legacy, and the SSE2 variants are in every way 
superior, when they are available -- and they have been available since the 
Pentium 4 was released 20 years ago.

Therefore, we are switching the "MMX" intrinsics to depend on SSE2, 
unconditionally. This change entirely drops the ability to generate vectorized 
code using compiler intrinsics for chips with MMX but without SSE2: the Intel 
Pentium MMX, Pentium II, and Pentium III (released 1997-1999), as well as AMD 
K6 and K7 series chips of around the same timeframe. (Note that targeting these 
older CPUs remains supported, simply without the ability to use MMX compiler 
intrinsics.)

Migrating away from the use of MMX also fixes a rather non-obvious requirement 
for users of the intrinsics API. The long-standing programming model for MMX 
requires that the programmer be aware of the x87/MMX mode-switching semantics, 
and manually call _mm_empty() between using any MMX instruction and any x87 FPU 
instruction. If you neglect to, then every future x87 operation will return a 
NaN result. This requirement is not at all obvious to users of these 
intrinsics, and causes very difficult-to-detect bugs.

Additionally, in some circumstances, LLVM may reorder x87 and MMX operations 
around each other, unaware of this mode-switching issue. So, even inserting 
_mm_empty() calls appropriately will not always guarantee correct operation.

Eliminating the use of MMX fixes both these latter issues.

(Originally uploaded at https://reviews.llvm.org/D86855)

Works towards issue #41665.

---

Patch is 125.33 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/96540.diff


16 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsX86.def (+2-2) 
- (modified) clang/lib/CodeGen/CGBuiltin.cpp (+2) 
- (modified) clang/lib/Headers/emmintrin.h (+20-22) 
- (modified) clang/lib/Headers/mmintrin.h (+177-128) 
- (modified) clang/lib/Headers/tmmintrin.h (+57-40) 
- (modified) clang/lib/Headers/xmmintrin.h (+89-99) 
- (modified) clang/test/CodeGen/X86/mmx-builtins.c (+109-100) 
- (modified) clang/test/CodeGen/X86/mmx-inline-asm.c (+1-1) 
- (modified) clang/test/CodeGen/X86/mmx-shift-with-immediate.c (+9-9) 
- (modified) clang/test/CodeGen/attr-target-x86-mmx.c (+3-4) 
- (modified) clang/test/Headers/xmmintrin.c (+1-1) 
- (modified) clang/test/Sema/x86-builtin-palignr.c (+1-1) 
- (modified) llvm/include/llvm/IR/IntrinsicsX86.td (+2-2) 
- (added) mmx-tests/Makefile (+29) 
- (added) mmx-tests/mmx-tests.py (+301) 
- (added) mmx-tests/test.c (+237) 


```diff
diff --git a/clang/include/clang/Basic/BuiltinsX86.def 
b/clang/include/clang/Basic/BuiltinsX86.def
index 7074479786b97..612673127a376 100644
--- a/clang/include/clang/Basic/BuiltinsX86.def
+++ b/clang/include/clang/Basic/BuiltinsX86.def
@@ -150,8 +150,8 @@ TARGET_BUILTIN(__builtin_ia32_pmovmskb, "iV8c", "ncV:64:", 
"mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pmulhuw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_psadbw, "V4sV8cV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pshufw, "V4sV4sIc", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "iV4sIi", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4siIi", "ncV:64:", "mmx,sse")
+TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "sV4sIi", "ncV:64:", "sse")
+TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4ssIi", "ncV:64:", "sse")
 
 // MMX+SSE2
 TARGET_BUILTIN(__builtin_ia32_cvtpd2pi, "V2iV2d", "ncV:64:", "mmx,sse2")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 931726a78dae9..4ccf0b1ac69b3 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -14355,6 +14355,7 @@ Value *CodeGenFunction::EmitX86BuiltinExpr(unsigned 
BuiltinID,
   case X86::BI__builtin_ia32_vec_init_v2si:
 return Builder.CreateBitCast(BuildVector(Ops),
  llvm::Type::getX86_MMXTy(getLLVMContext()));
+  case X86::BI__builtin_ia32_vec_ext_v4hi:
   case X86::BI__builtin_ia32_vec_ext_v2si:
   case X86::BI__builtin_ia32_vec_ext_v16qi:
   case X86::BI__builtin_ia32_vec_ext_v8hi:
@@ -14373,6 +14374,7 @@ Value *CodeGenFunction::EmitX86BuiltinExpr(unsigned 
BuiltinID,
 // Otherwise we could just do this in the header file.
 return Builder.CreateExtractElement(Ops[0], Index);
   }
+  case X86::BI__builtin_ia32_vec_set_v4hi:
   case X86::BI__builtin_ia32_vec_set_v16qi:
   case X86::BI__builtin_ia32_vec_set_v8hi:
   case X86::BI__builtin_ia32_vec_set_v4si:
diff --git a/clang/lib/Headers/emmintrin.h b/clang/lib/Headers/emmintrin.h
index e85bfc47aa5cc..02160285d58d6 100644
--- a/clang/lib/Headers/emmintrin.h
+++ b/clang/lib/Headers/emmintrin.h
@@ -49,12 +49,10 @@ typedef __bf16 __m128bh __attribute__((__vector_size__(16), 
__aligned__(16)));