[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
jyknight wrote:

> Really, the question is whether we plan to completely drop support for the x86_mmx type (including inline asm operands/results)

Yes, I do think it would be good to eliminate the type.

For inline-asm, we could switch to using a standard IR vector type for "y" constraint operands/results, and teach IR lowering to copy to/from MMX registers at the border. This is basically what Clang does already, at the IR level; we'd be pushing that down into IR lowering. It would have some minor performance impact on anything passing MMX values directly between two inline-asm statements (redundant movdq2q/movq2dq), but, so far as I can tell, almost nobody ever uses the "y" constraint anyway -- MMX registers are more often loaded/stored to memory inside an inline-asm instead. Also, clearly nothing still using MMX can be _that_ performance sensitive, or it would've been migrated to SSE/AVX sometime in the last 20 years.

One more option, made trivial by eliminating the x86_mmx IR type, would be to insert an "emms" after the return-value extraction for all inline-asm statements which are marked with either MMX clobbers or a "y" constraint. It would be trivial at that point -- there'd no longer be any need for any special logic to track where to insert it, since we can be sure there is no live MMX state to worry about. That comes with more potential for performance impact, of course (but, again, maybe that doesn't actually matter).

https://github.com/llvm/llvm-project/pull/96540
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -0,0 +1,29 @@
+USE_XMM=

phoebewang wrote:

How about we move these old implementations into a separate file (or leave them where they are, if you like) and rename them to _dept, so that we don't rely on old compilers? We can then remove them, as well as the tests, in the next release.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), __aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));
+
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS \
-  __attribute__((__always_inline__, __nodebug__, __target__("mmx,no-evex512"), \
-                 __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, __target__("sse2,no-evex512"), __min_vector_width__(64)))

phoebewang wrote:

They are used to identify the maximum registers a function will use. The backend doesn't distinguish between scalar and 128-bit for now, but we should make it comply with the rule here.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
efriedma-quic wrote:

> Or, if we do need to preserve bitcode compat, how to best achieve it? Perhaps we convert them into inline-asm in the bitcode upgrader?

Really, the question is whether we plan to completely drop support for the x86_mmx type (including inline asm operands/results). If we don't, then there's not much reason to touch the LLVM IR builtins; there isn't any actual "code" outside of TableGen, so there's basically zero maintenance work. If we do drop it, then autoupgrade becomes very complicated and not really useful.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
jyknight wrote:

> I guess the clang calling convention code never uses MMX types for passing/returning values?

Correct, Clang never uses MMX types in its calling convention. This is actually _wrong_ for the 32-bit x86 psABI. You're supposed to pass the first 3 MMX args in mm0-2, and return the first MMX value in mm0. Yet, conflicting with those statements, it also states that all functions MUST be entered in x87 mode, and that you must call emms before returning. _shrug_

We did attempt to implement the arg/return-passing rules for MMX in llvm/lib/Target/X86/X86CallingConv.td, but it doesn't actually apply to the IR Clang emits, since Clang never uses the `x86_mmx` type, except as needed around the MMX LLVM-builtins and inline-asm.

Anyhow, I propose that we _do not_ attempt to fix Clang's ABI to conform with the 32-bit psABI. We've gotten it wrong for a decade, and at this point, "fixing" it to use MMX registers would be worse than not doing so.

> Have you looked at the code quality? #41665 mentions potential issues with widening vectors.

I've glanced at it. In optimized code, the codegen looks pretty good. Unoptimized code looks pretty bad _before_ my changes, and looks about the same after. I have not attempted to measure performance of any MMX-intrinsics-using code.

> This doesn't touch inline asm or _mm_empty; I guess you're leaving that for a followup?

Correct. That needs additional work, which I have not done.

I do plan to add to this PR another commit that deletes all the `__builtin_*` MMX functions, which are no longer used after the header changes here. However, that will leave all those MMX intrinsics existing still on the LLVM side, and I'm not sure how to go about removing those. Should we just do it, and break bitcode backwards-compatibility for those files? Or, if we do need to preserve bitcode compat, how to best achieve it? Perhaps we convert them into inline-asm in the bitcode upgrader?
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));

jyknight wrote:

IIUC, these files intentionally don't have any dependency on subtarget preprocessor defines, so that they can be used from functions with target attributes. So I'm not sure whether adding an `#ifndef __SSE2__` would be acceptable here? The current error is:

```
echo $'#include <mmintrin.h>\n__m64 f() { return _mm_cvtsi32_si64(5); }' | build/bin/clang -march=pentium3 -m32 -S -o - -xc -
<stdin>:2:20: error: always_inline function '_mm_cvtsi32_si64' requires target feature 'sse2', but would be inlined into function 'f' that is compiled without support for 'sse2'
    2 | __m64 f() { return _mm_cvtsi32_si64(5); }
      |                    ^
1 error generated.
=> exit status: 1
```
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -177,7 +175,10 @@ _mm_abs_epi32(__m128i __a)
 /// \returns A 64-bit integer vector containing the concatenated right-shifted
 ///    value.
 #define _mm_alignr_pi8(a, b, n) \
-  ((__m64)__builtin_ia32_palignr((__v8qi)(__m64)(a), (__v8qi)(__m64)(b), (n)))
+  ((__m64)__builtin_shufflevector( \

jyknight wrote:

Cannot use it in a macro for external use, because __trunc64 is #undef'd at the end of the header.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -614,12 +623,15 @@ _mm_shuffle_epi8(__m128i __a, __m128i __b)
 ///    1: Clear the corresponding byte in the destination. \n
 ///    0: Copy the selected source byte to the corresponding byte in the
 ///    destination. \n
-///    Bits [3:0] select the source byte to be copied.
+///    Bits [2:0] select the source byte to be copied.
 /// \returns A 64-bit integer vector containing the copied or cleared values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_shuffle_pi8(__m64 __a, __m64 __b)
 {
-    return (__m64)__builtin_ia32_pshufb((__v8qi)__a, (__v8qi)__b);
+    return __trunc64(__builtin_ia32_pshufb128(
+        (__v16qi)__builtin_shufflevector(
+            (__v2si)(__a), __extension__ (__v2si){}, 0, 1, 0, 1),

jyknight wrote:

The behavior is supposed to be that only the bottom 3 bits of `__b` affect the result for the 64-bit operation. For the 128-bit operation, however, the bottom 4 bits are used. By duplicating the input vector, we automatically get the proper behavior without having to mask out bit 3 of `__b`.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -242,10 +243,11 @@ _mm_hadd_epi32(__m128i __a, __m128i __b)
 ///    destination.
 /// \returns A 64-bit vector of [4 x i16] containing the horizontal sums of both
 ///    operands.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_hadd_pi16(__m64 __a, __m64 __b)
 {
-    return (__m64)__builtin_ia32_phaddw((__v4hi)__a, (__v4hi)__b);
+    return __extract2_32(__builtin_ia32_phaddw128((__v8hi)__anyext128(__a),

jyknight wrote:

Done.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -32,12 +32,13 @@ typedef unsigned int __v4su __attribute__((__vector_size__(16)));
 #endif

 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS \
-  __attribute__((__always_inline__, __nodebug__, __target__("sse,no-evex512"), \
-                 __min_vector_width__(128)))
-#define __DEFAULT_FN_ATTRS_MMX \
-  __attribute__((__always_inline__, __nodebug__, \
-                 __target__("mmx,sse,no-evex512"), __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("sse,no-evex512"), __min_vector_width__(128)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, __target__("sse2,no-evex512"), __min_vector_width__(64)))
+
+#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
+#define __zext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, 2, 3)

jyknight wrote:

Vectors follow the same init-list rules as arrays/aggregates, where unmentioned elements are zeroed.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -0,0 +1,29 @@
+USE_XMM=
+#USE_XMM=--use-xmm

jyknight wrote:

It works if you first delete/ifdef-out test_stores and test_maskmove; I didn't make those generic. (The "USE_XMM" version is to validate that the unused bits of the input xmm registers being set to arbitrary values won't cause trouble, since some of the instructions trigger side effects, like floating-point exception flags being set.)
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -17,13 +17,11 @@
 #include

 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS \
-  __attribute__((__always_inline__, __nodebug__, \
-                 __target__("ssse3,no-evex512"), __min_vector_width__(64)))
-#define __DEFAULT_FN_ATTRS_MMX \
-  __attribute__((__always_inline__, __nodebug__, \
-                 __target__("mmx,ssse3,no-evex512"), \
-                 __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("ssse3,no-evex512"), __min_vector_width__(64)))

jyknight wrote:

Done.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -1558,10 +1559,10 @@ _mm_cvttss_si64(__m128 __a)
 /// \param __a
 ///    A 128-bit vector of [4 x float].
 /// \returns A 64-bit integer vector containing the converted values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_cvttps_pi32(__m128 __a)
 {
-  return (__m64)__builtin_ia32_cvttps2pi((__v4sf)__a);
+  return __trunc64(__builtin_ia32_cvttps2dq((__v4sf)__zeroupper64(__a)));

jyknight wrote:

I'm not sure: is `__builtin_convertvector` from float->int guaranteed to have the same semantics as this requires? Even if feasible, I'd prefer to leave that change to some future work that eliminates `__builtin_ia32_cvttps2dq` (and similar functions), since the same should be done to `_mm_cvttps_epi32`, `_mm256_cvttps_epi32`, `_mm_cvtpd_epi32`, `_mm_cvtpd_pi32`, and `_mm256_cvtpd_epi32`, at least.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), __aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));
+
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS \
-  __attribute__((__always_inline__, __nodebug__, __target__("mmx,no-evex512"), \
-                 __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, __target__("sse2,no-evex512"), __min_vector_width__(64)))

jyknight wrote:

Done; both here and in xmmintrin.h/tmmintrin.h. I don't actually know what effect setting min_vector_width has versus only setting the target to sse2, but I also note that tmmintrin.h appears to have been incorrect before my changes.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), __aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));

jyknight wrote:

Users can include mmintrin.h by itself, though, so I had to copy them here.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -0,0 +1,29 @@
+USE_XMM=

jyknight wrote:

These are the tests I used to validate that the re-implementation works properly. I found it extremely important to have validation that my implementation works, because I certainly messed up a few times.

I did upload it intentionally, but I don't propose to commit it as-is -- especially not in this location. I definitely would like to commit something along these lines _somewhere_, however, so that future maintainers have the benefit of being able to test this functionality. Probably it belongs in the llvm-test-suite repository? I'd love help on getting this into a submittable form.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -124,10 +143,11 @@ _mm_cvtm64_si64(__m64 __m)
 ///    written to the upper 32 bits of the result.
 /// \returns A 64-bit integer vector of [8 x i8] containing the converted
 ///    values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_packs_pi16(__m64 __m1, __m64 __m2)
 {
-    return (__m64)__builtin_ia32_packsswb((__v4hi)__m1, (__v4hi)__m2);
+    return __extract2_32(__builtin_ia32_packsswb128((__v8hi)__anyext128(__m1),

jyknight wrote:

So, the current version assembles to:

```
0: 66 0f 63 c1     packsswb %xmm1, %xmm0
4: 66 0f 70 c0 e8  pshufd $0xe8, %xmm0, %xmm0  # xmm0 = xmm0[0,2,2,3]
9: c3              retq
```

You're suggesting to instead shuffle the inputs, like:

```
return __trunc64(__builtin_ia32_packsswb128(
    __builtin_shufflevector(__m1, __m2, 0, 1), (__v8hi){}));
```

I agree, that's better. Saves 1 byte of code, and also less register pressure.

```
0: 66 0f 6c c1     punpcklqdq %xmm1, %xmm0  # xmm0 = xmm0[0],xmm1[0]
4: 66 0f 63 c0     packsswb %xmm0, %xmm0
8: c3              retq
```

Done -- eliminated all uses of `__extract2_32`.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -2502,10 +2509,25 @@ _mm_mulhi_pu16(__m64 __a, __m64 __b)
 ///    A pointer to a 64-bit memory location that will receive the conditionally
 ///    copied integer values. The address of the memory location does not have
 ///    to be aligned.
-static __inline__ void __DEFAULT_FN_ATTRS_MMX
+static __inline__ void __DEFAULT_FN_ATTRS_SSE2
 _mm_maskmove_si64(__m64 __d, __m64 __n, char *__p)
 {
-  __builtin_ia32_maskmovq((__v8qi)__d, (__v8qi)__n, __p);
+  // This is complex, because we need to support the case where __p is pointing
+  // within the last 15 to 8 bytes of a page. In that case, using a 128-bit
+  // write might cause a trap where a 64-bit maskmovq would not. (Memory
+  // locations not selected by the mask bits might still cause traps.)
+  __m128i __d128 = __anyext128(__d);
+  __m128i __n128 = __zext128(__n);
+  if (((__SIZE_TYPE__)__p & 0xfff) >= 4096-15 &&
+      ((__SIZE_TYPE__)__p & 0xfff) <= 4096-8) {

jyknight wrote:

I believe it's correct as written: we need to ensure that we cross a potential page-protection boundary in exactly the same situations as we would've originally. Since we're now executing a 16-byte write instead of the specified 8-byte write, that means we need to back up by 8 bytes when we're at offsets 15, 14, 13, 12, 11, 10, 9, and 8 before the end of the page. At 16 bytes before, we're guaranteed to be within the page for both writes, and at 7 bytes before, we're guaranteed to cross the potential boundary for both.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -2108,9 +2106,8 @@ static __inline__ __m128i __DEFAULT_FN_ATTRS _mm_add_epi32(__m128i __a,
 /// \param __b
 ///    A 64-bit integer.
 /// \returns A 64-bit integer containing the sum of both parameters.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX _mm_add_si64(__m64 __a,
-                                                            __m64 __b) {
-  return (__m64)__builtin_ia32_paddq((__v1di)__a, (__v1di)__b);
+static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_add_si64(__m64 __a, __m64 __b) {
+  return (__m64)(((unsigned long long)__a) + ((unsigned long long)__b));

jyknight wrote:

This must use an unsigned add, not a signed add, because wraparound must be defined behavior. The same rationale applies to the other addition/subtraction cases you've noted in various places below, so I'm going to just mark those conversations as resolved instead of responding to each separately.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -1035,10 +1077,11 @@ _mm_srli_pi32(__m64 __m, int __count)
 /// \param __count
 ///    A 64-bit integer vector interpreted as a single 64-bit integer.
 /// \returns A 64-bit integer vector containing the right-shifted value.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_srl_si64(__m64 __m, __m64 __count)
 {
-    return (__m64)__builtin_ia32_psrlq((__v1di)__m, __count);
+    return __trunc64(__builtin_ia32_psrlq128((__v2di)__anyext128(__m),
+                                             __anyext128(__count)));

jyknight wrote:

Done.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
jyknight commented:

Thanks for the detailed review comments!
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -150,8 +150,8 @@ TARGET_BUILTIN(__builtin_ia32_pmovmskb, "iV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pmulhuw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_psadbw, "V4sV8cV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pshufw, "V4sV4sIc", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "iV4sIi", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4siIi", "ncV:64:", "mmx,sse")
+TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "sV4sIi", "ncV:64:", "sse")
+TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4ssIi", "ncV:64:", "sse")

jyknight wrote:

The only purpose served by `__builtin_ia32_vec_ext/set` is to emit a diagnostic if the immediate value is out of range. They are, otherwise, generic extract/insert operations, which could just as well be spelled `((__v4hi)a)[n]` and `((__v4hi)a)[n] = d;` -- they do not lower to a target-specific LLVM intrinsic. This change makes the MMX ones consistent with the rest -- previously the MMX ones _did_ lower to an LLVM intrinsic.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));

RKSimon wrote:

Add a deprecation message if SSE2 is not defined, or just rely on the __DEFAULT_FN_ATTRS_SSE2 function attribute error message?
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -614,12 +623,15 @@ _mm_shuffle_epi8(__m128i __a, __m128i __b)
 ///    1: Clear the corresponding byte in the destination. \n
 ///    0: Copy the selected source byte to the corresponding byte in the
 ///    destination. \n
-///    Bits [3:0] select the source byte to be copied.
+///    Bits [2:0] select the source byte to be copied.
 /// \returns A 64-bit integer vector containing the copied or cleared values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_shuffle_pi8(__m64 __a, __m64 __b)
 {
-    return (__m64)__builtin_ia32_pshufb((__v8qi)__a, (__v8qi)__b);
+    return __trunc64(__builtin_ia32_pshufb128(
+        (__v16qi)__builtin_shufflevector(
+            (__v2si)(__a), __extension__ (__v2si){}, 0, 1, 0, 1),

RKSimon wrote:

MMX pshufb only uses the low 3 bits for the index, while SSE uses 4 bits - by splatting the subvector we avoid having to AND the mask input, as it can reference either the lo or hi subvector.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
efriedma-quic wrote:

I guess the clang calling convention code never uses MMX types for passing/returning values?

Have you looked at the code quality? #41665 mentions potential issues with widening vectors.

This doesn't touch inline asm or _mm_empty; I guess you're leaving that for a followup?
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -2502,10 +2509,25 @@ _mm_mulhi_pu16(__m64 __a, __m64 __b)
 ///    A pointer to a 64-bit memory location that will receive the conditionally
 ///    copied integer values. The address of the memory location does not have
 ///    to be aligned.
-static __inline__ void __DEFAULT_FN_ATTRS_MMX
+static __inline__ void __DEFAULT_FN_ATTRS_SSE2
 _mm_maskmove_si64(__m64 __d, __m64 __n, char *__p)
 {
-  __builtin_ia32_maskmovq((__v8qi)__d, (__v8qi)__n, __p);
+  // This is complex, because we need to support the case where __p is pointing
+  // within the last 15 to 8 bytes of a page. In that case, using a 128-bit
+  // write might cause a trap where a 64-bit maskmovq would not. (Memory
+  // locations not selected by the mask bits might still cause traps.)
+  __m128i __d128 = __anyext128(__d);
+  __m128i __n128 = __zext128(__n);
+  if (((__SIZE_TYPE__)__p & 0xfff) >= 4096-15 &&
+      ((__SIZE_TYPE__)__p & 0xfff) <= 4096-8) {

phoebewang wrote:

`<` ?
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -494,10 +520,10 @@ _mm_adds_pu16(__m64 __m1, __m64 __m2)
 ///    A 64-bit integer vector of [8 x i8] containing the subtrahends.
 /// \returns A 64-bit integer vector of [8 x i8] containing the differences of
 ///    both parameters.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_sub_pi8(__m64 __m1, __m64 __m2)
 {
-    return (__m64)__builtin_ia32_psubb((__v8qi)__m1, (__v8qi)__m2);
+    return (__m64)(((__v8qu)__m1) - ((__v8qu)__m2));

phoebewang wrote:

It seems you intended to use the unsigned version. What's the reason for doing so?
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -0,0 +1,29 @@
+USE_XMM=

phoebewang wrote:

What are these tests used for? Is your local tool uploaded unintentionally, or do you want them to be reviewed as well?
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -124,10 +143,11 @@ _mm_cvtm64_si64(__m64 __m)
 ///    written to the upper 32 bits of the result.
 /// \returns A 64-bit integer vector of [8 x i8] containing the converted
 ///    values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_packs_pi16(__m64 __m1, __m64 __m2)
 {
-    return (__m64)__builtin_ia32_packsswb((__v4hi)__m1, (__v4hi)__m2);
+    return __extract2_32(__builtin_ia32_packsswb128((__v8hi)__anyext128(__m1),

phoebewang wrote:

Would it be better to shuffle `__m1` and `__m2` first and then `__trunc64`? The same applies below.
[clang] [llvm] Clang: convert `__m64` intrinsics to unconditionally use SSE2 instead of MMX. (PR #96540)
@@ -177,7 +175,10 @@ _mm_abs_epi32(__m128i __a)
 /// \returns A 64-bit integer vector containing the concatenated right-shifted
 ///    value.
 #define _mm_alignr_pi8(a, b, n)                                                \
-  ((__m64)__builtin_ia32_palignr((__v8qi)(__m64)(a), (__v8qi)(__m64)(b), (n)))
+  ((__m64)__builtin_shufflevector(                                             \

phoebewang wrote:

`__trunc64`?
@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), __aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));
+
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS                                                     \
-  __attribute__((__always_inline__, __nodebug__, __target__("mmx,no-evex512"), \
-                 __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, __target__("sse2,no-evex512"), __min_vector_width__(64)))

phoebewang wrote:

Since we now use SSE vector instructions, this needs `__min_vector_width__(128)`.
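For context, `__min_vector_width__(n)` declares the minimum vector register width the function requires, so the backend will not narrow its vectors under prefer-vector-width settings. If the suggestion is taken, the macro would presumably read as follows (a sketch of the reviewer's suggestion, not necessarily the committed code):

```c
/* 128-bit width, since the implementations now operate on XMM values. */
#define __DEFAULT_FN_ATTRS_SSE2                                                \
  __attribute__((__always_inline__, __nodebug__,                               \
                 __target__("sse2,no-evex512"), __min_vector_width__(128)))
```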
@@ -811,10 +843,11 @@ _mm_slli_pi32(__m64 __m, int __count)
 ///    A 64-bit integer vector interpreted as a single 64-bit integer.
 /// \returns A 64-bit integer vector containing the left-shifted value. If
 ///    \a __count is greater or equal to 64, the result is set to 0.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_sll_si64(__m64 __m, __m64 __count)
 {
-    return (__m64)__builtin_ia32_psllq((__v1di)__m, __count);
+    return __trunc64(__builtin_ia32_psllq128((__v2di)__anyext128(__m),
+                                             __anyext128(__count)));

phoebewang wrote:

Missing `(__v2di)`?
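As background for why a builtin, rather than a plain C `<<`, is needed here at all: the hardware shift clears the result for counts of 64 or more, while a C shift by the full type width is undefined behavior. A scalar reference of `_mm_sll_si64`'s semantics (an illustrative sketch):

```c
#include <stdint.h>

/* Reference semantics of _mm_sll_si64 / psllq: the count is taken as a
 * full 64-bit value, and any count >= 64 yields 0, rather than the
 * undefined behavior a plain C shift expression would have. */
uint64_t sll_si64_ref(uint64_t m, uint64_t count) {
    if (count >= 64)
        return 0;
    return m << count;
}
```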
@@ -614,12 +623,15 @@ _mm_shuffle_epi8(__m128i __a, __m128i __b)
 ///    1: Clear the corresponding byte in the destination. \n
 ///    0: Copy the selected source byte to the corresponding byte in the
 ///    destination. \n
-///    Bits [3:0] select the source byte to be copied.
+///    Bits [2:0] select the source byte to be copied.
 /// \returns A 64-bit integer vector containing the copied or cleared values.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_shuffle_pi8(__m64 __a, __m64 __b)
 {
-    return (__m64)__builtin_ia32_pshufb((__v8qi)__a, (__v8qi)__b);
+    return __trunc64(__builtin_ia32_pshufb128(
+        (__v16qi)__builtin_shufflevector(
+            (__v2si)(__a), __extension__ (__v2si){}, 0, 1, 0, 1),

phoebewang wrote:

Why not `__anyext128`?
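Some context on the duplication (this reflects how pshufb indexing works; the exact rationale is my inference, not a quote from the thread): the 64-bit form selects with bits [2:0], but the 128-bit pshufb selects with bits [3:0] across 16 bytes, so a selector with bit 3 set would read from the upper half. Extending with `__anyext128` leaves that upper half undefined, while duplicating the low 64 bits into both halves makes bit 3 harmless. A scalar reference of the 64-bit semantics:

```c
#include <stdint.h>

/* Reference semantics of the 64-bit _mm_shuffle_pi8: bit 7 of each
 * selector byte clears the output lane, bits [2:0] pick the source
 * byte; the remaining selector bits are ignored. */
void shuffle_pi8_ref(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (b[i] & 0x80) ? 0 : a[b[i] & 0x07];
}
```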
@@ -150,8 +150,8 @@ TARGET_BUILTIN(__builtin_ia32_pmovmskb, "iV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pmulhuw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_psadbw, "V4sV8cV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pshufw, "V4sV4sIc", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "iV4sIi", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4siIi", "ncV:64:", "mmx,sse")
+TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "sV4sIi", "ncV:64:", "sse")
+TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4ssIi", "ncV:64:", "sse")

phoebewang wrote:

I see there's only one use of each of these; why not use `__trunc64`/`__anyext128` for them too?
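For reference, the lone user of `__builtin_ia32_vec_ext_v4hi` is `_mm_extract_pi16`, whose scalar semantics are simple; the builtins are presumably kept because the lane index must be a compile-time constant that codegen lowers directly to an extract/insert-element (my reading of the trade-off, not a statement from the thread). An illustrative scalar sketch:

```c
#include <stdint.h>

/* Reference semantics of _mm_extract_pi16: read lane n (n taken
 * modulo 4) of a 4 x i16 vector and zero-extend it to int, as the
 * pextrw instruction does. */
int extract_pi16_ref(const int16_t v[4], int n) {
    return (uint16_t)v[n & 3];
}
```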
@@ -1035,10 +1077,11 @@ _mm_srli_pi32(__m64 __m, int __count)
 /// \param __count
 ///    A 64-bit integer vector interpreted as a single 64-bit integer.
 /// \returns A 64-bit integer vector containing the right-shifted value.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_srl_si64(__m64 __m, __m64 __count)
 {
-    return (__m64)__builtin_ia32_psrlq((__v1di)__m, __count);
+    return __trunc64(__builtin_ia32_psrlq128((__v2di)__anyext128(__m),
+                                             __anyext128(__count)));

phoebewang wrote:

Ditto.
@@ -242,10 +243,11 @@ _mm_hadd_epi32(__m128i __a, __m128i __b)
 ///    destination.
 /// \returns A 64-bit vector of [4 x i16] containing the horizontal sums of both
 ///    operands.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX
+static __inline__ __m64 __DEFAULT_FN_ATTRS
 _mm_hadd_pi16(__m64 __a, __m64 __b)
 {
-    return (__m64)__builtin_ia32_phaddw((__v4hi)__a, (__v4hi)__b);
+    return __extract2_32(__builtin_ia32_phaddw128((__v8hi)__anyext128(__a),

phoebewang wrote:

Shuffle first and then `__trunc64`? The same below.
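For readers, the semantics being lowered here, as a plain-C sketch: `_mm_hadd_pi16` sums adjacent lane pairs, with `__a`'s pair sums in the low half of the result and `__b`'s in the high half.

```c
#include <stdint.h>

/* Reference semantics of _mm_hadd_pi16 (sums wrap on overflow). */
void hadd_pi16_ref(const int16_t a[4], const int16_t b[4], int16_t out[4]) {
    out[0] = (int16_t)(a[0] + a[1]);
    out[1] = (int16_t)(a[2] + a[3]);
    out[2] = (int16_t)(b[0] + b[1]);
    out[3] = (int16_t)(b[2] + b[3]);
}
```

On the design point: when the two 64-bit inputs are widened separately, the 128-bit phaddw leaves a's pair sums in 16-bit lanes 0-1 and b's in lanes 4-5, so the result is gathered with `__extract2_32` (32-bit elements 0 and 2); the reviewer's alternative is to shuffle both inputs into one register first so the result lands contiguously in the low 64 bits.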
@@ -337,10 +363,10 @@ _mm_unpacklo_pi32(__m64 __m1, __m64 __m2)
 ///    A 64-bit integer vector of [8 x i8].
 /// \returns A 64-bit integer vector of [8 x i8] containing the sums of both
 ///    parameters.
-static __inline__ __m64 __DEFAULT_FN_ATTRS
+static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2
 _mm_add_pi8(__m64 __m1, __m64 __m2)
 {
-    return (__m64)__builtin_ia32_paddb((__v8qi)__m1, (__v8qi)__m2);
+    return (__m64)(((__v8qu)__m1) + ((__v8qu)__m2));

phoebewang wrote:

`__v8qi`? Same for below.
@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), __aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));

phoebewang wrote:

These are already defined in emmintrin.h.
@@ -2539,9 +2536,8 @@ static __inline__ __m128i __DEFAULT_FN_ATTRS _mm_sub_epi32(__m128i __a,
 ///    A 64-bit integer vector containing the subtrahend.
 /// \returns A 64-bit integer vector containing the difference of the values in
 ///    the operands.
-static __inline__ __m64 __DEFAULT_FN_ATTRS_MMX _mm_sub_si64(__m64 __a,
-                                                            __m64 __b) {
-  return (__m64)__builtin_ia32_psubq((__v1di)__a, (__v1di)__b);
+static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_sub_si64(__m64 __a, __m64 __b) {
+  return (__m64)((unsigned long long)__a - (unsigned long long)__b);

phoebewang wrote:

Ditto.
@@ -21,10 +21,29 @@ typedef int __v2si __attribute__((__vector_size__(8)));
 typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
+/* Unsigned types */
+typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
+typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
+
+/* We need an explicitly signed variant for char. Note that this shouldn't
+ * appear in the interface though. */
+typedef signed char __v8qs __attribute__((__vector_size__(8)));
+
+/* SSE/SSE2 types */
+typedef long long __m128i __attribute__((__vector_size__(16), __aligned__(16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef char __v16qi __attribute__((__vector_size__(16)));
+
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS                                                     \
-  __attribute__((__always_inline__, __nodebug__, __target__("mmx,no-evex512"), \
-                 __min_vector_width__(64)))
+#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, __target__("sse2,no-evex512"), __min_vector_width__(64)))

phoebewang wrote:

Is the explicit `_SSE2` suffix still needed?
github-actions[bot] wrote:

:warning: C/C++ code formatter, clang-format found issues in your code. :warning:

You can test this locally with the following command:

```bash
git-clang-format --diff af6acd7442646fde56de919964bd52d7bb7922b2 a17a0df1c3551693283dd806b901d3020f33e67f --extensions 'c,h,cpp' -- mmx-tests/test.c clang/lib/CodeGen/CGBuiltin.cpp clang/lib/Headers/emmintrin.h clang/lib/Headers/mmintrin.h clang/lib/Headers/tmmintrin.h clang/lib/Headers/xmmintrin.h clang/test/CodeGen/X86/mmx-builtins.c clang/test/CodeGen/X86/mmx-inline-asm.c clang/test/CodeGen/X86/mmx-shift-with-immediate.c clang/test/CodeGen/attr-target-x86-mmx.c clang/test/Headers/xmmintrin.c clang/test/Sema/x86-builtin-palignr.c
```

View the diff from clang-format here.

```diff
diff --git a/clang/lib/Headers/emmintrin.h b/clang/lib/Headers/emmintrin.h
index 02160285d5..a3176570a4 100644
--- a/clang/lib/Headers/emmintrin.h
+++ b/clang/lib/Headers/emmintrin.h
@@ -49,10 +49,15 @@ typedef __bf16 __m128bh __attribute__((__vector_size__(16), __aligned__(16)));
 #endif
 
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("sse2,no-evex512"), __min_vector_width__(128)))
-
-#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
-#define __anyext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
+#define __DEFAULT_FN_ATTRS                                                     \
+  __attribute__((__always_inline__, __nodebug__,                               \
+                 __target__("sse2,no-evex512"), __min_vector_width__(128)))
+
+#define __trunc64(x)                                                           \
+  (__m64) __builtin_shufflevector((__v2di)(x), __extension__(__v2di){}, 0)
+#define __anyext128(x)                                                         \
+  (__m128i) __builtin_shufflevector((__v2si)(x), __extension__(__v2si){}, 0,   \
+                                    1, -1, -1)
 
 /// Adds lower double-precision values in both operands and returns the
 ///    sum in the lower 64 bits of the result. The upper 64 bits of the result
diff --git a/clang/lib/Headers/mmintrin.h b/clang/lib/Headers/mmintrin.h
index 71d7487673..d4ccb3c92f 100644
--- a/clang/lib/Headers/mmintrin.h
+++ b/clang/lib/Headers/mmintrin.h
@@ -22,8 +22,8 @@ typedef short __v4hi __attribute__((__vector_size__(8)));
 typedef char __v8qi __attribute__((__vector_size__(8)));
 
 /* Unsigned types */
-typedef unsigned long long __v1du __attribute__ ((__vector_size__ (8)));
-typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+typedef unsigned long long __v1du __attribute__((__vector_size__(8)));
+typedef unsigned int __v2su __attribute__((__vector_size__(8)));
 typedef unsigned short __v4hu __attribute__((__vector_size__(8)));
 typedef unsigned char __v8qu __attribute__((__vector_size__(8)));
 
@@ -33,17 +33,23 @@ typedef signed char __v8qs __attribute__((__vector_size__(8)));
 
 /* SSE/SSE2 types */
 typedef long long __m128i __attribute__((__vector_size__(16), __aligned__(16)));
-typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+typedef long long __v2di __attribute__((__vector_size__(16)));
 typedef int __v4si __attribute__((__vector_size__(16)));
 typedef short __v8hi __attribute__((__vector_size__(16)));
 typedef char __v16qi __attribute__((__vector_size__(16)));
 
 /* Define the default attributes for the functions in this file. */
-#define __DEFAULT_FN_ATTRS_SSE2 __attribute__((__always_inline__, __nodebug__, __target__("sse2,no-evex512"), __min_vector_width__(64)))
-
-#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
-#define __anyext128(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
-#define __extract2_32(a) (__m64)__builtin_shufflevector((__v4si)(a), __extension__ (__v4si){}, 0, 2);
+#define __DEFAULT_FN_ATTRS_SSE2                                                \
+  __attribute__((__always_inline__, __nodebug__,                               \
+                 __target__("sse2,no-evex512"), __min_vector_width__(64)))
+
+#define __trunc64(x)                                                           \
+  (__m64) __builtin_shufflevector((__v2di)(x), __extension__(__v2di){}, 0)
+#define __anyext128(x)                                                         \
+  (__m128i) __builtin_shufflevector((__v2si)(x), __extension__(__v2si){}, 0,   \
+                                    1, -1, -1)
+#define __extract2_32(a)                                                       \
+  (__m64) __builtin_shufflevector((__v4si)(a), __extension__(__v4si){}, 0, 2);
 
 /// Clears the MMX state by setting the state of the x87 stack registers
 ///    to empty.
@@ -69,10 +75,8 @@ _mm_empty(void) {
 ///    A 32-bit integer
```
github-actions[bot] wrote:

:warning: Python code formatter, darker found issues in your code. :warning:

You can test this locally with the following command:

```bash
darker --check --diff -r af6acd7442646fde56de919964bd52d7bb7922b2...a17a0df1c3551693283dd806b901d3020f33e67f mmx-tests/mmx-tests.py
```

View the diff from darker here.

```diff
--- mmx-tests.py  2024-06-21 21:25:54.00 +
+++ mmx-tests.py  2024-06-24 19:29:45.332608 +
@@ -1,301 +1,1048 @@
 #!/usr/bin/python3
 import argparse
 import sys
+
 # This is a list of all intel functions and macros which take or
 # return an __m64.
 def do_mmx(fn):
-  # mmintrin.h
-  fn("_mm_cvtsi32_si64", "__m64", ("int", ))
-  fn("_mm_cvtsi64_si32", "int", ("__m64", ))
-  fn("_mm_cvtsi64_m64", "__m64", ("long long", ), condition='defined(__X86_64__) || defined(__clang__)')
-  fn("_mm_cvtm64_si64", "long long", ("__m64", ), condition='defined(__X86_64__) || defined(__clang__)')
-  fn("_mm_packs_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_packs_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_packs_pu16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpackhi_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpackhi_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpackhi_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpacklo_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpacklo_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_unpacklo_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_add_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_add_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_add_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_adds_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_adds_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_adds_pu8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_adds_pu16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_sub_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_sub_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_sub_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_subs_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_subs_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_subs_pu8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_subs_pu16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_madd_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_mulhi_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_mullo_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_sll_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_slli_pi16", "__m64", ("__m64", "int", ))
-  fn("_mm_sll_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_slli_pi32", "__m64", ("__m64", "int", ))
-  fn("_mm_sll_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_slli_si64", "__m64", ("__m64", "int", ))
-  fn("_mm_sra_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srai_pi16", "__m64", ("__m64", "int", ))
-  fn("_mm_sra_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srai_pi32", "__m64", ("__m64", "int", ))
-  fn("_mm_srl_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srli_pi16", "__m64", ("__m64", "int", ))
-  fn("_mm_srl_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srli_pi32", "__m64", ("__m64", "int", ))
-  fn("_mm_srl_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_srli_si64", "__m64", ("__m64", "int", ))
-  fn("_mm_and_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_andnot_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_or_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_xor_si64", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpeq_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpeq_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpeq_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpgt_pi8", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpgt_pi16", "__m64", ("__m64", "__m64", ))
-  fn("_mm_cmpgt_pi32", "__m64", ("__m64", "__m64", ))
-  fn("_mm_setzero_si64", "__m64", ())
-  fn("_mm_set_pi32", "__m64", ("int", "int", ))
-  fn("_mm_set_pi16", "__m64", ("short", "short", "short", "short", ))
-  fn("_mm_set_pi8", "__m64", ("char", "char", "char", "char", "char", "char", "char", "char", ))
-  fn("_mm_set1_pi32", "__m64", ("int", ))
-  fn("_mm_set1_pi16", "__m64", ("short", ))
-  fn("_mm_set1_pi8", "__m64", ("char", ))
-  fn("_mm_setr_pi32", "__m64", ("int", "int", ))
-  fn("_mm_setr_pi16", "__m64", ("short", "short", "short", "short", ))
-  fn("_mm_setr_pi8", "__m64", ("char", "char", "char", "char", "char", "char", "char", "char", ))
-
-  # xmmintrin.h
-  fn("_mm_cvtps_pi32", "__m64", ("__m128", ))
-  fn("_mm_cvt_ps2pi", "__m64", ("__m128", ))
-  fn("_mm_cvttps_pi32", "__m64", ("__m128", ))
-  fn("_mm_cvtt_ps2pi", "__m64", ("__m128", ))
-  fn("_mm_cvtpi32_ps", "__m128", ("__m128", "__m64", ))
-  fn("_mm_cvt_pi2ps", "__m128", ("__m128", "__m64", ))
-  fn("_mm_loadh_pi", "__m128", ("__m128", "const __m64 *", ))
-  fn("_mm_loadl_pi", "__m128", ("__m128", "const __m64 *", ))
-  fn("_mm_storeh_pi", "void", ("__m64 *", "__m128", ))
-  fn("_mm_storel_pi", "void", ("__m64 *", "__m128", ))
-  fn("_mm_stream_pi", "void",
```
llvmbot wrote:

@llvm/pr-subscribers-llvm-ir

Author: James Y Knight (jyknight)

Changes

The MMX instruction set is legacy, and the SSE2 variants are in every way superior, when they are available -- and they have been available since the Pentium 4 was released, 20 years ago. Therefore, we are switching the "MMX" intrinsics to depend on SSE2, unconditionally.

This change entirely drops the ability to generate vectorized code using compiler intrinsics for chips with MMX but without SSE2: the Intel Pentium MMX, Pentium II, and Pentium III (released 1997-1999), as well as AMD K6 and K7 series chips of around the same timeframe. (Note that targeting these older CPUs remains supported, simply without the ability to use MMX compiler intrinsics.)

Migrating away from MMX also fixes a rather non-obvious requirement for users of the intrinsics API. The long-standing programming model for MMX requires that the programmer be aware of the x87/MMX mode-switching semantics, and manually call _mm_empty() between using any MMX instruction and any x87 FPU instruction. If you neglect to, then every future x87 operation will return a NaN result. This requirement is not at all obvious to users of these intrinsics, and causes very difficult-to-detect bugs.

Additionally, in some circumstances, LLVM may reorder x87 and MMX operations around each other, unaware of this mode-switching issue. So, even inserting _mm_empty() calls appropriately will not always guarantee correct operation.

Eliminating the use of MMX fixes both of these latter issues.

(Originally uploaded at https://reviews.llvm.org/D86855)

Works towards issue #41665.
---

Patch is 125.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96540.diff

16 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsX86.def (+2-2)
- (modified) clang/lib/CodeGen/CGBuiltin.cpp (+2)
- (modified) clang/lib/Headers/emmintrin.h (+20-22)
- (modified) clang/lib/Headers/mmintrin.h (+177-128)
- (modified) clang/lib/Headers/tmmintrin.h (+57-40)
- (modified) clang/lib/Headers/xmmintrin.h (+89-99)
- (modified) clang/test/CodeGen/X86/mmx-builtins.c (+109-100)
- (modified) clang/test/CodeGen/X86/mmx-inline-asm.c (+1-1)
- (modified) clang/test/CodeGen/X86/mmx-shift-with-immediate.c (+9-9)
- (modified) clang/test/CodeGen/attr-target-x86-mmx.c (+3-4)
- (modified) clang/test/Headers/xmmintrin.c (+1-1)
- (modified) clang/test/Sema/x86-builtin-palignr.c (+1-1)
- (modified) llvm/include/llvm/IR/IntrinsicsX86.td (+2-2)
- (added) mmx-tests/Makefile (+29)
- (added) mmx-tests/mmx-tests.py (+301)
- (added) mmx-tests/test.c (+237)

```diff
diff --git a/clang/include/clang/Basic/BuiltinsX86.def b/clang/include/clang/Basic/BuiltinsX86.def
index 7074479786b97..612673127a376 100644
--- a/clang/include/clang/Basic/BuiltinsX86.def
+++ b/clang/include/clang/Basic/BuiltinsX86.def
@@ -150,8 +150,8 @@ TARGET_BUILTIN(__builtin_ia32_pmovmskb, "iV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pmulhuw, "V4sV4sV4s", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_psadbw, "V4sV8cV8c", "ncV:64:", "mmx,sse")
 TARGET_BUILTIN(__builtin_ia32_pshufw, "V4sV4sIc", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "iV4sIi", "ncV:64:", "mmx,sse")
-TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4siIi", "ncV:64:", "mmx,sse")
+TARGET_BUILTIN(__builtin_ia32_vec_ext_v4hi, "sV4sIi", "ncV:64:", "sse")
+TARGET_BUILTIN(__builtin_ia32_vec_set_v4hi, "V4sV4ssIi", "ncV:64:", "sse")
 
 // MMX+SSE2
 TARGET_BUILTIN(__builtin_ia32_cvtpd2pi, "V2iV2d", "ncV:64:", "mmx,sse2")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 931726a78dae9..4ccf0b1ac69b3 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -14355,6 +14355,7 @@ Value *CodeGenFunction::EmitX86BuiltinExpr(unsigned BuiltinID,
   case X86::BI__builtin_ia32_vec_init_v2si:
     return Builder.CreateBitCast(BuildVector(Ops),
                                  llvm::Type::getX86_MMXTy(getLLVMContext()));
+  case X86::BI__builtin_ia32_vec_ext_v4hi:
   case X86::BI__builtin_ia32_vec_ext_v2si:
   case X86::BI__builtin_ia32_vec_ext_v16qi:
   case X86::BI__builtin_ia32_vec_ext_v8hi:
@@ -14373,6 +14374,7 @@ Value *CodeGenFunction::EmitX86BuiltinExpr(unsigned BuiltinID,
     // Otherwise we could just do this in the header file.
     return Builder.CreateExtractElement(Ops[0], Index);
   }
+  case X86::BI__builtin_ia32_vec_set_v4hi:
   case X86::BI__builtin_ia32_vec_set_v16qi:
   case X86::BI__builtin_ia32_vec_set_v8hi:
   case X86::BI__builtin_ia32_vec_set_v4si:
diff --git a/clang/lib/Headers/emmintrin.h b/clang/lib/Headers/emmintrin.h
index e85bfc47aa5cc..02160285d58d6 100644
--- a/clang/lib/Headers/emmintrin.h
+++ b/clang/lib/Headers/emmintrin.h
@@ -49,12 +49,10 @@ typedef __bf16 __m128bh __attribute__((__vector_size__(16), __aligned__(16)));
 #endif
 
 /* Define the default
```
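To make the `_mm_empty()` requirement described above concrete, here is the usage pattern the legacy model demanded (a sketch assuming an x86 target; the lane values are chosen arbitrarily for illustration):

```c
#include <mmintrin.h>

/* Under the legacy MMX programming model, any MMX intrinsic puts the
 * x87 register stack into MMX mode, and subsequent x87 floating-point
 * math would produce NaNs until _mm_empty() resets the stack. With the
 * SSE2-backed implementations from this PR, no MMX state is ever live
 * and _mm_empty() compiles away to nothing. */
double mmx_then_float(void) {
    __m64 a = _mm_set_pi16(1, 2, 3, 4);     /* last argument is lane 0 */
    __m64 b = _mm_set_pi16(10, 20, 30, 40); /* lane 0 is 40 */
    __m64 sum = _mm_add_pi16(a, b);
    short lane0 = (short)_mm_cvtsi64_si32(sum); /* 4 + 40 = 44 */
    _mm_empty(); /* required before x87 code under the old model */
    return lane0 * 0.5;
}
```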
llvmbot wrote:

@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-clang