[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #11 from ubizjak at gmail dot com 2010-09-06 12:05 --- Reopened due to 4.6 regression, see [1]. [1] http://gcc.gnu.org/ml/gcc-testresults/2010-09/msg00529.html -- ubizjak at gmail dot com changed: What|Removed |Added Status|RESOLVED|REOPENED Resolution|FIXED | http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #12 from uros at gcc dot gnu dot org 2010-09-06 17:51 --- Subject: Bug 22152 Author: uros Date: Mon Sep 6 17:51:12 2010 New Revision: 163926 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163926 Log: PR target/22152 * config/i386/mmx.md (*mov_internal_rex64, *mov_internal_avx, *mov_internal, *movv2sf_internal_rex64_avx, *movv2sf_internal_rex64, *movv2sf_internal_avx, *movv2sf_internal): Split out !y-!y alternative. [ Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/mmx.md -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #13 from uros at gcc dot gnu dot org 2010-09-06 17:55 --- Subject: Bug 22152 Author: uros Date: Mon Sep 6 17:54:46 2010 New Revision: 163927 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163927 Log: PR target/22152 * gcc.target/i386/pr22152.c (add3): Change "count" to unsigned int. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/i386/pr22152.c -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #14 from ubizjak at gmail dot com 2010-09-06 18:13 --- Fixed again. -- ubizjak at gmail dot com changed: What|Removed |Added Status|REOPENED|RESOLVED Resolution||FIXED http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #4 from ubizjak at gmail dot com 2007-03-01 13:47 --- Current mainline produces really horrible code: .L4: movl(%edx), %ebx addl$1, %eax movl4(%edx), %esi addl$8, %edx movl%ebx, 8(%esp) movl%esi, 12(%esp) movq8(%esp), %mm0 paddq (%ecx), %mm0 addl$8, %ecx cmpl%edi, %eax movq%mm0, 8(%esp) movl8(%esp), %ebx movl12(%esp), %esi jne .L4 This is due to two problems: 1) For some reason, ivopts doesn't use fancy i386 addressing modes. -fno-ivopts produces slightly better code: .L4: movl(%edi,%eax,8), %edx movl4(%edi,%eax,8), %ecx movl%edx, 8(%esp) movl%ecx, 12(%esp) movq8(%esp), %mm0 paddq (%esi,%eax,8), %mm0 addl$1, %eax cmpl%eax, %ebx movq%mm0, 8(%esp) movl8(%esp), %edx movl12(%esp), %ecx ja .L4 2) A DImode register is used in the middle of RTL stream, following to this reload sequence: (insn:HI 21 20 53 4 (set (reg:DI 1 dx) (mem:DI (plus:SI (mult:SI (reg/v:SI 0 ax [orig:59 i ] [59]) (const_int 8 [0x8])) (reg/v/f:SI 5 di [orig:64 a ] [64])) [2 S8 A64])) 56 {*movdi_2} (nil) (nil)) (insn 53 21 54 4 (set (mem/c:DI (plus:SI (reg/f:SI 7 sp) (const_int 8 [0x8])) [5 S8 A8]) (reg:DI 1 dx)) 56 {*movdi_2} (nil) (nil)) (insn 54 53 22 4 (set (reg:DI 29 mm0) (mem/c:DI (plus:SI (reg/f:SI 7 sp) (const_int 8 [0x8])) [5 S8 A8])) 56 {*movdi_2} (nil) (nil)) (insn:HI 22 54 55 4 (set (reg:DI 29 mm0) (unspec:DI [ (plus:DI (reg:DI 29 mm0) (mem:DI (plus:SI (mult:SI (reg/v:SI 0 ax [orig:59 i ] [59]) (const_int 8 [0x8])) (reg/v/f:SI 4 si [orig:65 b ] [65])) [2 S8 A64])) ] 38)) 612 {mmx_adddi3} (insn_list:REG_DEP_TRUE 21 (nil)) (nil)) DImode register in insn 21 gets allocated to dx/cx DImode pair, but insn 22 wants mmx register. Reload then inserts insn 53 and 54 to satisfy input and output constraints. The same story repeats at the end of the loop, but this time dx/cx gets allocated to V2SImode pseudo (?!): (insn:HI 24 55 26 4 (set (reg/v:V2SI 1 dx [orig:60 sum ] [60]) (mem/c:V2SI (plus:SI (reg/f:SI 7 sp) (const_int 8 [0x8])) [5 S8 A8])) 581 {*movv2si_internal} (insn_list:REG_DEP_TRUE 22 (nil)) (nil)) In above case, mm0 register (in DImode) gets reloaded from mm0 (V2SImode) via memory. It looks that mmx DImode _really_ upsets register allocator as it can be allocated to either si/si register pair or to mmx register. Perhaps we need V1DI mode to separate pure DImodes (either 2*32bit for i686 or 64bit for x86_64) from mmx DImodes. It is possible to change delicate allocation balance by changing register preferences in movdi_2 and mov_internal MMX move patterns, but we really need more robust solution for this problem. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #5 from uros at gcc dot gnu dot org 2008-03-08 07:00 --- Subject: Bug 22152 Author: uros Date: Sat Mar 8 06:59:33 2008 New Revision: 133023 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133023 Log: 2008-03-08 Uros Bizjak <[EMAIL PROTECTED]> PR target/22152 * config/i386/i386-modes.def (V1DI): New vector mode. * config/i386/i386.h (VALID_MMX_REG_MODE): Add V1DImode. * config/i386/mmx.md (MMXMODEI8): New mode iterator. (MMXMODE248): Ditto. (MMXMODE): Add V1DI mode. (mmxvecsize): Change DI mode to V1DI mode. ("mov): Use MMXMODEI8 mode iterator. ("*mov_internal_rex64"): Ditto. ("*mov_internal"): Ditto. ("mmx_add3"): Ditto. Handle V1DImode for TARGET_SSE2. ("mmx_sub3"): Ditto. ("mmx_adddi3"): Remove insn pattern. ("mmx_subdi3"): Ditto. ("mmx_ashr3"): Use SImode and "yN" constraint for operand 2. ("mmx_lshr3"): Ditto. Use MMXMODE248 mode iterator. ("mmx_ashl3"): Ditto. ("mmx_lshrdi3"): Remove insn pattern. ("mmx_ashldi3"): Ditto. * config/i386/i386.c (classify_argument): Handle V1DImode. (function_arg_advance_32): Ditto. (function_arg_32): Ditto. (struct builtin_description) [IX86_BUILTIN_PADDQ]: Use mmx_addv1di3 insn pattern. [IX86_BUILTIN_PSUBQ]: Use mmx_subv1di3 insn pattern. [IX86_BUILTIN_PSLL?, IX86_BUILTIN_PSRL?, IX86_BUILTIN_PSRA?, IX86_BUILTIN_PSLL?I, IX86_BUILTIN_PSRL?I, IX86_BUILTIN_PSRA?I, IX86_BUILTIN_PSLL?I128, IX86_BUILTIN_PSRL?I128, IX86_BUILTIN_PSRA?I128]: Remove definitions of built-in functions. (V1DI_type_node): New node. (v1di_ftype_v1di_int): Ditto. (v1di_ftype_v1di_v1di): Ditto. (v2si_ftype_v2si_si): Ditto. (v4hi_ftype_v4hi_di): Remove node. (v2si_ftype_v2si_di): Ditto. (ix86_init_mmx_sse_builtins): Handle V1DImode. (__builtin_ia32_psll?, __builtin_ia32_psrl?, __builtin_ia32_psra?): Redefine builtins using def_builtin_const with *_ftype_*_int node. (__builtin_ia32_psll?i, __builtin_ia32_psrl?i, __builtin_ia32_psra?i): Add new builtins using def_builtin_const. (ix86_expand_builtin) [IX86_BUILTIN_PSLL?, IX86_BUILTIN_PSRL?, IX86_BUILTIN_PSRA?, IX86_BUILTIN_PSLL?I, IX86_BUILTIN_PSRL?I, IX86_BUILTIN_PSRA?I]: Handle builtin definitions. * config/i386/mmintrin.h (__v1di): New typedef. (_mm_add_si64): Cast arguments to __v1di type. (_mm_sub_si64): Ditto. (_mm_sll_pi16): Cast __count to __v4hi type. (_mm_sll_pi32): Cast __count to __v2si type. (_mm_sll_si64): Cast arguments to __v1di type. (_mm_srl_pi16): Cast __count to __v4hi type. (_mm_srl_pi32): Cast __count to __v2si type. (_mm_srl_si64): Cast arguments to __v1di type. (_mm_sra_pi16): Cast __count to __v4hi type. (_mm_sra_pi32): Cast __count to __v2si type. (_mm_slli_pi16): Use __builtin_ia32_psllwi. (_mm_slli_pi32): Use __builtin_ia32_pslldi. (_mm_slli_si64): Use __builtin_ia32_psllqi. Cast __m to __v1di type. (_mm_srli_pi16): Use __builtin_ia32_psrlwi. (_mm_srli_pi32): Use __builtin_ia32_psrldi. (_mm_srli_si64): Use __builtin_ia32_psrlqi. Cast __m to __v1di type. (_mm_srai_pi16): Use __builtin_ia32_psrawi. (_mm_srai_pi32): Use __builtin_ia32_psradi. * config/i386/i386.md (UNSPEC_NOP): Remove unspec definition. * doc/extend.texi (X86 Built-in Functions) [__builtin_ia32_psll?, __builtin_ia32_psrl?, __builtin_ia32_psra?, __builtin_ia32_psll?i, __builtin_ia32_psrl?i, __builtin_ia32_psra?i]: Add new builtins. Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386-modes.def trunk/gcc/config/i386/i386.c trunk/gcc/config/i386/i386.h trunk/gcc/config/i386/i386.md trunk/gcc/config/i386/mmintrin.h trunk/gcc/config/i386/mmx.md trunk/gcc/doc/extend.texi -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #6 from ubizjak at gmail dot com 2008-03-08 07:09 --- Patch at http://gcc.gnu.org/ml/gcc-patches/2008-03/msg00517.html. New asm for the original testcase is now much better [see patch post]. -- ubizjak at gmail dot com changed: What|Removed |Added URL||http://gcc.gnu.org/ml/gcc- ||patches/2008- ||03/msg00517.html http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #7 from ubizjak at gmail dot com 2008-03-08 07:21 --- For updated testcase: typedef int __m64 __attribute__ ((__vector_size__ (8))); typedef long long __v1di __attribute__ ((__vector_size__ (8))); __m64 unsigned_add3 (const __m64 * a, const __m64 * b, unsigned long count) { __m64 sum; unsigned int i; for (i = 1; i < count; i++) sum = (__m64) __builtin_ia32_paddq ((__v1di) a[i], (__v1di) b[i]); return sum; } we now generate: .L3: addl$1, %eax movq(%edx), %mm0 addl$8, %edx paddq (%ecx), %mm0 addl$8, %ecx cmpl%eax, %ebx ja .L3 and adding -mno-ivopts we are back to optimal code: .L3: movq(%ebx,%eax,8), %mm0 paddq (%ecx,%eax,8), %mm0 addl$1, %eax cmpl%eax, %edx ja .L3 So, fixed. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #8 from ubizjak at gmail dot com 2008-03-08 07:22 --- Fixed for real. -- ubizjak at gmail dot com changed: What|Removed |Added Status|NEW |RESOLVED Resolution||FIXED Target Milestone|--- |4.4.0 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #9 from uros at gcc dot gnu dot org 2008-03-08 12:44 --- Subject: Bug 22152 Author: uros Date: Sat Mar 8 12:43:13 2008 New Revision: 133034 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133034 Log: PR target/22152 * gcc.target/i386/pr22152.c: New test. * gcc.target/i386/sse2-mmx.c: Ditto. Added: trunk/gcc/testsuite/gcc.target/i386/pr22152.c trunk/gcc/testsuite/gcc.target/i386/sse2-mmx.c Modified: trunk/gcc/testsuite/ChangeLog -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Comment #10 from ubizjak at gmail dot com 2008-03-08 12:51 --- BTW: The larger testcase from the Comment 0 should add two numbers together, but the carry propagation logic in the loop is fatally flawed. The testcase that was added to the testsuite [1] fixes this problem. [1] http://gcc.gnu.org/viewcvs/trunk/gcc/testsuite/gcc.target/i386/sse2-mmx.c?revision=133034&view=markup&pathrev=133034 -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Additional Comments From pinskia at gcc dot gnu dot org 2005-06-22 20:26 --- This comes down to PR 14552 and PR 19161. Since this is MMX code, it is very hard to get correct as GCC does not currently output emms and vector code without you doing it which is what PR 19161 is about. This is a target problem. Also see PR 22076. -- What|Removed |Added BugsThisDependsOn||14552, 19161, 22076 Status|UNCONFIRMED |NEW Component|rtl-optimization|target Ever Confirmed||1 Last reconfirmed|-00-00 00:00:00 |2005-06-22 20:26:23 date|| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152
[Bug target/22152] Poor loop optimization when using mmx builtins
--- Additional Comments From fjahanian at apple dot com 2005-09-13 00:52 --- Has there been any progress toward fixing the problems addressed by these PRs? - thanks. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22152