[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-04-21 Thread ubizjak at gmail dot com


--- Comment #38 from ubizjak at gmail dot com  2008-04-21 08:21 ---
*** Bug 32301 has been marked as a duplicate of this bug. ***


-- 

ubizjak at gmail dot com changed:

   What|Removed |Added

 CC||tomash dot brechko at gmail
   ||dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-21 Thread ubizjak at gmail dot com


--- Comment #36 from ubizjak at gmail dot com  2008-03-21 10:33 ---
(In reply to comment #35)

 Also ffmpeg uses almost entirely asm() instead of intrinsics, so this alone
 is not so much a problem for ffmpeg as it is for others who followed the
 recommendation that intrinsics are better than asm.
 
 About trolling, well, I made no attempt to reply politely and diplomatically,
 no. But solving a problem in some use case by dropping support for that use
 case is kinda extreme.
 
 The way I see it is that
 * It's non-trivial to place emms optimally and automatically
 * there needs to be an emms between MMX code and FPU code
 
 The solutions to this would be any one of
 A. let the programmer place emms like it has been in the past
 B. don't support MMX at all
 C. don't support the x87 FPU at all
 D. place emms after every bunch of MMX instructions
 E. solve a quite non-trivial problem and place emms optimally
 
 The solution which has been selected is apparently B.; why was that chosen
 instead of, let's say, A.?
 
 If I write SIMD code then I know that I need an emms on x86. It's
 trivial for the programmer to place it optimally.

I don't know where you get the idea that MMX support was dropped in any way.
I won't engage in a discussion about autovectorisation, intrinsics, builtins,
generic vectorisation, etc. with you, but please look at PR 21395 for how a
performance PR should be filed. The MMX code in that PR is _far_ from
trivial, but since it is well written using intrinsic instructions, it
enables a jaw-dropping performance increase that is simply not possible when
ASM blocks are used.

Now, I'm sure that you have your numbers ready to back up your claims from
Comment #33 about the performance of the generated code, and I challenge you
to beat the performance of gcc-4.4-generated code with hand-crafted assembly
using the example of PR 21395.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-21 Thread michaelni at gmx dot at


--- Comment #37 from michaelni at gmx dot at  2008-03-22 02:39 ---
Subject: Re: compiled trivial vector intrinsic code is inefficient

On Fri, Mar 21, 2008 at 10:34:00AM -, ubizjak at gmail dot com wrote:
 
 
 --- Comment #36 from ubizjak at gmail dot com  2008-03-21 10:33 ---
 (In reply to comment #35)
 
  Also ffmpeg uses almost entirely asm() instead of intrinsics, so this
  alone is not so much a problem for ffmpeg as it is for others who
  followed the recommendation that intrinsics are better than asm.
  
  About trolling, well, I made no attempt to reply politely and
  diplomatically, no. But solving a problem in some use case by dropping
  support for that use case is kinda extreme.
  
  The way I see it is that
  * It's non-trivial to place emms optimally and automatically
  * there needs to be an emms between MMX code and FPU code
  
  The solutions to this would be any one of
  A. let the programmer place emms like it has been in the past
  B. don't support MMX at all
  C. don't support the x87 FPU at all
  D. place emms after every bunch of MMX instructions
  E. solve a quite non-trivial problem and place emms optimally
  
  The solution which has been selected is apparently B.; why was that
  chosen instead of, let's say, A.?
  
  If I write SIMD code then I know that I need an emms on x86. It's
  trivial for the programmer to place it optimally.
 
 I don't know where you get the idea that MMX support was dropped in any way. I

Maybe because the SIMD code in this PR, compiled with -mmmx, does not use MMX
but very significantly less efficient integer instructions. And you added a
test to gcc which ensures that this case does not use MMX instructions.

This is pretty much the definition of dropping MMX support (for this specific
case).


 won't engage in a discussion about autovectorisation, intrinsics, builtins,
 generic vectorisation, etc, etc with you,

And somehow I am glad about that.


 but please look at PR 21395 for how a
 performance PR should be filed.

 The MMX code in that PR is _far_ from trivial,

Well, that is something I would disagree about.


 but since it is well written using intrinsic instructions, it enables a
 jaw-dropping performance increase that is simply not possible when ASM
 blocks are used.
 
 Now, I'm sure that you have your numbers ready to back up your claims from
 Comment #33 about the performance of the generated code, and I challenge you
 to beat the performance of gcc-4.4-generated code with hand-crafted assembly
 using the example of PR 21395.

Done:
jaw-dropping intrinsics need
2.034s

stinky hand-written asm needs
1.312s

But you can read the details in PR 21395.

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-20 Thread ubizjak at gmail dot com


--- Comment #34 from ubizjak at gmail dot com  2008-03-20 09:49 ---
(In reply to comment #33)

 Anyway iam glad ffmpeg compiles fine under icc.

Me too. Now you will troll in their support lists.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-20 Thread michaelni at gmx dot at


--- Comment #35 from michaelni at gmx dot at  2008-03-20 17:18 ---
Subject: Re: compiled trivial vector intrinsic code is inefficient

On Thu, Mar 20, 2008 at 09:49:22AM -, ubizjak at gmail dot com wrote:
 
 
 --- Comment #34 from ubizjak at gmail dot com  2008-03-20 09:49 ---
 (In reply to comment #33)
 
  Anyway I am glad ffmpeg compiles fine under icc.
 
 Me too. Now you will troll in their support lists.

No, truth be told, I don't plan to switch to icc yet. Somehow I do prefer to
use free tools. Of course, if the gap becomes too big, I as well as most
others will switch to icc ...
Also ffmpeg uses almost entirely asm() instead of intrinsics, so this alone
is not so much a problem for ffmpeg as it is for others who followed the
recommendation that intrinsics are better than asm.

About trolling, well, I made no attempt to reply politely and diplomatically,
no. But solving a problem in some use case by dropping support for that use
case is kinda extreme.

The way I see it is that
* It's non-trivial to place emms optimally and automatically
* there needs to be an emms between MMX code and FPU code

The solutions to this would be any one of
A. let the programmer place emms like it has been in the past
B. don't support MMX at all
C. don't support the x87 FPU at all
D. place emms after every bunch of MMX instructions
E. solve a quite non-trivial problem and place emms optimally

The solution which has been selected is apparently B.; why was that chosen
instead of, let's say, A.?

If I write SIMD code then I know that I need an emms on x86. It's
trivial for the programmer to place it optimally.

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread ubizjak at gmail dot com


--- Comment #23 from ubizjak at gmail dot com  2008-03-19 10:45 ---
As said in PR 19161:

The LCM infrastructure doesn't support mode switching in a way that would be
usable for emms. Additionally, there are MANY problems expected when sharing
x87 and MMX registers (i.e. handling of uninitialized x87 registers at the
beginning of the function; this is the reason we don't implement the x87
register-passing ABI).

Automatic MMX vectorization is not exactly a very usable feature nowadays (we
have SSE, which works quite well here). Due to recent changes in the MMX
register allocation area, excellent code is produced using MMX intrinsics, so
I'm closing this bug as WONTFIX.

Also, auto-vectorization would produce either MMX or SSE code, but not both of
them:

#define UNITS_PER_SIMD_WORD (TARGET_SSE ? 16 : UNITS_PER_WORD)


-- 

ubizjak at gmail dot com changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution||WONTFIX


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread astrange at ithinksw dot com


--- Comment #24 from astrange at ithinksw dot com  2008-03-19 19:21 ---
For
typedef short mmxw  __attribute__ ((mode(V4HI)));
typedef int   mmxdw __attribute__ ((mode(V2SI)));

mmxdw dw;
mmxw w;

void test(){
    w += w;
    dw = (mmxdw)w;
}

void test2(){
    w = __builtin_ia32_paddw(w, w);
    dw = (mmxdw)w;
}

gcc SVN generates the expected code for test2(), but not for test(). I don't
think using += on an MMX variable should count as autovectorization; if
you're doing either, you should know where to put emms yourself.

For test() we get:
subl    $28, %esp
movq    _w, %mm0
movq    %mm0, 8(%esp)
movzwl  8(%esp), %eax
movzwl  10(%esp), %edx
movzwl  12(%esp), %ecx
addl    %eax, %eax
addl    %edx, %edx
movw    %ax, _w
movw    %dx, _w+2
movzwl  14(%esp), %eax
addl    %ecx, %ecx
addl    %eax, %eax
movw    %cx, _w+4
movw    %ax, _w+6
movq    _w, %mm0
movq    %mm0, _dw
addl    $28, %esp
ret

which touches %mm0 (requiring emms, I think) but does not use paddw (so it is
slow and silly-looking).
LLVM generates the expected code for both of them.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread astrange at ithinksw dot com


--- Comment #25 from astrange at ithinksw dot com  2008-03-19 19:39 ---
Actually the first generates:
subl    $12, %esp
movq    _w, %mm0
paddw   %mm0, %mm0
movq    %mm0, _w
movq    _w, %mm0
movq    %mm0, _dw
addl    $12, %esp
ret

which is better than the code in the original report but still has a useless
store/reload.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread uros at gcc dot gnu dot org


--- Comment #26 from uros at gcc dot gnu dot org  2008-03-19 23:39 ---
Subject: Bug 14552

Author: uros
Date: Wed Mar 19 23:38:35 2008
New Revision: 133354

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133354
Log:
PR target/14552
* config/i386/mmx.md (*mov<mode>_internal_rex64): Adjust register
allocator preferences for y and r class registers.
(*mov<mode>_internal): Ditto.
(*movv2sf_internal_rex64): Ditto.
(*movv2sf_internal): Ditto.

testsuite/ChangeLog:

PR target/14552
* gcc.target/i386/pr14552.c: New test.


Added:
trunk/gcc/testsuite/gcc.target/i386/pr14552.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/mmx.md
trunk/gcc/testsuite/ChangeLog


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread ubizjak at gmail dot com


--- Comment #27 from ubizjak at gmail dot com  2008-03-19 23:46 ---
(In reply to comment #25)
 Actually the first generates:
 subl    $12, %esp
 movq    _w, %mm0
 paddw   %mm0, %mm0
 movq    %mm0, _w
 movq    _w, %mm0
 movq    %mm0, _dw
 addl    $12, %esp
 ret
 
 which is better than the code in the original report but still has a useless
 store/reload.

The store is not useless. Reload from _w is how gcc handles double stores
nowadays and is not MMX-specific. It looks like some pass forgot to check
where the value came from.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread pinskia at gcc dot gnu dot org


--- Comment #28 from pinskia at gcc dot gnu dot org  2008-03-19 23:49 ---
(In reply to comment #27)
 The store is not useless. Reload from _w is how gcc handles double stores
 nowadays and is not mmx specific. It looks like some pass forgot to check
 where the value came from.

Do you happen to know if there are two different modes at work here?  If so
there are patches which fix this up in DSE and post-reload CSE.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread ubizjak at gmail dot com


--- Comment #29 from ubizjak at gmail dot com  2008-03-20 00:01 ---
Now we generate:

-m32 -mmmx -msse2:

test:
subl    $20, %esp
movl    w, %eax
movl    w+4, %edx
movl    %ebx, 12(%esp)
movl    %esi, 16(%esp)
movl    %eax, (%esp)
movzwl  (%esp), %ecx
movl    %edx, 4(%esp)
movzwl  2(%esp), %ebx
movzwl  4(%esp), %esi
movzwl  6(%esp), %eax
addl    %ecx, %ecx
addl    %ebx, %ebx
addl    %esi, %esi
addl    %eax, %eax
movw    %bx, w+2
movl    12(%esp), %ebx
movw    %si, w+4
movl    16(%esp), %esi
movw    %ax, w+6
movl    w+4, %edx
movw    %cx, w
movl    w, %eax
movl    %edx, dw+4
movl    %eax, dw
addl    $20, %esp
ret

-m64 -mmmx -msse2:

test:
movabsq $9223231297218904063, %rax
andq    w(%rip), %rax
addq    %rax, %rax
movq    %rax, w(%rip)
movq    w(%rip), %rax
movq    %rax, dw(%rip)
ret

The issue with useless reload is PR 12395, as mentioned in Comment #5.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread ubizjak at gmail dot com


--- Comment #30 from ubizjak at gmail dot com  2008-03-20 00:04 ---
(In reply to comment #28)
 (In reply to comment #27)
  The store is not useless. Reload from _w is how gcc handles double stores
  nowadays and is not mmx specific. It looks that some pass forgot to check 
  where
  the value came from.
 
 Do you happen to know if there are two different modes at work here?  If so
 there are patches which fix this up in DSE and post-reload CSE.

Yes, from comment #24 (slightly changed):

typedef short mmxw  __attribute__ ((vector_size (8)));
typedef int   mmxdw __attribute__ ((vector_size (8)));

mmxdw dw;
mmxw w;

so, we have V4HI and V2SI.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread pinskia at gmail dot com


--- Comment #31 from pinskia at gmail dot com  2008-03-20 00:23 ---
Subject: Re:  compiled trivial vector intrinsic code is inefficient

See pr 33790.

Sent from my iPhone

On Mar 19, 2008, at 17:04, ubizjak at gmail dot com [EMAIL PROTECTED] 
  wrote:



 --- Comment #30 from ubizjak at gmail dot com  2008-03-20 00:04 ---
 (In reply to comment #28)
 (In reply to comment #27)
 The store is not useless. Reload from _w is how gcc handles double stores
 nowadays and is not mmx specific. It looks like some pass forgot to check
 where the value came from.

 Do you happen to know if there are two different modes at work here?  If so
 there are patches which fix this up in DSE and post-reload CSE.

 Yes, from comment #24 (slightly changed):

 typedef short mmxw  __attribute__ ((vector_size (8)));
 typedef int   mmxdw __attribute__ ((vector_size (8)));

 mmxdw dw;
 mmxw w;

 so, we have V4HI and V2SI.


 -- 


 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread astrange at ithinksw dot com


--- Comment #32 from astrange at ithinksw dot com  2008-03-20 00:39 ---
This is missed on trees:
mmxdw dw;
mmxw w;

void test2(){
    w = __builtin_ia32_paddw(w, w); dw = (mmxdw)w;
}

void test3(){
    mmxw w2 = __builtin_ia32_paddw(w, w); dw = (mmxdw)w2;
}

test2 ()
{
  vector short int w.4;
  vector short int w.3;

bb 2:
  w.3 = w;
  w.4 = __builtin_ia32_paddw (w.3, w.3);
  w = w.4;
  dw = VIEW_CONVERT_EXPR<vector int>(w);
  return;
}

test3 ()
{
  mmxw w2;
  vector short int w.6;

bb 2:
  w.6 = w;
  w2 = __builtin_ia32_paddw (w.6, w.6);
  dw = VIEW_CONVERT_EXPR<vector int>(w2);
  return;
}


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-19 Thread michaelni at gmx dot at


--- Comment #33 from michaelni at gmx dot at  2008-03-20 01:37 ---
Subject: Re: compiled trivial vector intrinsic code is inefficient

On Wed, Mar 19, 2008 at 11:39:18PM -, uros at gcc dot gnu dot org wrote:
 
 
 --- Comment #26 from uros at gcc dot gnu dot org  2008-03-19 23:39 ---
 Subject: Bug 14552
[...]
 * gcc.target/i386/pr14552.c: New test.
 
 
 Added:
 trunk/gcc/testsuite/gcc.target/i386/pr14552.c

Thanks, I was already scared that the inversely proportional relation between
version number and performance, so nicely followed since 2.95, would stop.
Adding a test to the testsuite to ensure that MMX intrinsics don't use
MMX registers is, well, just brilliant.
I am already eagerly awaiting the testcase which will check that floating
point code doesn't use the FPU; I assume that will happen in gcc 5.0?

Anyway, I am glad ffmpeg compiles fine under icc.

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2008-03-07 Thread ubizjak at gmail dot com


--- Comment #22 from ubizjak at gmail dot com  2008-03-08 07:29 ---
*** Bug 25277 has been marked as a duplicate of this bug. ***


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2005-11-30 Thread pluto at agmk dot net


--- Comment #21 from pluto at agmk dot net  2005-12-01 00:52 ---
I'm wondering whether it is possible to implement transformations of vector
arithmetic into vector builtins.

e.g.

#include <mmintrin.h>
__v8qi foo(const __v8qi x, const __v8qi y) { return x + y; }
__v8qi bar(const __v8qi x, const __v8qi y) { return _mm_add_pi8(x, y); }

I expect the same code from the compiler for both functions,
but it produces insane code for foo() :/

foo (x, y)
{
  unsigned int D.2377;
  unsigned int D.2376;
  unsigned int D.2369;
  unsigned int D.2368;
bb 0:
  D.2368 = BIT_FIELD_REF <x, 32, 0>;
  D.2369 = BIT_FIELD_REF <y, 32, 0>;
  D.2376 = BIT_FIELD_REF <x, 32, 32>;
  D.2377 = BIT_FIELD_REF <y, 32, 32>;
  return VIEW_CONVERT_EXPR<__v8qi>(
    {(D.2368 ^ D.2369) & 0x80808080 ^ (D.2369 & 2139062143) +
     (D.2368 & 2139062143),
     (D.2376 ^ D.2377) & 0x80808080 ^ (D.2377 & 2139062143) +
     (D.2376 & 2139062143)});
}

bar (x, y)
{
  vector signed char D.2448;
bb 0:
  D.2448 = __builtin_ia32_paddb (
    VIEW_CONVERT_EXPR<vector signed char>(VIEW_CONVERT_EXPR<__m64>(x)),
    VIEW_CONVERT_EXPR<vector signed char>(VIEW_CONVERT_EXPR<__m64>(y)));
  return VIEW_CONVERT_EXPR<__v8qi>(VIEW_CONVERT_EXPR<vector int>(D.2448));
}

# gcc -O2 -march=pentium3 -fomit-frame-pointer -mregparm=3

foo:
subl    $44, %esp
movq    %mm0, 24(%esp)
movl    %ebx, 32(%esp)
movl    24(%esp), %ebx
movl    %esi, 36(%esp)
movl    28(%esp), %esi
movq    %mm1, 24(%esp)
movl    24(%esp), %eax
movl    28(%esp), %edx
movl    %edi, 40(%esp)
movl    %ebx, %edi
andl    $2139062143, %edi
movl    %eax, %ecx
xorl    %eax, %ebx
andl    $2139062143, %ecx
movl    %esi, %eax
addl    %edi, %ecx
xorl    %edx, %eax
movl    40(%esp), %edi
andl    $2139062143, %esi
andl    $-2139062144, %ebx
andl    $2139062143, %edx
xorl    %ecx, %ebx
addl    %esi, %edx
andl    $-2139062144, %eax
movl    36(%esp), %esi
movl    %ebx, 20(%esp)
xorl    %edx, %eax
movl    32(%esp), %ebx
movss   20(%esp), %xmm0
movl    %eax, 20(%esp)
movss   20(%esp), %xmm1
unpcklps %xmm1, %xmm0
movlps  %xmm0, 8(%esp)
movl    8(%esp), %eax
movl    12(%esp), %edx
movl    %eax, (%esp)
movl    %edx, 4(%esp)
movq    (%esp), %mm1
addl    $44, %esp
movq    %mm1, %mm0
ret

bar:
paddb   %mm1, %mm0
ret


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2005-11-21 Thread pcarlini at suse dot de


-- 

pcarlini at suse dot de changed:

   What|Removed |Added

 AssignedTo|uros at kss-loka dot si |unassigned at gcc dot gnu
   ||dot org
 Status|ASSIGNED|NEW
Summary|compiled trivial vector |compiled trivial vector
   |intrinsic code is   |intrinsic code is
   |ineffiencent|inefficient


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2005-11-21 Thread pcarlini at suse dot de


--- Comment #17 from pcarlini at suse dot de  2005-11-21 11:34 ---
Sorry.


-- 

pcarlini at suse dot de changed:

   What|Removed |Added

 AssignedTo|unassigned at gcc dot gnu   |uros at kss-loka dot si
   |dot org |
 Status|NEW |ASSIGNED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2005-11-21 Thread pluto at agmk dot net


--- Comment #18 from pluto at agmk dot net  2005-11-21 15:05 ---
gcc-3.3.6 produces better code:

test:   movq    w, %mm1
psllw   $1, %mm1
movq    %mm1, w
movq    w, %mm1
movq    %mm1, dw
ret

.comm   dw,8,8
.comm   w,8,8


can we classify this as a code size regression?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2005-11-21 Thread pinskia at gcc dot gnu dot org


--- Comment #19 from pinskia at gcc dot gnu dot org  2005-11-21 15:09 ---
(In reply to comment #18)
 can we classify this as a code size regression?

No because 3.3.x was also wrong in the sense it did not emit an emms.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



[Bug target/14552] compiled trivial vector intrinsic code is inefficient

2005-11-21 Thread pluto at agmk dot net


--- Comment #20 from pluto at agmk dot net  2005-11-21 18:38 ---
(In reply to comment #19)
 (In reply to comment #18)
  can we classify this as a code size regression?
 
 No because 3.3.x was also wrong in the sense it did not emit an emms.

ok.

gcc 4.1.0/20051113 with the x87/MMX mode-switching patch produces:

test:   movq    w, %mm0
paddw   %mm0, %mm0
movq    %mm0, w
movl    w, %eax
movl    w+4, %edx
movl    %eax, dw
movl    %edx, dw+4
emms
ret

.comm   dw,8,8
.comm   w,8,8

it isn't optimal, but it is correct (emms opcode) and smaller than the pure
4.1 output.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552