[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2020-04-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #53 from Peter Cordes  ---
I think we can close this as fixed at some point.  The last activity on this
bug was a set of patches that sound like they were meant to fix it, and the
MCVEs from the comments that I tested no longer show the problem.

GCC 9.3 -O3 -march=core2 -fomit-frame-pointer only uses a `.p2align` to align
the top of the loop, not between leave and ret or between cmp and jcc.

void wait_for_enter()
{
    volatile int foo = 0;  // to get a LEAVE instruction emitted at all
    int u = getchar();
    while (!u)
        u = getchar() - 13;
}

https://godbolt.org/z/RvxzZv

(Note that Godbolt normally filters .p2align so you have to either compile to
binary or not filter directives in the asm source.  Otherwise you'll never see
NOPs except in the unusual case where GCC actually emits a nop mnemonic.)

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-21 Thread jakub at gcc dot gnu dot org


--- Comment #52 from jakub at gcc dot gnu dot org  2009-05-21 13:26 ---
Subject: Bug 39942

Author: jakub
Date: Thu May 21 13:26:13 2009
New Revision: 147766

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147766
Log:
PR target/39942
* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit second
.p2align 3 if MAX_SKIP is smaller than 7.
* config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

Modified:
branches/gcc-4_3-branch/gcc/ChangeLog
branches/gcc-4_3-branch/gcc/config/i386/linux.h
branches/gcc-4_3-branch/gcc/config/i386/x86-64.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-21 Thread jakub at gcc dot gnu dot org


--- Comment #51 from jakub at gcc dot gnu dot org  2009-05-21 13:21 ---
Subject: Bug 39942

Author: jakub
Date: Thu May 21 13:21:30 2009
New Revision: 147765

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147765
Log:
PR target/39942
* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit second
.p2align 3 if MAX_SKIP is smaller than 7.
* config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

Modified:
branches/gcc-4_4-branch/gcc/ChangeLog
branches/gcc-4_4-branch/gcc/config/i386/linux.h
branches/gcc-4_4-branch/gcc/config/i386/x86-64.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-20 Thread jakub at gcc dot gnu dot org


--- Comment #50 from jakub at gcc dot gnu dot org  2009-05-20 22:09 ---
Aren't nopl 0x0(%rax,%rax,1) and nopw 0x0(%rax,%rax,1) padding (though in this
case it was added for label alignment or function entry alignment, not to avoid
4+ jumps in one 16-byte page)?
Anyway, you want to look at both the -S output and objdump -d of the -c output;
there you'll see the needed .p2align added.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-20 Thread vvv at ru dot ru


--- Comment #49 from vvv at ru dot ru  2009-05-20 21:38 ---
(In reply to comment #48)
How do these patches work? Are some special options required?

# /media/disk-1/B/bin/gcc --version
gcc (GCC) 4.5.0 20090520 (experimental)
# cat test.c
extern int F(int m);
void f(int i)
{
    if (i == 1) F(1);
    if (i == 2) F(2);
    if (i == 3) F(3);
    if (i == 4) F(4);
    if (i == 5) F(5);
}
void func(int x)
{
    int u = F(x);
    while (u)
        u = F(u)*3+1;
}
# /media/disk-1/B/bin/gcc -o t test.c -O2 -c -mtune=k8
# objdump -d t
0000000000000000 <f>:
   0:   83 ff 01                cmp    $0x1,%edi
   3:   74 1b                   je     20
   5:   83 ff 02                cmp    $0x2,%edi
   8:   74 16                   je     20
   a:   83 ff 03                cmp    $0x3,%edi
   d:   74 11                   je     20
   f:   83 ff 04                cmp    $0x4,%edi
  12:   74 0c                   je     20
  14:   83 ff 05                cmp    $0x5,%edi
  17:   74 07                   je     20
  19:   f3 c3                   repz retq
  1b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  20:   31 c0                   xor    %eax,%eax
  22:   e9 00 00 00 00          jmpq   27
  27:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  2e:   00 00

0000000000000030 <func>:
  30:   48 83 ec 08             sub    $0x8,%rsp
  34:   e8 00 00 00 00          callq  39
  39:   85 c0                   test   %eax,%eax
  3b:   89 c7                   mov    %eax,%edi
  3d:   74 0e                   je     4d
  3f:   90                      nop
  40:   e8 00 00 00 00          callq  45
  45:   8d 7c 40 01             lea    0x1(%rax,%rax,2),%edi
  49:   85 ff                   test   %edi,%edi
  4b:   75 f3                   jne    40
  4d:   48 83 c4 08             add    $0x8,%rsp
  51:   c3                      retq

I can't see any padding in function f :(

PS. In file config/i386/i386.c (ix86_avoid_jump_mispredicts)

  /* Look for all minimal intervals of instructions containing 4 jumps.
...

Not jumps, but _branches_ (CALL, JMP, conditional branches, or returns).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-18 Thread hjl at gcc dot gnu dot org


--- Comment #48 from hjl at gcc dot gnu dot org  2009-05-18 17:21 ---
Subject: Bug 39942

Author: hjl
Date: Mon May 18 17:21:13 2009
New Revision: 147671

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147671
Log:
2009-05-18  H.J. Lu  

PR target/39942
* config/i386/i386.c (ix86_avoid_jump_misspredicts): Replace
gen_align with gen_pad.
(ix86_reorg): Check ASM_OUTPUT_MAX_SKIP_PAD instead of
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN.

* config/i386/i386.h (ASM_OUTPUT_MAX_SKIP_PAD): New.
* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_PAD): Likewise.

* config/i386/i386.md (align): Renamed to ...
(pad): This.  Replace ASM_OUTPUT_MAX_SKIP_ALIGN with
ASM_OUTPUT_MAX_SKIP_PAD.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
trunk/gcc/config/i386/i386.h
trunk/gcc/config/i386/i386.md
trunk/gcc/config/i386/x86-64.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-16 Thread jakub at gcc dot gnu dot org


--- Comment #47 from jakub at gcc dot gnu dot org  2009-05-16 07:12 ---
Subject: Bug 39942

Author: jakub
Date: Sat May 16 07:12:02 2009
New Revision: 147607

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147607
Log:
PR target/39942
* final.c (label_to_max_skip): New function.
(label_to_alignment): Only use LABEL_TO_ALIGNMENT if
CODE_LABEL_NUMBER <= max_labelno.
* output.h (label_to_max_skip): New prototype.
* config/i386/i386.c (ix86_avoid_jump_misspredicts): Renamed to...
(ix86_avoid_jump_mispredicts): ... this.  Don't define if
ASM_OUTPUT_MAX_SKIP_ALIGN isn't defined.  Update comment.
Handle CODE_LABELs with >= 16 byte alignment or with
max_skip == (1 << align) - 1.
(ix86_reorg): Don't call ix86_avoid_jump_mispredicts if
ASM_OUTPUT_MAX_SKIP_ALIGN isn't defined.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
trunk/gcc/final.c
trunk/gcc/output.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-16 Thread jakub at gcc dot gnu dot org


--- Comment #46 from jakub at gcc dot gnu dot org  2009-05-16 07:10 ---
Subject: Bug 39942

Author: jakub
Date: Sat May 16 07:09:52 2009
New Revision: 147606

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147606
Log:
PR target/39942
* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit second
.p2align 3 if MAX_SKIP is smaller than 7.
* config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/linux.h
trunk/gcc/config/i386/x86-64.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org


--- Comment #45 from jakub at gcc dot gnu dot org  2009-05-16 06:37 ---
cmpl $1, %eax does have the modrm byte:
83 f8 01         cmp    $0x1,%eax
compared to cmpl $0xdeadbeef, %eax, which doesn't have it:
3d ef be ad de   cmp    $0xdeadbeef,%eax
So I think what I wrote is more precise.  The modrm byte is there if the insn
has an ax_reg_operand destination and an immediate source that has been
shortened.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread hjl dot tools at gmail dot com


--- Comment #44 from hjl dot tools at gmail dot com  2009-05-15 23:05 
---
(In reply to comment #41)
> The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made me
> look at one of the cases which was wrong and the problem is that cmp $0x1d, %al
> has too large a get_attr_length (insn) value, 3 instead of 2, because GCC
> thinks it has modrm byte when it has not.
> Testing:
> --- gcc/config/i386/i386.md.jj2009-05-13 08:42:51.0 +0200
> +++ gcc/config/i386/i386.md2009-05-15 18:06:40.0 +0200
> @@ -504,6 +504,9 @@
>   (and (eq_attr "type" "callv")
>(match_operand 1 "constant_call_address_operand" ""))
>   (const_int 0)
> + (and (eq_attr "type" "alu,alu1,icmp,test")
> +  (match_operand 0 "ax_reg_operand" ""))
> + (symbol_ref "(get_attr_length_immediate (insn) > (get_attr_mode (insn) 
> !=
> MODE_QI))")
>   ]
>   (const_int 1)))
> 

"cmp imm,%al/%ax/%eax/%rax" doesn't have the modrm byte. I think
this patch works better:

--- i386.md.branch  2009-05-15 11:30:42.0 -0700
+++ i386.md 2009-05-15 14:44:11.0 -0700
@@ -504,6 +504,10 @@
 (and (eq_attr "type" "callv")
  (match_operand 1 "constant_call_address_operand" ""))
 (const_int 0)
+(and (eq_attr "type" "alu,alu1,icmp,test")
+ (match_operand 0 "ax_reg_operand" "")
+ (match_operand 1 "immediate_operand" ""))
+(const_int 0)
 ]
 (const_int 1)))



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org


--- Comment #43 from jakub at gcc dot gnu dot org  2009-05-15 18:23 ---
Some code size growth is from enlarged get_attr_modrm though, 292 bytes for
64-bit, 1338 bytes for 32-bit.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org


--- Comment #42 from jakub at gcc dot gnu dot org  2009-05-15 18:18 ---
Sizes with the #c41 patch together with the 3 patches mentioned in #c31 are:
0x890038 (64-bit) and 0x8ce08c (32-bit), 44 bad 16-byte pages in 64-bit, 35 in
32-bit.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org


--- Comment #41 from jakub at gcc dot gnu dot org  2009-05-15 16:24 ---
The 34 resp. 51 cases of 4 branches in a 16-byte page with the 3 patches
together made me look at one of the cases which was wrong; the problem is that
cmp $0x1d, %al gets too large a get_attr_length (insn) value, 3 instead of 2,
because GCC thinks it has a modrm byte when it does not.
Testing:
--- gcc/config/i386/i386.md.jj2009-05-13 08:42:51.0 +0200
+++ gcc/config/i386/i386.md2009-05-15 18:06:40.0 +0200
@@ -504,6 +504,9 @@
  (and (eq_attr "type" "callv")
   (match_operand 1 "constant_call_address_operand" ""))
  (const_int 0)
+ (and (eq_attr "type" "alu,alu1,icmp,test")
+  (match_operand 0 "ax_reg_operand" ""))
+ (symbol_ref "(get_attr_length_immediate (insn) > (get_attr_mode (insn) !=
MODE_QI))")
  ]
  (const_int 1)))

now on top of the 3 patches (without the 4th) to see what it does to code size
and number of 4+ branches in 16 byte page.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread hjl dot tools at gmail dot com


--- Comment #40 from hjl dot tools at gmail dot com  2009-05-15 14:35 
---
(In reply to comment #37)
> This patch looks very wrong.  It assumes that min_insn_size gives exact insn
> sizes (current min_insn_size is very far from that, but even get_attr_length
> isn't exact), doesn't take into account label alignments nor branch shortening
> which can change the insn sizes afterwards and assumes that a p2align always
> aligns to 16 bytes (it does not).
> While the previous algorithm works with estimated 16 consecutive bytes rather
> than 16 byte pages 0x0 ... 0xXXXF, that's because during machine reorg
> you simply can't know in most cases where exactly the 16 byte page will start,
> so you assume it can start (almost) anywhere (and use .p2align behavior to
> align when needed).
> 

There is no perfect solution here. Let's list pros/cons:

The current algorithm:

Pros:

1. Very conservative. Catches most cases of 4 branches in a 16-byte window.

Cons:

1. It works on a 16-byte window, not a 16-byte page.
2. When it gets it wrong, it increases code size by adding unnecessary nops.

My proposal:

Pros:

1. Works on the 16-byte page.
2. Even if it gets it wrong, it doesn't increase code size.

Cons:

1. Relies on inaccurate instruction length data.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org


--- Comment #39 from jakub at gcc dot gnu dot org  2009-05-15 12:12 ---
Created an attachment (id=17874)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17874&action=view)
test4jmp.sh


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org


--- Comment #38 from jakub at gcc dot gnu dot org  2009-05-15 12:11 ---
To extend #c31, I've also built the same tree with another patch which made
sure ix86_avoid_jump_mispredicts is never called (change "&& optimize" into "&&
optimize > 4" in ix86_reorg).  cc1plus sizes were then
0x88d6d8 bytes for 64-bit cc1plus and 0x8c980c bytes for 32-bit cc1plus.
That is 2.42% resp. 2.06% smaller than vanilla.

I've also changed my testing script so that it (hopefully; point me at any
errors) follows the AMD rules more closely (namely that the last byte of the
branch insn counts, not the first one); I will attach it.
With that script, I got following number of violations for 64-bit cc1plus
(vanilla, +1patch, +2patches, +3patches, +4patches):
6, 7, 6, 51, 138
and for 32-bit cc1plus:
1, 2, 3, 34, 159.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-15 Thread jakub at gcc dot gnu dot org


--- Comment #37 from jakub at gcc dot gnu dot org  2009-05-15 07:56 ---
This patch looks very wrong.  It assumes that min_insn_size gives exact insn
sizes (current min_insn_size is very far from that, but even get_attr_length
isn't exact), doesn't take into account label alignments nor branch shortening
which can change the insn sizes afterwards and assumes that a p2align always
aligns to 16 bytes (it does not).
While the previous algorithm works with estimated 16 consecutive bytes rather
than 16 byte pages 0x0 ... 0xXXXF, that's because during machine reorg
you simply can't know in most cases where exactly the 16 byte page will start,
so you assume it can start (almost) anywhere (and use .p2align behavior to
align when needed).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread hjl dot tools at gmail dot com


--- Comment #36 from hjl dot tools at gmail dot com  2009-05-15 04:32 
---
Created an attachment (id=17871)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17871&action=view)
An updated patch

A few comments:

1. The 3-branch limit is per 16-byte page, not per 16-byte window.
2. We should allow 3 branches in a 16-byte page.
3. When we have 4 branches in a 16-byte page, we only need to
pad to the 16-byte page boundary before the 4th branch, which
will then start at the next 16-byte page.


-- 

hjl dot tools at gmail dot com changed:

   What|Removed |Added

  Attachment #17870|0   |1
is obsolete||


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread hjl dot tools at gmail dot com


--- Comment #35 from hjl dot tools at gmail dot com  2009-05-15 02:23 
---
Created an attachment (id=17870)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17870&action=view)
A patch

This patch limits 3 branches per 16byte page.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread vvv at ru dot ru


--- Comment #34 from vvv at ru dot ru  2009-05-14 19:43 ---
(In reply to comment #32)
> Please make sure that you only test nop paddings for branch insns,
> not nop paddings for branch targets, which prefer 16byte alignment.

Additional test results (for Core2):
1. Execution time doesn't depend on padding for branch targets.
2. Execution time doesn't depend on the position of the NOP within a 16-byte
chunk with 4 branches, even if the NOP is inserted between CMP and the
conditional jump.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread hjl dot tools at gmail dot com


--- Comment #33 from hjl dot tools at gmail dot com  2009-05-14 18:37 
---
(In reply to comment #20)
> Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
> very very likely refers to a cache-line.
> 

That is true. For Intel CPUs, a "16-byte chunk" means the memory range XXX0 - XXXF.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread hjl dot tools at gmail dot com


--- Comment #32 from hjl dot tools at gmail dot com  2009-05-14 15:58 
---
(In reply to comment #30)
> Created an attachment (id=17863)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
> Testing tool.
> 

Please make sure that you only test nop paddings for branch insns,
not nop paddings for branch targets, which prefer 16byte alignment.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread jakub at gcc dot gnu dot org


--- Comment #31 from jakub at gcc dot gnu dot org  2009-05-14 15:15 ---
Some -O2 code size data from today's trunk bootstraps.  The first .text line
is always vanilla bootstrap, the second one with
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html
only, the third one with that patch and
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00703.html
and the last with additional:
--- i386.c.jj3  2009-05-14 12:41:24.0 +0200
+++ i386.c  2009-05-14 14:48:24.0 +0200
@@ -27202,7 +27202,7 @@ x86_function_profiler (FILE *file, int l
 static int
 min_insn_size (rtx insn)
 {
-  int l = 0;
+  int l = 0, len;

   if (!INSN_P (insn) || !active_insn_p (insn))
 return 0;
@@ -27222,7 +27222,8 @@ min_insn_size (rtx insn)
   && symbolic_reference_mentioned_p (PATTERN (insn))
   && !SIBLING_CALL_P (insn))
 return 5;
-  if (get_attr_length (insn) <= 1)
+  len = get_attr_length (insn);
+  if (len <= 1)
 return 1;

   /* For normal instructions we may rely on the sizes of addresses
@@ -27230,6 +27231,9 @@ min_insn_size (rtx insn)
  This is not the case for jumps where references are PC relative.  */
   if (!JUMP_P (insn))
 {
+  if (get_attr_type (insn) != TYPE_MULTI)
+   return len;
+
   l = get_attr_length_address (insn);
   if (l < 4 && symbolic_reference_mentioned_p (PATTERN (insn)))
l = 4;
to see how the code size changes with much more accurate (though sometimes not
minimum but maximum bound) insn sizing for the algorithm.
64-bit cc1plus
  [12] .text  PROGBITS  0047f990 07f990 8c3ba8 00  AX  0  0 16
  [12] .text  PROGBITS  0047f990 07f990 89b1e8 00  AX  0  0 16
  [12] .text  PROGBITS  0047f9c0 07f9c0 899f78 00  AX  0  0 16
  [12] .text  PROGBITS  0047f9c0 07f9c0 88eaf8 00  AX  0  0 16
32-bit cc1plus
  [12] .text  PROGBITS  080b24e0 06a4e0 8f8cac 00  AX  0  0 16
  [12] .text  PROGBITS  080b24e0 06a4e0 8d516c 00  AX  0  0 16
  [12] .text  PROGBITS  080b2510 06a510 8d507c 00  AX  0  0 16
  [12] .text  PROGBITS  080b2510 06a510 8cbd7c 00  AX  0  0 16
For 64-bit cc1plus that's 1.8%, 1.86%, 2.36% smaller binary with the 1, 2 resp.
3 patches, for 32-bit cc1plus 1.55%, 1.56%, 1.96% smaller binary.
So the first patch is the most important and something like the third one,
perhaps with more exceptions, also makes a difference.  I'll now try to update
my awk script to check for the AMD rules, namely that the last byte of the
branch insn counts rather than the first.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread vvv at ru dot ru


--- Comment #30 from vvv at ru dot ru  2009-05-14 09:01 ---
Created an attachment (id=17863)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.

Here are the results of my testing.
Code:
align   128
test_cikl:
rept 14 ; 14 if SH=0, 15 if SH=1, 16 if SH=2
{
nop
}
 cmp  al,0   ; 2 bytes

 jz   $+10h+NOPS ; 2 bytes offset=0
 cmp  al,1   ; 2 bytes offset=2
 jz   $+0Ch+NOPS ; 2 bytes offset=4
 cmp  al,2   ; 2 bytes offset=6
 jz   $+08h+NOPS ; 2 bytes offset=8
 cmp  al,3   ; 2 bytes offset=A
match =1, NOPS
{
   nop
}
match =2, NOPS
{
   xchg eax,eax ; 2-bytes NOP
}
 jz   $+04h  ; 2 bytes offset=C
 ja   $+02h  ; 2 bytes offset=E

 mov  eax,ecx
 and  eax,7h
 loop test_cikl

This code tested on Core2,Xeon and P4 CPU. Results in RDTSC ticks.

; Core 2 Duo
;       NOPS/tick/Max  NOPS/tick/Max  NOPS/tick/Max
; SH=0  0/571/729  1/306/594   2/315/630
; SH=1  0/338/612  1/338/648   2/339/648
; SH=2  0/339/666  1/339/675   2/333/693

; Xeon 3110
;       NOPS/tick/Max  NOPS/tick/Max  NOPS/tick/Max
; SH=0  0/586/693  1/310/675   2/310/675
; SH=1  0/333/657  1/330/648   2/464/630
; SH=2  0/333/657  1/470/594   2/474/603

; P4
;       NOPS/tick/Max  NOPS/tick/Max  NOPS/tick/Max
; SH=0 0/1027/1317 1/1094/1258 2/1028/1207
; SH=1 0/1151/1377 1/1068/1352 2/902/1275
; SH=2 0/1124/1275 1/1148/1335 2/979/1139

Conclusion:
1. Core2 and Xeon show similar results; the P4 does something strange.
For Core2 & Xeon the padding is very effective: code with padding is almost 2
times faster. No sense for the P4?
2. My previous sentence

VVV> 1. The AMD limitation is for a 16-byte page (memory range XXX0 - XXXF), but
VVV> the Intel limitation is for a 16-byte chunk (any 16 consecutive bytes)

is wrong, at least for Core2 & Xeon. For these CPUs a "16-byte chunk" means the
memory range XXX0 - XXXF.

Unfortunately, I can't test AMD.

PS. My testing tool is in the attachment. It starts under MS-DOS, switches to
32-bit mode, then to 64-bit mode, and measures rdtsc ticks for the test code.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread hjl dot tools at gmail dot com


--- Comment #29 from hjl dot tools at gmail dot com  2009-05-13 21:44 
---
Created an attachment (id=17858)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17858&action=view)
Impact of X86_TUNE_FOUR_JUMP_LIMIT on SPEC CPU 2K

This is my old data of X86_TUNE_FOUR_JUMP_LIMIT on Penryn and Nehalem.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #28 from vvv at ru dot ru  2009-05-13 19:18 ---
(In reply to comment #24)
> Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
> it will increase code size.
A single one-byte NOP per 16-byte chunk is enough for padding. But, IMHO, four
branches in a 16-byte chunk is very, very infrequent, especially in 64-bit
mode.

BTW, it's difficult to understand what Intel means by the term "branch": is it
CALL, JMP, conditional branches, or returns (same as AMD), or only JMP and
conditional branches? I believe the last case is right.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread jakub at gcc dot gnu dot org


--- Comment #27 from jakub at gcc dot gnu dot org  2009-05-13 19:08 ---
If inserting the padding isn't worth it for say core2, m_CORE2 could be dropped
from X86_TUNE_FOUR_JUMP_LIMIT, but certainly it would be interesting to see
SPEC numbers backing that up.  Similarly for AMD CPUs, and if on at least one
of these it is beneficial, probably m_GENERIC should keep it, though with far
improved min_insn_size so that it doesn't trigger unnecessarily so often.

Vlad/Honza, could one of you SPEC test say current 4.4 with one (or both of):
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00703.html
on top of it (no need to compare clearly unneeded paddings to no paddings,
better compare only somewhat needed paddings against no paddings) compared with
X86_TUNE_FOUR_JUMP_LIMIT cleared for the used CPU?


-- 

jakub at gcc dot gnu dot org changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu dot
   ||org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #26 from vvv at ru dot ru  2009-05-13 19:05 ---
(In reply to comment #23)
> Note that we need something that works for the generic model as well, which in
> this case looks like it is the same as for AMD models.

There is a processor property TARGET_FOUR_JUMP_LIMIT; maybe create a new one,
TARGET_FIVE_JUMP_LIMIT?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #25 from vvv at ru dot ru  2009-05-13 18:56 ---
(In reply to comment #22)
> CCing H.J for Intel optimization issues.

VVV> 1. The AMD limitation is for a 16-byte page (memory range XXX0 - XXXF), but
VVV> the Intel limitation is for a 16-byte chunk (any 16 consecutive bytes)

I have doubts about this now, thanks to Richard Guenther (comment #20), so I am
going to make measurements to check it for Core2.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread hjl dot tools at gmail dot com


--- Comment #24 from hjl dot tools at gmail dot com  2009-05-13 18:45 
---
Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
it will increase code size.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread rguenth at gcc dot gnu dot org


--- Comment #23 from rguenth at gcc dot gnu dot org  2009-05-13 18:34 
---
Note that we need something that works for the generic model as well, which in
this case looks like it is the same as for AMD models.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread ubizjak at gmail dot com


--- Comment #22 from ubizjak at gmail dot com  2009-05-13 18:22 ---
(In reply to comment #21)
> I guess! Your patch is absolutely correct for AMD AthlonTM 64 and AMD 
> OpteronTM
> processors, but it is nonoptimal for Intel processors. Because:

...

CCing H.J for Intel optimization issues.


-- 

ubizjak at gmail dot com changed:

   What|Removed |Added

 CC||hjl dot tools at gmail dot
   ||com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #21 from vvv at ru dot ru  2009-05-13 17:13 ---
I guess! Your patch is absolutely correct for the AMD Athlon(TM) 64 and AMD
Opteron(TM) processors, but it is nonoptimal for Intel processors. Because:

1. The AMD limitation is per 16-byte page (memory range XXX0 - XXXF), but the
Intel limitation is per 16-byte chunk (any 16 consecutive bytes).
2. AMD: a maximum of _THREE_ near branches (CALL, JMP, conditional branches, or
returns);
Intel: a maximum of _FOUR_ branches!

Quotation from Software Optimization Guide for AMD64 Processors

6.1  Density of Branches
When possible, align branches such that they do not cross a 16-byte boundary.

The AMD AthlonTM 64 and AMD OpteronTM processors have the capability to cache
branch-prediction history for a maximum of three near branches (CALL, JMP,
conditional branches, or returns) per 16-byte fetch window. A branch
instruction that crosses a 16-byte boundary is counted in the second 16-byte
window. Due to architectural restrictions, a branch that is split across a
16-byte
boundary cannot dispatch with any other instructions when it is predicted
taken. Perform this alignment by rearranging code; it is not beneficial to
align branches using padding sequences.

The following branches are limited to three per 16-byte window:

jcc   rel8
jcc   rel32
jmp   rel8
jmp   rel32
jmp   reg
jmp   WORD PTR
jmp   DWORD PTR
call  rel16
call  r/m16
call  rel32
call  r/m32

Coding more than three branches in the same 16-byte code window may lead to
conflicts in the branch target buffer. To avoid conflicts in the branch target
buffer, space out branches such that three or fewer exist in a given 16-byte
code window. For absolute optimal performance, try to limit branches to one per
16-byte code window. Avoid code sequences like the following:
ALIGN 16
label3:
 call   label1  ; 1st branch in 16-byte code window
 jc     label3  ; 2nd branch in 16-byte code window
 call   label2  ; 3rd branch in 16-byte code window
 jnz    label4  ; 4th branch in 16-byte code window
                ; Cannot be predicted.
If there is a jump table that contains many frequently executed branches, pad
the table entries to 8 bytes each to assure that there are never more than
three branches per 16-byte block of code.
Only branches that have been taken at least once are entered into the dynamic
branch prediction, and therefore only those branches count toward the
three-branch limit.




-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread rguenth at gcc dot gnu dot org


--- Comment #20 from rguenth at gcc dot gnu dot org  2009-05-13 13:31 
---
Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
very very likely refers to a cache-line.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #19 from vvv at ru dot ru  2009-05-13 11:42 ---
(In reply to comment #18)
> No, .p2align is the right thing to do, given that GCC doesn't have 100%
> accurate information about instruction sizes (for e.g. inline asms it can't
> have, for
> stuff where branch shortening can decrease the size doesn't have it until the
> shortening branch phase which is too late for this machine reorg, and in other
> cases the lengths are just upper bounds).  Say .p2align 16,,5 says
> insert a nop up to 5 bytes if you can reach the 16-byte boundary with it,
> otherwise don't insert anything.  But that necessarily means that there were
> less than 11 bytes in the same 16 byte page and if the lower bound insn size
> estimation determined that in 11 bytes you can't have 3 branch changing
> instructions, you are fine.  Breaking of fused compare and jump (32-bit code
> only) is unfortunate, but inserting it before the cmp would mean often
> unnecessarily large padding.

You are right, if padding is required for every 16-byte page with 4 branches in
it. But Intel writes about a "16-byte chunk", not a "16-byte page".

Quote from Intel 64 and IA-32 Architectures Optimization Reference Manual:

Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put
more than four branches in a 16-byte chunk.

IMHO, here a chunk is the memory range from x to x+10h, where x is _any_ address.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread jakub at gcc dot gnu dot org


--- Comment #18 from jakub at gcc dot gnu dot org  2009-05-13 08:30 ---
No, .p2align is the right thing to do, given that GCC doesn't have 100%
accurate information about instruction sizes (for e.g. inline asms it can't
have, for
stuff where branch shortening can decrease the size doesn't have it until the
shortening branch phase which is too late for this machine reorg, and in other
cases the lengths are just upper bounds).  Say .p2align 4,,5 says:
insert up to 5 bytes of NOPs if that reaches the 16-byte boundary,
otherwise don't insert anything.  But that necessarily means that there were
less than 11 bytes in the same 16 byte page and if the lower bound insn size
estimation determined that in 11 bytes you can't have 3 branch changing
instructions, you are fine.  Breaking of fused compare and jump (32-bit code
only) is unfortunate, but inserting it before the cmp would mean often
unnecessarily large padding.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-12 Thread vvv at ru dot ru


--- Comment #17 from vvv at ru dot ru  2009-05-12 16:40 ---
(In reply to comment #16)
> Created an attachment (id=17783)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
> gcc45-pr39942.patch
> Patch that attempts to take into account .p2align directives that are emitted
> for (some) CODE_LABELs and also the gen_align insns that the pass itself
> inserts.  For a CODE_LABEL, say .p2align 4,,10 means either that the .p2align
> directive starts a new 16 byte page (then insns before it are never
> interesting), or nothing was skipped because more than 10 bytes would need to
> be skipped.  But that means the current group could contain only 5 or less
> bytes of instructions before the label, so again, we don't have to look at
> instructions not in the last 5 bytes.
> Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't emit
> the second .p2align 3, which might (and often does) skip more than MAX_SKIP
> bytes (up to 7).

Nice patch. The code looks better; I checked it on Linux kernel 2.6.29.2.
But two notes:

1. There is no guarantee that .p2align will be translated to NOPs. Example:

# cat test.c
void f(int i)
{
if (i == 1) F(1);
if (i == 2) F(2);
if (i == 3) F(3);
if (i == 4) F(4);
if (i == 5) F(5);
}
# gcc -o test.s test.c -O2 -S
# cat test.s
        .file   "test.c"
        .text
        .p2align 4,,15
.globl f
        .type   f, @function
f:
.LFB0:
        .cfi_startproc
        cmpl    $1, %edi
        je      .L7
        cmpl    $2, %edi
        je      .L7
        cmpl    $3, %edi
        je      .L7
        cmpl    $4, %edi
        .p2align 4,,5           <--- attempted padding
        je      .L7
        cmpl    $5, %edi
        je      .L7
        rep
        ret
        .p2align 4,,10
        .p2align 3
.L7:
        xorl    %eax, %eax
        jmp     F
        .cfi_endproc
.LFE0:
        .size   f, .-f
        .ident  "GCC: (GNU) 4.5.0 20090512 (experimental)"
        .section        .note.GNU-stack,"",@progbits

# gcc -o test.out test.s -O2 -c
# objdump -d test.out
0000000000000000 <f>:
   0:   83 ff 01                cmp    $0x1,%edi
   3:   74 1b                   je     20 <f+0x20>
   5:   83 ff 02                cmp    $0x2,%edi
   8:   74 16                   je     20 <f+0x20>
   a:   83 ff 03                cmp    $0x3,%edi
   d:   74 11                   je     20 <f+0x20>
   f:   83 ff 04                cmp    $0x4,%edi
  12:   74 0c                   je     20 <f+0x20>   <--- no NOP here
  14:   83 ff 05                cmp    $0x5,%edi
  17:   74 07                   je     20 <f+0x20>
  19:   f3 c3                   repz retq

IMHO, it's better to insert NOPs directly rather than .p2align. (I mean the
line: emit_insn_before (gen_align (GEN_INT (padsize)), insn); )

2. IMHO, it's a bad idea to insert something between a CMP and a conditional jump.
Quote from Intel 64 and IA-32 Architectures Optimization Reference Manual

>> 3.4.2.2   Optimizing for Macro-fusion
>> Macro-fusion merges two instructions to a single μop. Intel Core 
>> Microarchitecture
>> performs this hardware optimization under limited circumstances.
>> The first instruction of the macro-fused pair must be a CMP or TEST 
>> instruction. This
>> instruction can be REG-REG, REG-IMM, or a micro-fused REG-MEM comparison. The
>> second instruction (adjacent in the instruction stream) should be a 
>> conditional
>> branch.

So if we need to insert NOPs, it's better to do it _before_ the CMP.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-30 Thread jakub at gcc dot gnu dot org


--- Comment #16 from jakub at gcc dot gnu dot org  2009-04-30 09:07 ---
Created an attachment (id=17783)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
gcc45-pr39942.patch

Patch that attempts to take into account .p2align directives that are emitted
for (some) CODE_LABELs and also the gen_align insns that the pass itself
inserts.  For a CODE_LABEL, say .p2align 4,,10 means either that the .p2align
directive starts a new 16 byte page (then insns before it are never
interesting), or nothing was skipped because more than 10 bytes would need to
be skipped.  But that means the current group could contain only 5 or less
bytes of instructions before the label, so again, we don't have to look at
instructions not in the last 5 bytes.

Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't emit
the second .p2align 3, which might (and often does) skip more than MAX_SKIP
bytes (up to 7).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread vvv at ru dot ru


--- Comment #15 from vvv at ru dot ru  2009-04-29 19:16 ---
One more example: a 5-byte NOP between leaveq and retq.

# cat test.c

void wait_for_enter()
{
int u = getchar();
while (!u)
u = getchar()-13;
}
main()
{
wait_for_enter();
}

# gcc -o t.out test.c -O2 -march=core2 -fno-omit-frame-pointer
# objdump -d t.out
...
0000000000400540 <wait_for_enter>:
  400540:   55                      push   %rbp
  400541:   31 c0                   xor    %eax,%eax
  400543:   48 89 e5                mov    %rsp,%rbp
  400546:   e8 f5 fe ff ff          callq  400440
  40054b:   85 c0                   test   %eax,%eax
  40054d:   75 13                   jne    400562 <wait_for_enter+0x22>
  40054f:   90                      nop
  400550:   31 c0                   xor    %eax,%eax
  400552:   e8 e9 fe ff ff          callq  400440
  400557:   83 f8 0d                cmp    $0xd,%eax
  40055a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  400560:   74 ee                   je     400550 <wait_for_enter+0x10>
  400562:   c9                      leaveq
  400563:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)  <-- NONOPTIMAL!
  400568:   c3                      retq
  400569:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

0000000000400570 <main>:
  400570:   55                      push   %rbp
  400571:   31 c0                   xor    %eax,%eax
  400573:   48 89 e5                mov    %rsp,%rbp
  400576:   e8 c5 ff ff ff          callq  400540 <wait_for_enter>
  40057b:   c9                      leaveq
  40057c:   c3                      retq
  40057d:   90                      nop
  40057e:   90                      nop
  40057f:   90                      nop

So the bug is unresolved.


-- 

vvv at ru dot ru changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread jakub at gcc dot gnu dot org


--- Comment #14 from jakub at gcc dot gnu dot org  2009-04-29 10:12 ---
Also, couldn't we use the information computed by compute_alignments and
assume CODE_LABELs are aligned?
Probably would need to add label_to_max_skip (rtx) function to final.c,
so that not just label_to_alignment, but also LABEL_TO_MAX_SKIP is available to
backends.  Then when we know for the label in the testcase that
.p2align 4,,10
.p2align 3
then we know the 16 byte boundary is either at that label, or at most 5 bytes
before it, so all we need is consider any jumps/calls in the last 5 bytes
before the label.

For min_insn_size, is it possible to find out for which non-jump/call insns
get_attr_length might not be exact (i.e. be a maximum guess rather than
guaranteed size (though of course, this is also just an optimization, so 100%
guarantees aren't needed either))?  If so, we could use get_attr_length for the
insns where it is known to be exact...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread jakub at gcc dot gnu dot org


--- Comment #13 from jakub at gcc dot gnu dot org  2009-04-29 09:32 ---
You are benchmarking something completely unrelated.
What really matters is how code that has 4 branches/calls in one 16-byte block
is able to predict all those branches.  And Core2, similarly to various AMD
CPUs, is not able to predict them well.

In the #c6 testcase it considers whether the je, call, jne and ret can be in
a 16-byte block or not.  They can't: je is 2 bytes, call 5 bytes, leal 4
bytes (but gcc uses min_insn_size, which is 2 in this case), testl 2, jne 2,
addq 4 (but again, min_insn_size is 2 in this case).
min_insn_size seems to be very conservative, I guess teaching it about a bunch
of prefixes couldn't hurt, for non-jump/call insns ATM it estimates just the
displacement size, doesn't consider any prefixes (even those that really can't
change after machine reorg), etc.


-- 

jakub at gcc dot gnu dot org changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu dot
   ||org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread vvv at ru dot ru


--- Comment #12 from vvv at ru dot ru  2009-04-29 07:55 ---
(In reply to comment #9)
> So that explains it, Use -Os or attribute cold if you want NOPs to be gone.

But my measurements on Core 2 Duo P8600 show that

push %ebp
mov  %esp,%ebp
leave
ret

_faster_ than

push %ebp
mov  %esp,%ebp
leave
xchg %ax,%ax
ret


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread vvv at ru dot ru


--- Comment #11 from vvv at ru dot ru  2009-04-29 07:46 ---
(In reply to comment #8)
> From config/i386/i386.c:
> /* AMD Athlon works faster
>when RET is not destination of conditional jump or directly preceded
>by other jump instruction.  We avoid the penalty by inserting NOP just
>before the RET instructions in such cases.  */
> static void
> ix86_pad_returns (void)
> ...

But I am using a Core 2 Duo.
Why do we see a multibyte NOP, not a single-byte NOP?
Why does the number of NOPs change if I change the line u = F(u)*3+1; to
u = F(u)*4+1; or u = F(u);?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread ubizjak at gmail dot com


--- Comment #10 from ubizjak at gmail dot com  2009-04-28 21:53 ---
Actually, alignment is from ix86_avoid_jump_misspredicts, where:

  /* Look for all minimal intervals of instructions containing 4 jumps.
 The intervals are bounded by START and INSN.  NBYTES is the total
 size of instructions in the interval including INSN and not including
 START.  When the NBYTES is smaller than 16 bytes, it is possible
 that the end of START and INSN ends up in the same 16byte page.

 The smallest offset in the page INSN can start is the case where START
 ends on the offset 0.  Offset of INSN is then NBYTES - sizeof (INSN).
 We add p2align to 16byte window with maxskip 17 - NBYTES + sizeof (INSN).
 */

So, this is by design. Use -Os if code size is important.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread pinskia at gcc dot gnu dot org


--- Comment #9 from pinskia at gcc dot gnu dot org  2009-04-28 21:52 ---
So that explains it, Use -Os or attribute cold if you want NOPs to be gone.


-- 

pinskia at gcc dot gnu dot org changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution||INVALID


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread ubizjak at gmail dot com


--- Comment #8 from ubizjak at gmail dot com  2009-04-28 21:47 ---
From config/i386/i386.c:

/* AMD Athlon works faster
   when RET is not destination of conditional jump or directly preceded
   by other jump instruction.  We avoid the penalty by inserting NOP just
   before the RET instructions in such cases.  */
static void
ix86_pad_returns (void)
...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru


--- Comment #6 from vvv at ru dot ru  2009-04-28 21:18 ---
Let's compile file test.c
//#file test.c

extern int F(int m);

void func(int x)
{
int u = F(x);
while (u)
u = F(u)*3+1;
}


# gcc -o t.out test.c -c -O2
# objdump -d t.out

t.out: file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <func>:
   0:   48 83 ec 08             sub    $0x8,%rsp
   4:   e8 00 00 00 00          callq  9 <func+0x9>
   9:   85 c0                   test   %eax,%eax
   b:   89 c7                   mov    %eax,%edi
   d:   74 0e                   je     1d <func+0x1d>
   f:   90                      nop
  10:   e8 00 00 00 00          callq  15 <func+0x15>
  15:   8d 7c 40 01             lea    0x1(%rax,%rax,2),%edi
  19:   85 ff                   test   %edi,%edi
  1b:   75 f3                   jne    10 <func+0x10>
  1d:   48 83 c4 08             add    $0x8,%rsp
  21:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)   <----- nonoptimal
  28:   c3                      retq


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread pinskia at gcc dot gnu dot org


--- Comment #7 from pinskia at gcc dot gnu dot org  2009-04-28 21:23 ---
4.1.2 produces:
.L4:
addq$8, %rsp
.p2align 4,,2
ret

While the trunk produces:
.L1:
addq$8, %rsp
.p2align 4,,2
.p2align 3
ret


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread ubizjak at gmail dot com


--- Comment #5 from ubizjak at gmail dot com  2009-04-28 17:37 ---
Unfortunately, all code snippets and dumps are of no use. Please see
http://gcc.gnu.org/bugs.html for the reason why.

As an exercise, please compile *standalone* _preprocessed_ source you will
create with -S added to your compile flags and count the number of .p2align
directives in the code stream.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru


--- Comment #4 from vvv at ru dot ru  2009-04-28 17:15 ---
Created an attachment (id=1)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=1&action=view)
Simple example from Linux

See two functons:
static void pre_schedule_rt
static void switched_from_rt


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru


--- Comment #3 from vvv at ru dot ru  2009-04-28 17:10 ---
Additional examples from Linux Kernel 2.6.29.1:
(Note the conditional statement at the end of all functions!)
=
linux/drivers/video/console/bitblit.c

void fbcon_set_bitops(struct fbcon_ops *ops)
{
ops->bmove = bit_bmove;
ops->clear = bit_clear;
ops->putcs = bit_putcs;
ops->clear_margins = bit_clear_margins;
ops->cursor = bit_cursor;
ops->update_start = bit_update_start;
ops->rotate_font = NULL;

if (ops->rotate)
fbcon_set_rotate(ops);
}



8020a5e0 :
8020a5e0:   55                      push   %rbp
8020a5e1:   bf 01 00 00 00          mov    $0x1,%edi
8020a5e6:   48 89 e5                mov    %rsp,%rbp
8020a5e9:   e8 c2 fd 35 00          callq  8056a3b0
8020a5ee:   65 48 8b 04 25 10 00    mov    %gs:0x10,%rax
8020a5f5:   00 00
8020a5f7:   48 2d c8 1f 00 00       sub    $0x1fc8,%rax
8020a5fd:   f0 0f ba 28 10          lock btsl $0x10,(%rax)
8020a602:   19 d2                   sbb    %edx,%edx
8020a604:   85 d2                   test   %edx,%edx
8020a606:   75 0a                   jne    8020a612
8020a608:   0f 20 e0                mov    %cr4,%rax
8020a60b:   48 83 c8 04             or     $0x4,%rax
8020a60f:   0f 22 e0                mov    %rax,%cr4
8020a612:   bf 01 00 00 00          mov    $0x1,%edi
8020a617:   e8 e4 fc 35 00          callq  8056a300
8020a61c:   65 48 8b 04 25 10 00    mov    %gs:0x10,%rax
8020a623:   00 00
8020a625:   f6 80 38 e0 ff ff 08    testb  $0x8,-0x1fc8(%rax)
8020a62c:   75 02                   jne    8020a630
8020a62e:   c9                      leaveq
8020a62f:   c3                      retq
8020a630:   e8 2b 99 35 00          callq  80563f60
8020a635:   c9                      leaveq
8020a636:   66 90                   xchg   %ax,%ax
8020a638:   c3                      retq

==
/arch/x86/kernel/io_delay.c

void native_io_delay(void)
{
switch (io_delay_type) {
default:
case CONFIG_IO_DELAY_TYPE_0X80:
asm volatile ("outb %al, $0x80");
break;
case CONFIG_IO_DELAY_TYPE_0XED:
asm volatile ("outb %al, $0xed");
break;
case CONFIG_IO_DELAY_TYPE_UDELAY:
/*
 * 2 usecs is an upper-bound for the outb delay but
 * note that udelay doesn't have the bus-level
 * side-effects that outb does, nor does udelay() have
 * precise timings during very early bootup (the delays
 * are shorter until calibrated):
 */
udelay(2);
case CONFIG_IO_DELAY_TYPE_NONE:
break;
}
}
EXPORT_SYMBOL(native_io_delay);

802131e0 <native_io_delay>:
802131e0:   55                      push   %rbp
802131e1:   8b 05 3d b3 54 00       mov    0x54b33d(%rip),%eax        # 8075e524
802131e7:   48 89 e5                mov    %rsp,%rbp
802131ea:   83 f8 02                cmp    $0x2,%eax
802131ed:   74 29                   je     80213218
802131ef:   83 f8 03                cmp    $0x3,%eax
802131f2:   74 06                   je     802131fa
802131f4:   ff c8                   dec    %eax
802131f6:   74 10                   je     80213208
802131f8:   e6 80                   out    %al,$0x80
802131fa:   c9                      leaveq
802131fb:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
80213200:   c3                      retq
80213201:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
80213208:   e6 ed                   out    %al,$0xed
8021320a:   c9                      leaveq
8021320b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
80213210:   c3                      retq
80213211:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
80213218:   bf 8e 21 00 00          mov    $0x218e,%edi
8021321d:   0f 1f 00                nopl   (%rax)
80213220:   e8 fb ac 1e 00          callq  803fdf20 <__const_udelay>
80213225:   c9                      leaveq
80213226:   66 90                   xchg   %ax,%ax
80213228:   c3                      retq

===
arch/x86/mm/ioremap.c

int ioremap_change_attr(unsigned long vaddr, unsigned long size,
   unsig

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru


--- Comment #2 from vvv at ru dot ru  2009-04-28 17:04 ---
Created an attachment (id=17776)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view)
Source file from Linux Kernel 2.6.29.1

See static void set_blitting_type


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread pinskia at gcc dot gnu dot org


--- Comment #1 from pinskia at gcc dot gnu dot org  2009-04-28 13:42 ---
Can you provide the preprocessed source which contains set_blitting_type?


-- 

pinskia at gcc dot gnu dot org changed:

   What|Removed |Added

 Status|UNCONFIRMED |WAITING
  Component|c   |target
 GCC target triplet||x86_64-linux-gnu
   Keywords||missed-optimization


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942