[Bug c/40363] New: Nonoptimal save/restore registers

2009-06-06 Thread vvv at ru dot ru
IMHO, the current register save/restore strategy is not optimal. Look:

# cat test.c

#include <stdio.h>
void print(char *mess, char *format, int text)
{
printf(mess);
printf(format,text);
}
void main()
{
print("X=","%d\n",1);
}

# gcc --version
gcc (GCC) 4.5.0 20090601 (experimental)
# gcc -o test test.c -O2
# objdump -d test

004004d0 <print>:
  4004d0:   48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
  4004d5:   48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
  4004da:   48 89 f3                mov    %rsi,%rbx
  4004dd:   48 83 ec 18             sub    $0x18,%rsp
  4004e1:   89 d5                   mov    %edx,%ebp
  4004e3:   31 c0                   xor    %eax,%eax
  4004e5:   e8 ce fe ff ff          callq  4003b8 <pri...@plt>
  4004ea:   89 ee                   mov    %ebp,%esi
  4004ec:   48 89 df                mov    %rbx,%rdi
  4004ef:   48 8b 6c 24 10          mov    0x10(%rsp),%rbp
  4004f4:   48 8b 5c 24 08          mov    0x8(%rsp),%rbx
  4004f9:   31 c0                   xor    %eax,%eax
  4004fb:   48 83 c4 18             add    $0x18,%rsp
  4004ff:   e9 b4 fe ff ff          jmpq   4003b8 <pri...@plt>

=

Let's replace the current save/restore:

48 89 5c 24 f0  mov    %rbx,-0x10(%rsp)
48 89 6c 24 f8  mov    %rbp,-0x8(%rsp)
48 83 ec 18     sub    $0x18,%rsp
...
48 8b 6c 24 10  mov    0x10(%rsp),%rbp
48 8b 5c 24 08  mov    0x8(%rsp),%rbx
48 83 c4 18     add    $0x18,%rsp

with the faster and shorter new save/restore:

55  push   %rbp
53  push   %rbx
53  push   %rbx ; dummy push
...
5b  pop    %rbx ; dummy pop
5b  pop    %rbx
5d  pop    %rbp

IMPORTANT note: for faster execution, the dummy push has to use the same
register as the previous push!

Measurement results on Core2: the new save/restore is 5 ticks faster than the current one.
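
For illustration, the print function above could look roughly like this with the
proposed scheme (a sketch, not compiler output; the three pushes make the same
0x18-byte stack adjustment as sub $0x18,%rsp, so the stack stays 16-byte aligned
at the call):

print:
        push   %rbp              # save callee-saved registers
        push   %rbx
        push   %rbx              # dummy push, same register as the previous push
        mov    %rsi,%rbx
        mov    %edx,%ebp
        xor    %eax,%eax
        callq  printf@plt
        mov    %ebp,%esi
        mov    %rbx,%rdi
        xor    %eax,%eax
        pop    %rbx              # dummy pop
        pop    %rbx
        pop    %rbp
        jmpq   printf@plt        # tail call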

Regards,
 Vladimir Volynsky


-- 
   Summary: Nonoptimal save/restore registers
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40363



[Bug target/40171] GCC does not pass -mtune and -march options to assembler!

2009-05-25 Thread vvv at ru dot ru


--- Comment #4 from vvv at ru dot ru  2009-05-25 19:54 ---
(In reply to comment #2)
 This is very odd?  What is the assembler doing that the compiler isn't?

There exist some optimizations that are impossible without exact knowledge of
addresses and opcodes.
One example is avoiding branch mispredicts -
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
Another example - ensuring that instructions using the 0xF7 opcode byte do not
start at offset 14 of a fetch line...

Unfortunately, the current version of GNU AS can't do these optimizations.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40171



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-20 Thread vvv at ru dot ru


--- Comment #49 from vvv at ru dot ru  2009-05-20 21:38 ---
(In reply to comment #48)
How do these patches work? Do they require some special options?

# /media/disk-1/B/bin/gcc --version
gcc (GCC) 4.5.0 20090520 (experimental)
# cat test.c
void f(int i)
{
if (i == 1) F(1);
if (i == 2) F(2);
if (i == 3) F(3);
if (i == 4) F(4);
if (i == 5) F(5);
}
extern int F(int m);
void func(int x)
{
int u = F(x);
while (u)
u = F(u)*3+1;
}
# /media/disk-1/B/bin/gcc -o t test.c -O2 -c -mtune=k8
# objdump -d t
 f:
   0:   83 ff 01cmp$0x1,%edi
   3:   74 1b   je 20 f+0x20
   5:   83 ff 02cmp$0x2,%edi
   8:   74 16   je 20 f+0x20
   a:   83 ff 03cmp$0x3,%edi
   d:   74 11   je 20 f+0x20
   f:   83 ff 04cmp$0x4,%edi
  12:   74 0c   je 20 f+0x20
  14:   83 ff 05cmp$0x5,%edi
  17:   74 07   je 20 f+0x20
  19:   f3 c3   repz retq 
  1b:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
  20:   31 c0   xor%eax,%eax
  22:   e9 00 00 00 00  jmpq   27 f+0x27
  27:   66 0f 1f 84 00 00 00nopw   0x0(%rax,%rax,1)
  2e:   00 00 

0030 func:
  30:   48 83 ec 08 sub$0x8,%rsp
  34:   e8 00 00 00 00  callq  39 func+0x9
  39:   85 c0   test   %eax,%eax
  3b:   89 c7   mov%eax,%edi
  3d:   74 0e   je 4d func+0x1d
  3f:   90  nop
  40:   e8 00 00 00 00  callq  45 func+0x15
  45:   8d 7c 40 01 lea0x1(%rax,%rax,2),%edi
  49:   85 ff   test   %edi,%edi
  4b:   75 f3   jne40 func+0x10
  4d:   48 83 c4 08 add$0x8,%rsp
  51:   c3  retq   

I can't see any padding in function f :(

PS. In file config/i386/i386.c (ix86_avoid_jump_mispredicts)

  /* Look for all minimal intervals of instructions containing 4 jumps.
...

Not jumps, but _branches_ (CALL, JMP, conditional branches, or returns) 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug c/40171] New: GCC does not pass -mtune and -march options to assembler!

2009-05-16 Thread vvv at ru dot ru
The GNU Assembler supports optimization options, but GCC does not pass the
-mtune and -march options to the assembler! For full optimization it is
necessary to specify them twice:

# gcc ... -mtune=core2 -Wa,-mtune=core2

By default GCC does not pass optimization options to AS, but many programmers
assume that it does, because it is very strange to optimize code at the GCC
level and not at the assembler level.

Even the Linux kernel uses -march without -Wa,-march.

PS. Please CC v...@ru.ru.


-- 
   Summary: GCC does not pass -mtune and -march options to
assembler!
   Product: gcc
   Version: 4.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40171



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread vvv at ru dot ru


--- Comment #30 from vvv at ru dot ru  2009-05-14 09:01 ---
Created an attachment (id=17863)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.

Here are the results of my testing.
Code:
align   128
test_cikl:
rept 14 ; 14 if SH=0, 15 if SH=1, 16 if SH=2
{
nop
}
 cmp  al,0   ; 2 bytes

 jz   $+10h+NOPS ; 2 bytes offset=0
 cmp  al,1   ; 2 bytes offset=2
 jz   $+0Ch+NOPS ; 2 bytes offset=4
 cmp  al,2   ; 2 bytes offset=6
 jz   $+08h+NOPS ; 2 bytes offset=8
 cmp  al,3   ; 2 bytes offset=A
match =1, NOPS
{
   nop
}
match =2, NOPS
{
   xchg eax,eax ; 2-bytes NOP
}
 jz   $+04h  ; 2 bytes offset=C
 ja   $+02h  ; 2 bytes offset=E

 mov  eax,ecx
 and  eax,7h
 loop test_cikl

This code was tested on Core2, Xeon and P4 CPUs. Results are in RDTSC ticks.

; Core 2 Duo
;       NOPS/tick/Max   NOPS/tick/Max   NOPS/tick/Max
; SH=0  0/571/729  1/306/594   2/315/630
; SH=1  0/338/612  1/338/648   2/339/648
; SH=2  0/339/666  1/339/675   2/333/693

; Xeon 3110
;       NOPS/tick/Max   NOPS/tick/Max   NOPS/tick/Max
; SH=0  0/586/693  1/310/675   2/310/675
; SH=1  0/333/657  1/330/648   2/464/630
; SH=2  0/333/657  1/470/594   2/474/603

; P4
;       NOPS/tick/Max   NOPS/tick/Max   NOPS/tick/Max
; SH=0 0/1027/1317 1/1094/1258 2/1028/1207
; SH=1 0/1151/1377 1/1068/1352 2/902/1275
; SH=2 0/1124/1275 1/1148/1335 2/979/1139

Conclusion:
1. Core2 and Xeon show similar results. P4 - something strange.
For Core2 & Xeon the padding is very effective: code with padding is almost 2
times faster. No sense for P4?
2. My previous statement

VVV> 1. AMD limitation is for a 16-byte page (memory range XXX0 - XXXF), but
VVV> Intel limitation is for a 16-byte chunk (memory range x - x+10h)

is wrong. At least for Core2 & Xeon. For these CPUs a 16-byte chunk means the
memory range XXX0 - XXXF.

Unfortunately, I can't test AMD.

PS. My testing tool is in the attachment. It starts under MS-DOS, switches to
32-bit mode, then to 64-bit mode, and measures RDTSC ticks for the test code.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-14 Thread vvv at ru dot ru


--- Comment #34 from vvv at ru dot ru  2009-05-14 19:43 ---
(In reply to comment #32)
 Please make sure that you only test nop paddings for branch insns,
 not nop paddings for branch targets, which prefer 16byte alignment.

Additional test results (for Core2):
1. Execution time does not depend on padding for branch targets.
2. Execution time does not depend on the position of the NOP within a 16-byte
chunk with 4 branches, even if the NOP is inserted between CMP and the
conditional jump.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #19 from vvv at ru dot ru  2009-05-13 11:42 ---
(In reply to comment #18)
 No, .p2align is the right thing to do, given that GCC doesn't have 100%
 accurate information about instruction sizes (for e.g. inline asms it can't
 have, for
 stuff where branch shortening can decrease the size doesn't have it until the
 shortening branch phase which is too late for this machine reorg, and in other
 cases the lengths are just upper bounds).  Say .p2align 16,,5 says
 insert a nop up to 5 bytes if you can reach the 16-byte boundary with it,
 otherwise don't insert anything.  But that necessarily means that there were
 less than 11 bytes in the same 16 byte page and if the lower bound insn size
 estimation determined that in 11 bytes you can't have 3 branch changing
 instructions, you are fine.  Breaking of fused compare and jump (32-bit code
 only) is unfortunate, but inserting it before the cmp would mean often
 unnecessarily large padding.

You are right, if padding is required for every 16-byte page with 4 branches in
it. But Intel writes about a 16-byte chunk, not a 16-byte page.

Quote from Intel 64 and IA-32 Architectures Optimization Reference Manual:

Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put
more than four branches in a 16-byte chunk.

IMHO, here a chunk is a memory range from x to x+10h, where x is _any_ address.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #21 from vvv at ru dot ru  2009-05-13 17:13 ---
I guess! Your patch is absolutely correct for AMD Athlon(TM) 64 and AMD
Opteron(TM) processors, but it is nonoptimal for Intel processors. Because:

1. The AMD limitation is for a 16-byte page (memory range XXX0 - XXXF), but the
Intel limitation is for a 16-byte chunk (memory range x - x+10h).
2. AMD - a maximum of _THREE_ near branches (CALL, JMP, conditional branches, or
returns),
Intel - a maximum of _FOUR_ branches!

Quotation from Software Optimization Guide for AMD64 Processors

6.1  Density of Branches
When possible, align branches such that they do not cross a 16-byte boundary.

The AMD AthlonTM 64 and AMD OpteronTM processors have the capability to cache
branch-prediction history for a maximum of three near branches (CALL, JMP,
conditional branches, or returns) per 16-byte fetch window. A branch
instruction that crosses a 16-byte boundary is counted in the second 16-byte
window. Due to architectural restrictions, a branch that is split across a
16-byte
boundary cannot dispatch with any other instructions when it is predicted
taken. Perform this alignment by rearranging code; it is not beneficial to
align branches using padding sequences.

The following branches are limited to three per 16-byte window:

jcc   rel8
jcc   rel32
jmp   rel8
jmp   rel32
jmp   reg
jmp   WORD PTR
jmp   DWORD PTR
call  rel16
call  r/m16
call  rel32
call  r/m32

Coding more than three branches in the same 16-byte code window may lead to
conflicts in the branch target buffer. To avoid conflicts in the branch target
buffer, space out branches such that three or fewer exist in a given 16-byte
code window. For absolute optimal performance, try to limit branches to one per
16-byte code window. Avoid code sequences like the following:
ALIGN 16
label3:
 call   label1 ;  1st branch in 16-byte code   window
 jc label3 ;  2nd branch in 16-byte code   window
 call   label2 ;  3rd branch in 16-byte code   window
 jnzlabel4 ;  4th branch in 16-byte code   window
   ;  Cannot be predicted.
If there is a jump table that contains many frequently executed branches, pad
the table entries to 8 bytes each to assure that there are never more than
three branches per 16-byte block of code.
Only branches that have been taken at least once are entered into the dynamic
branch prediction, and therefore only those branches count toward the
three-branch limit.
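
As an illustration of the last point, a branch table whose jmp entries are padded
to 8 bytes could look roughly like this (a hypothetical sketch, not from the
guide; the labels are made up, jmp rel32 is 5 bytes and .p2align 3 pads each
entry to 8 bytes):

        .p2align 4
jump_table:
        jmp    handler0
        .p2align 3               # pad the entry to 8 bytes
        jmp    handler1
        .p2align 3
        jmp    handler2          # at most two table branches per 16-byte window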




-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #25 from vvv at ru dot ru  2009-05-13 18:56 ---
(In reply to comment #22)
 CCing H.J for Intel optimization issues.

VVV> 1. AMD limitation is for a 16-byte page (memory range XXX0 - XXXF), but
VVV> Intel limitation is for a 16-byte chunk (memory range x - x+10h)

I have doubts about this now. Thanks to Richard Guenther (comment #20). So I am
going to make measurements to check it for Core2.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #26 from vvv at ru dot ru  2009-05-13 19:05 ---
(In reply to comment #23)
 Note that we need something that works for the generic model as well, which in
 this case looks like it is the same as for AMD models.

There is a processor property TARGET_FOUR_JUMP_LIMIT; maybe create a new one -
TARGET_FIVE_JUMP_LIMIT?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-13 Thread vvv at ru dot ru


--- Comment #28 from vvv at ru dot ru  2009-05-13 19:18 ---
(In reply to comment #24)
 Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
 it will increase code size.
A single one-byte NOP per 16-byte chunk is enough for the padding. But, IMHO,
four branches in a 16-byte chunk is very, very infrequent, especially in 64-bit
mode.

BTW, it's difficult to understand what Intel means by the term "branch". Is it
CALL, JMP, conditional branches, or returns (same as AMD), or only JMP and
conditional branches? I believe the last case is right.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-05-12 Thread vvv at ru dot ru


--- Comment #17 from vvv at ru dot ru  2009-05-12 16:40 ---
(In reply to comment #16)
 Created an attachment (id=17783)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view) [edit]
 gcc45-pr39942.patch
 Patch that attempts to take into account .p2align directives that are emitted
 for (some) CODE_LABELs and also the gen_align insns that the pass itself
 inserts.  For a CODE_LABEL, say .p2align 16,,10 means either that the .p2align
 directive starts a new 16 byte page (then insns before it are never
 interesting), or nothing was skipped because more than 10 bytes would need to
 be skipped.  But that means the current group could contain only 5 or less
 bytes of instructions before the label, so again, we don't have to look at
 instructions not in the last 5 bytes.
 Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't emit
 the second .p2align 3, which might (and often does) skip more than MAX_SKIP
 bytes (up to 7).

Nice patch. The code looks better. I checked it on Linux kernel 2.6.29.2.
But 2 notes:

1. There is no guarantee that .p2align will be translated to NOPs. Example:

# cat test.c
void f(int i)
{
if (i == 1) F(1);
if (i == 2) F(2);
if (i == 3) F(3);
if (i == 4) F(4);
if (i == 5) F(5);
}
# gcc -o test.s test.c -O2 -S
# cat test.s
.file   "test.c"
.text
.p2align 4,,15
.globl f
.type   f, @function
f:
.LFB0:
.cfi_startproc
cmpl$1, %edi
je  .L7
cmpl$2, %edi
je  .L7
cmpl$3, %edi
je  .L7
cmpl$4, %edi
.p2align 4,,5       <--- attempt at padding
je  .L7
cmpl$5, %edi
je  .L7
rep
ret
.p2align 4,,10
.p2align 3
.L7:
xorl%eax, %eax
jmp F
.cfi_endproc
.LFE0:
.size   f, .-f
.ident  "GCC: (GNU) 4.5.0 20090512 (experimental)"
.section        .note.GNU-stack,"",@progbits

# gcc -o test.out test.s -O2 -c
# objdump -d test.out
 f:
   0:   83 ff 01cmp$0x1,%edi
   3:   74 1b   je 20 f+0x20
   5:   83 ff 02cmp$0x2,%edi
   8:   74 16   je 20 f+0x20
   a:   83 ff 03cmp$0x3,%edi
   d:   74 11   je 20 f+0x20
   f:   83 ff 04cmp$0x4,%edi
  12:   74 0c                   je     20 <f+0x20>      <--- no NOP here
  14:   83 ff 05cmp$0x5,%edi
  17:   74 07   je 20 f+0x20
  19:   f3 c3   repz retq 

IMHO, it is better to insert NOPs directly rather than .p2align. (I mean the
line: emit_insn_before (gen_align (GEN_INT (padsize)), insn); )

2. IMHO, it's a bad idea to insert something between CMP and a conditional jump.
Quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual:

 3.4.2.2   Optimizing for Macro-fusion
 Macro-fusion merges two instructions to a single μop. Intel Core 
 Microarchitecture
 performs this hardware optimization under limited circumstances.
 The first instruction of the macro-fused pair must be a CMP or TEST 
 instruction. This
 instruction can be REG-REG, REG-IMM, or a micro-fused REG-MEM comparison. The
 second instruction (adjacent in the instruction stream) should be a 
 conditional
 branch.

So if we need to insert NOPs, it is better to do it _before_ the CMP.
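
For example (a sketch based on the f listing above, not patch output; whether the
pair actually fuses depends on the CPU and mode):

        # NOP between cmp and jcc separates the would-be fused pair:
        cmpl   $4, %edi
        nop
        je     .L7

        # NOP before the cmp keeps cmp+jcc adjacent:
        nop
        cmpl   $4, %edi
        je     .L7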


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug c/40093] New: Optimization by functions reordering.

2009-05-10 Thread vvv at ru dot ru
Because the memory controller prefetches memory blocks, the execution time of a
sequence of function calls depends on the order of these functions in memory.
For example, 4 calls:

call func1
call func2
call func3
call func4

are faster when the functions are in direct order in memory:
.p2align 4
func1:
  ret
.p2align 4
func2:
  ret
.p2align 4
func3:
  ret
.p2align 4
func4:
  ret

and slower when they are in inverse order:
.p2align 4
func4:
  ret
.p2align 4
func3:
  ret
.p2align 4
func2:
  ret
.p2align 4
func1:
  ret

Unfortunately, inverse order is typical for C/C++.
What do you think about this kind of optimization?


-- 
   Summary: Optimization by functions reordering.
   Product: gcc
   Version: 4.4.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40093



[Bug c/40093] Optimization by functions reordering.

2009-05-10 Thread vvv at ru dot ru


--- Comment #1 from vvv at ru dot ru  2009-05-10 16:43 ---
Created an attachment (id=17847)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17847&action=view)
Example direct/inverse calls

A simple example: RDTSC ticks for direct and inverse sequences of calls.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40093



[Bug middle-end/40093] Optimization by functions reordering.

2009-05-10 Thread vvv at ru dot ru


--- Comment #3 from vvv at ru dot ru  2009-05-10 18:08 ---
(In reply to comment #2)
 This should have been done already with cgraph order.

Unfortunately, I see the inverse order only with separate source files - inverse,
but not optimized.
Example:
// file order1.c 
#include <stdio.h>
main(int argc, char **argv)
{int i,j,k,l;
i=func1();
j=func2();
k=func3();
l=func4();
printf("%d %d %d %d\n",i,j,k,l);
}
=
// file order2.c 
int func1(){ return(F(4));}
int func2(){ return(F(3));}
int func3(){ return(F(2));}
int func4(){ return(F(1));}
=
// file order3.c 
int F(int x){ return(x);}

# gcc --version
gcc (GCC) 4.5.0 20090508 (experimental)
# gcc -o order order3.c order2.c order1.c -O2
# objdump -d order

00400520 F:
  400520:   89 f8   mov%edi,%eax
  400522:   c3  retq   

00400530 func4:
  400530:   bf 01 00 00 00  mov$0x1,%edi
  400535:   31 c0   xor%eax,%eax
  400537:   e9 e4 ff ff ff  jmpq   400520 F

00400540 func3:
  400540:   bf 02 00 00 00  mov$0x2,%edi
  400545:   31 c0   xor%eax,%eax
  400547:   e9 d4 ff ff ff  jmpq   400520 F

00400550 func2:
  400550:   bf 03 00 00 00  mov$0x3,%edi
  400555:   31 c0   xor%eax,%eax
  400557:   e9 c4 ff ff ff  jmpq   400520 F

00400560 func1:
  400560:   bf 04 00 00 00  mov$0x4,%edi
  400565:   31 c0   xor%eax,%eax
  400567:   e9 b4 ff ff ff  jmpq   400520 F

00400570 main:
  400570:   48 89 5c 24 e8  mov%rbx,-0x18(%rsp)
  400575:   48 89 6c 24 f0  mov%rbp,-0x10(%rsp)
  40057a:   31 c0   xor%eax,%eax
  40057c:   4c 89 64 24 f8  mov%r12,-0x8(%rsp)
  400581:   48 83 ec 18 sub$0x18,%rsp
  400585:   e8 d6 ff ff ff  callq  400560 func1
  40058a:   89 c3   mov%eax,%ebx
  40058c:   31 c0   xor%eax,%eax
  40058e:   e8 bd ff ff ff  callq  400550 func2
  400593:   89 c5   mov%eax,%ebp
  400595:   31 c0   xor%eax,%eax
  400597:   e8 a4 ff ff ff  callq  400540 func3
  40059c:   41 89 c4mov%eax,%r12d
  40059f:   31 c0   xor%eax,%eax
  4005a1:   e8 8a ff ff ff  callq  400530 func4
  4005a6:   44 89 e1mov%r12d,%ecx
  4005a9:   41 89 c0mov%eax,%r8d
  4005ac:   89 ea   mov%ebp,%edx
  4005ae:   89 de   mov%ebx,%esi
  4005b0:   48 8b 6c 24 08  mov0x8(%rsp),%rbp
  4005b5:   48 8b 1c 24 mov(%rsp),%rbx
  4005b9:   4c 8b 64 24 10  mov0x10(%rsp),%r12
  4005be:   bf bc 06 40 00  mov$0x4006bc,%edi
  4005c3:   31 c0   xor%eax,%eax
  4005c5:   48 83 c4 18 add$0x18,%rsp
  4005c9:   e9 42 fe ff ff  jmpq   400410 pri...@plt
=

But the optimal order would be:

00400520 main:
  400520:   48 89 5c 24 e8  mov%rbx,-0x18(%rsp)
  400525:   48 89 6c 24 f0  mov%rbp,-0x10(%rsp)
  40052a:   31 c0   xor%eax,%eax
  40052c:   4c 89 64 24 f8  mov%r12,-0x8(%rsp)
  400531:   48 83 ec 18 sub$0x18,%rsp
  400535:   e8 46 00 00 00  callq  400580 func1
  40053a:   89 c3   mov%eax,%ebx
  40053c:   31 c0   xor%eax,%eax
  40053e:   e8 4d 00 00 00  callq  400590 func2
  400543:   89 c5   mov%eax,%ebp
  400545:   31 c0   xor%eax,%eax
  400547:   e8 54 00 00 00  callq  4005a0 func3
  40054c:   41 89 c4mov%eax,%r12d
  40054f:   31 c0   xor%eax,%eax
  400551:   e8 5a 00 00 00  callq  4005b0 func4
  400556:   44 89 e1mov%r12d,%ecx
  400559:   41 89 c0mov%eax,%r8d
  40055c:   89 ea   mov%ebp,%edx
  40055e:   89 de   mov%ebx,%esi
  400560:   48 8b 6c 24 08  mov0x8(%rsp),%rbp
  400565:   48 8b 1c 24 mov(%rsp),%rbx
  400569:   4c 8b 64 24 10  mov0x10(%rsp),%r12
  40056e:   bf bc 06 40 00  mov$0x4006bc,%edi
  400573:   31 c0   xor%eax,%eax
  400575:   48 83 c4 18 add$0x18,%rsp
  400579:   e9 92 fe ff ff  jmpq   400410 pri...@plt

00400580 func1:
  400580:   bf 01 00 00 00  mov$0x1,%edi
  400585:   31 c0

[Bug middle-end/40093] Optimization by functions reordering.

2009-05-10 Thread vvv at ru dot ru


--- Comment #5 from vvv at ru dot ru  2009-05-10 18:20 ---
(In reply to comment #4)
 Well you need whole program to get the behavior which you want.

Yes. Of course, it's no problem for a small single-programmer project, but it is
a problem for big projects like the Linux kernel.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40093



[Bug c/40072] New: Nonoptimal code - CMOVxx %eax,%edi; mov %edi,%eax; retq

2009-05-08 Thread vvv at ru dot ru
Sometimes GCC generates code like this at the end of a function:

 cmovge %eax,%edi
 mov%edi,%eax
 retq   

but faster:

 cmovl %edi,%eax
 retq   

Example:

# cat test.c

#define MX 0
#define LIM 7

char char_char(char m)
{if(m>LIM) return(MX); return(m);}

char char_int(int m)
{if(m>LIM) return(MX); return(m);}

char char_uint(unsigned int m)
{if(m>LIM) return(MX); return(m);}

char char_long(long m)
{if(m>LIM) return(MX); return(m);}

char char_ulong(unsigned long m)
{if(m>LIM) return(MX); return(m);}


int int_char(char m)
{if(m>LIM) return(MX); return(m);}

int int_int(int m)
{if(m>LIM) return(MX); return(m);}  // Nonoptimal

int int_uint(unsigned int m)
{if(m>LIM) return(MX); return(m);}

int int_long(long m)
{if(m>LIM) return(MX); return(m);}

int int_ulong(unsigned long m)
{if(m>LIM) return(MX); return(m);}



unsigned int uint_char(char m)
{if(m>LIM) return(MX); return(m);}

unsigned int uint_int(int m)
{if(m>LIM) return(MX); return(m);}

unsigned int uint_uint(unsigned int m)  // Nonoptimal
{if(m>LIM) return(MX); return(m);}

unsigned int uint_long(long m)
{if(m>LIM) return(MX); return(m);}

unsigned int uint_ulong(unsigned long m)
{if(m>LIM) return(MX); return(m);}



long long_char(char m)
{if(m>LIM) return(MX); return(m);}

long long_int(int m)
{if(m>LIM) return(MX); return(m);}

long long_uint(unsigned int m)
{if(m>LIM) return(MX); return(m);}

long long_long(long m)  // Nonoptimal
{if(m>LIM) return(MX); return(m);}

long long_ulong(unsigned long m)
{if(m>LIM) return(MX); return(m);}



unsigned long ulong_char(char m)
{if(m>LIM) return(MX); return(m);}

unsigned long ulong_int(int m)
{if(m>LIM) return(MX); return(m);}

unsigned long ulong_uint(unsigned int m)
{if(m>LIM) return(MX); return(m);}

unsigned long ulong_long(long m)
{if(m>LIM) return(MX); return(m);}

unsigned long ulong_ulong(unsigned long m)  // Nonoptimal
{if(m>LIM) return(MX); return(m);}

# gcc -o t test.c -O2 -c
# objdump -d t

t: file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <char_char>:
   0:   89 f8                   mov    %edi,%eax
   2:   40 80 ff 08             cmp    $0x8,%dil
   6:   ba 00 00 00 00          mov    $0x0,%edx
   b:   0f 4d c2                cmovge %edx,%eax    <--- It's ok! Optimal
   e:   c3                      retq
   f:   90                      nop

skip...

0000000000000060 <int_int>:
  60:   83 ff 08                cmp    $0x8,%edi
  63:   b8 00 00 00 00          mov    $0x0,%eax
  68:   0f 4d f8                cmovge %eax,%edi    <--- Nonoptimal
  6b:   89 f8                   mov    %edi,%eax    <--- Nonoptimal
  6d:   c3                      retq
  6e:   66 90                   xchg   %ax,%ax

skip...

00000000000000c0 <uint_uint>:
  c0:   83 ff 08                cmp    $0x8,%edi
  c3:   b8 00 00 00 00          mov    $0x0,%eax
  c8:   0f 43 f8                cmovae %eax,%edi    <--- Nonoptimal
  cb:   89 f8                   mov    %edi,%eax    <--- Nonoptimal
  cd:   c3                      retq
  ce:   66 90                   xchg   %ax,%ax

skip...

0000000000000120 <long_long>:
 120:   48 83 ff 08             cmp    $0x8,%rdi
 124:   b8 00 00 00 00          mov    $0x0,%eax
 129:   48 0f 4d f8             cmovge %rax,%rdi    <--- Nonoptimal
 12d:   48 89 f8                mov    %rdi,%rax    <--- Nonoptimal
 130:   c3                      retq

skip...

0000000000000190 <ulong_ulong>:
 190:   48 83 ff 08             cmp    $0x8,%rdi
 194:   b8 00 00 00 00          mov    $0x0,%eax
 199:   48 0f 43 f8             cmovae %rax,%rdi    <--- Nonoptimal
 19d:   48 89 f8                mov    %rdi,%rax    <--- Nonoptimal
 1a0:   c3                      retq
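
For comparison, int_int could end with a single conditional move into the return
register (a sketch of the suggested form, not actual GCC output):

int_int:
        cmp    $0x8,%edi
        mov    $0x0,%eax        # eax = MX
        cmovl  %edi,%eax        # m < 8: return m, otherwise keep MX
        retq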


-- 
   Summary: Nonoptimal code -  CMOVxx %eax,%edi; mov%edi,%eax;
retq
   Product: gcc
   Version: 4.4.0
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40072



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread vvv at ru dot ru


--- Comment #11 from vvv at ru dot ru  2009-04-29 07:46 ---
(In reply to comment #8)
 From config/i386/i386.c:
 /* AMD Athlon works faster
when RET is not destination of conditional jump or directly preceded
by other jump instruction.  We avoid the penalty by inserting NOP just
before the RET instructions in such cases.  */
 static void
 ix86_pad_returns (void)
 ...

But I am using a Core 2 Duo.
Why do we see a multi-byte NOP, not a single-byte NOP?
Why does the number of NOPs change if the line u = F(u)*3+1; is changed to
u = F(u)*4+1; or u = F(u);?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread vvv at ru dot ru


--- Comment #12 from vvv at ru dot ru  2009-04-29 07:55 ---
(In reply to comment #9)
 So that explains it, Use -Os or attribute cold if you want NOPs to be gone.

But my measurements on a Core 2 Duo P8600 show that

push %ebp
mov  %esp,%ebp
leave
ret

is _faster_ than

push %ebp
mov  %esp,%ebp
leave
xchg %ax,%ax
ret


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-29 Thread vvv at ru dot ru


--- Comment #15 from vvv at ru dot ru  2009-04-29 19:16 ---
One more example: a 5-byte NOP between leaveq and retq.

# cat test.c

void wait_for_enter()
{
int u = getchar();
while (!u)
u = getchar()-13;
}
main()
{
wait_for_enter();
}

# gcc -o t.out test.c -O2 -march=core2 -fno-omit-frame-pointer
# objdump -d t.out
...
00400540 wait_for_enter:
  400540:   55  push   %rbp
  400541:   31 c0   xor%eax,%eax
  400543:   48 89 e5mov%rsp,%rbp
  400546:   e8 f5 fe ff ff  callq  400440 getc...@plt
  40054b:   85 c0   test   %eax,%eax
  40054d:   75 13   jne400562 wait_for_enter+0x22
  40054f:   90  nop
  400550:   31 c0   xor%eax,%eax
  400552:   e8 e9 fe ff ff  callq  400440 getc...@plt
  400557:   83 f8 0dcmp$0xd,%eax
  40055a:   66 0f 1f 44 00 00   nopw   0x0(%rax,%rax,1)
  400560:   74 ee   je 400550 wait_for_enter+0x10
  400562:   c9  leaveq 
  400563:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)   <-- NONOPTIMAL!
  400568:   c3  retq   
  400569:   0f 1f 80 00 00 00 00nopl   0x0(%rax)

00400570 main:
  400570:   55  push   %rbp
  400571:   31 c0   xor%eax,%eax
  400573:   48 89 e5mov%rsp,%rbp
  400576:   e8 c5 ff ff ff  callq  400540 wait_for_enter
  40057b:   c9  leaveq 
  40057c:   c3  retq   
  40057d:   90  nop
  40057e:   90  nop
  40057f:   90  nop

So the bug is unresolved.


-- 

vvv at ru dot ru changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru
Sometimes we can see a 2-byte NOP (xchg %ax,%ax) between leaveq and retq.
IMHO, it is better to remove the xchg %ax,%ax.

Examples from Kernel 2.6.29.1:

 gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]
 objdump vmlinux
...
804262e0 set_blitting_type:
804262e0:   55  push   %rbp
804262e1:   0f b7 07movzwl (%rdi),%eax
804262e4:   4c 8b 86 d0 03 00 00mov0x3d0(%rsi),%r8
804262eb:   48 c1 e0 07 shl$0x7,%rax
804262ef:   48 89 e5mov%rsp,%rbp
804262f2:   48 05 40 1f 9c 80   add$0x809c1f40,%rax
804262f8:   49 89 80 90 01 00 00mov%rax,0x190(%r8)
804262ff:   8b 46 04mov0x4(%rsi),%eax
80426302:   89 c1   mov%eax,%ecx
80426304:   81 e1 00 00 02 00   and$0x2,%ecx
8042630a:   75 2c   jne80426338
set_blitting_type+0x58
8042630c:   48 8b 86 d0 03 00 00mov0x3d0(%rsi),%rax
80426313:   4c 89 c7mov%r8,%rdi
80426316:   48 8b 90 90 01 00 00mov0x190(%rax),%rdx
8042631d:   8b 52 1cmov0x1c(%rdx),%edx
80426320:   83 fa 03cmp$0x3,%edx
80426323:   0f 4e cacmovle %edx,%ecx
80426326:   89 88 b0 01 00 00   mov%ecx,0x1b0(%rax)
8042632c:   e8 2f 4d 00 00  callq  8042b060
fbcon_set_bitops
80426331:   c9  leaveq 
80426332:   c3  retq   
80426333:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
80426338:   e8 23 61 00 00  callq  8042c460
fbcon_set_tileops
8042633d:   c9  leaveq 
8042633e:   66 90   xchg   %ax,%ax
80426340:   c3  retq   

...
...

8042b060 fbcon_set_bitops:
8042b060:   55  push   %rbp
8042b061:   48 c7 07 d0 ad 42 80movq  
$0x8042add0,(%rdi)
8042b068:   8b 87 b0 01 00 00   mov0x1b0(%rdi),%eax
8042b06e:   48 89 e5mov%rsp,%rbp
8042b071:   48 c7 47 08 30 ae 42movq  
$0x8042ae30,0x8(%rdi)
8042b078:   80 
8042b079:   48 c7 47 10 c0 b7 42movq  
$0x8042b7c0,0x10(%rdi)
8042b080:   80 
8042b081:   48 c7 47 18 10 af 42movq  
$0x8042af10,0x18(%rdi)
8042b088:   80 
8042b089:   48 c7 47 20 10 b1 42movq  
$0x8042b110,0x20(%rdi)
8042b090:   80 
8042b091:   48 c7 47 28 c0 b0 42movq  
$0x8042b0c0,0x28(%rdi)
8042b098:   80 
8042b099:   48 c7 47 30 00 00 00movq   $0x0,0x30(%rdi)
8042b0a0:   00 
8042b0a1:   85 c0   test   %eax,%eax
8042b0a3:   75 0b   jne8042b0b0
fbcon_set_bitops+0x50
8042b0a5:   c9  leaveq 
8042b0a6:   c3  retq   
8042b0a7:   66 0f 1f 84 00 00 00nopw   0x0(%rax,%rax,1)
8042b0ae:   00 00 
8042b0b0:   e8 4b 15 00 00  callq  8042c600
fbcon_set_rotate
8042b0b5:   c9  leaveq 
8042b0b6:   66 90   xchg   %ax,%ax
8042b0b8:   c3  retq


-- 
   Summary: Nonoptimal code - leaveq; xchg   %ax,%ax; retq
   Product: gcc
   Version: 4.3.2
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru


--- Comment #2 from vvv at ru dot ru  2009-04-28 17:04 ---
Created an attachment (id=17776)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view)
Source file from Linux Kernel 2.6.29.1

See static void set_blitting_type


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru


--- Comment #3 from vvv at ru dot ru  2009-04-28 17:10 ---
Additional examples from Linux Kernel 2.6.29.1:
(Note: the conditional statement at the end of all functions!)
=
linux/drivers/video/console/bitblit.c

void fbcon_set_bitops(struct fbcon_ops *ops)
{
ops->bmove = bit_bmove;
ops->clear = bit_clear;
ops->putcs = bit_putcs;
ops->clear_margins = bit_clear_margins;
ops->cursor = bit_cursor;
ops->update_start = bit_update_start;
ops->rotate_font = NULL;

if (ops->rotate)
fbcon_set_rotate(ops);
}



8020a5e0 disable_TSC:
8020a5e0:   55  push   %rbp
8020a5e1:   bf 01 00 00 00  mov$0x1,%edi
8020a5e6:   48 89 e5mov%rsp,%rbp
8020a5e9:   e8 c2 fd 35 00  callq  8056a3b0
add_preempt_count
8020a5ee:   65 48 8b 04 25 10 00mov%gs:0x10,%rax
8020a5f5:   00 00 
8020a5f7:   48 2d c8 1f 00 00   sub$0x1fc8,%rax
8020a5fd:   f0 0f ba 28 10  lock btsl $0x10,(%rax)
8020a602:   19 d2   sbb%edx,%edx
8020a604:   85 d2   test   %edx,%edx
8020a606:   75 0a   jne8020a612
disable_TSC+0x32
8020a608:   0f 20 e0mov%cr4,%rax
8020a60b:   48 83 c8 04 or $0x4,%rax
8020a60f:   0f 22 e0mov%rax,%cr4
8020a612:   bf 01 00 00 00  mov$0x1,%edi
8020a617:   e8 e4 fc 35 00  callq  8056a300
sub_preempt_count
8020a61c:   65 48 8b 04 25 10 00mov%gs:0x10,%rax
8020a623:   00 00 
8020a625:   f6 80 38 e0 ff ff 08testb  $0x8,-0x1fc8(%rax)
8020a62c:   75 02   jne8020a630
disable_TSC+0x50
8020a62e:   c9  leaveq 
8020a62f:   c3  retq   
8020a630:   e8 2b 99 35 00  callq  80563f60
preempt_schedule
8020a635:   c9  leaveq 
8020a636:   66 90   xchg   %ax,%ax
8020a638:   c3  retq   

==
/arch/x86/kernel/io_delay.c

void native_io_delay(void)
{
switch (io_delay_type) {
default:
case CONFIG_IO_DELAY_TYPE_0X80:
asm volatile ("outb %al, $0x80");
break;
case CONFIG_IO_DELAY_TYPE_0XED:
asm volatile ("outb %al, $0xed");
break;
case CONFIG_IO_DELAY_TYPE_UDELAY:
/*
 * 2 usecs is an upper-bound for the outb delay but
 * note that udelay doesn't have the bus-level
 * side-effects that outb does, nor does udelay() have
 * precise timings during very early bootup (the delays
 * are shorter until calibrated):
 */
udelay(2);
case CONFIG_IO_DELAY_TYPE_NONE:
break;
}
}
EXPORT_SYMBOL(native_io_delay);

802131e0 native_io_delay:
802131e0:   55  push   %rbp
802131e1:   8b 05 3d b3 54 00   mov0x54b33d(%rip),%eax 
  # 8075e524 io_delay_type
802131e7:   48 89 e5mov%rsp,%rbp
802131ea:   83 f8 02cmp$0x2,%eax
802131ed:   74 29   je 80213218
native_io_delay+0x38
802131ef:   83 f8 03cmp$0x3,%eax
802131f2:   74 06   je 802131fa
native_io_delay+0x1a
802131f4:   ff c8   dec%eax
802131f6:   74 10   je 80213208
native_io_delay+0x28
802131f8:   e6 80   out%al,$0x80
802131fa:   c9  leaveq 
802131fb:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
80213200:   c3  retq   
80213201:   0f 1f 80 00 00 00 00nopl   0x0(%rax)
80213208:   e6 ed   out%al,$0xed
8021320a:   c9  leaveq 
8021320b:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
80213210:   c3  retq   
80213211:   0f 1f 80 00 00 00 00nopl   0x0(%rax)
80213218:   bf 8e 21 00 00  mov$0x218e,%edi
8021321d:   0f 1f 00nopl   (%rax)
80213220:   e8 fb ac 1e 00  callq  803fdf20
__const_udelay
80213225:   c9  leaveq 
80213226:   66 90   xchg   %ax,%ax
80213228:   c3

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru


--- Comment #4 from vvv at ru dot ru  2009-04-28 17:15 ---
Created an attachment (id=1)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=1&action=view)
Simple example from Linux

See two functions:
static void pre_schedule_rt
static void switched_from_rt


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2009-04-28 Thread vvv at ru dot ru


--- Comment #6 from vvv at ru dot ru  2009-04-28 21:18 ---
Let's compile file test.c
//#file test.c

extern int F(int m);

void func(int x)
{
int u = F(x);
while (u)
u = F(u)*3+1;
}


# gcc -o t.out test.c -c -O2
# objdump -d t.out

t.out: file format elf64-x86-64


Disassembly of section .text:

 func:
   0:   48 83 ec 08 sub$0x8,%rsp
   4:   e8 00 00 00 00  callq  9 func+0x9
   9:   85 c0   test   %eax,%eax
   b:   89 c7   mov%eax,%edi
   d:   74 0e   je 1d func+0x1d
   f:   90  nop
  10:   e8 00 00 00 00  callq  15 func+0x15
  15:   8d 7c 40 01 lea0x1(%rax,%rax,2),%edi
  19:   85 ff   test   %edi,%edi
  1b:   75 f3   jne10 func+0x10
  1d:   48 83 c4 08 add$0x8,%rsp
  21:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)    <--- nonoptimal
  28:   c3  retq   


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942



[Bug c/39549] New: Nonoptimal byte load. mov (%rdi),%al better than movzbl (%rdi),%eax

2009-03-24 Thread vvv at ru dot ru
 gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]

 cat test.c
// file test.c One byte transfer

void f(char *a,char *b){
*b=*a;
}

void F(char *a,char *b){
asm volatile("mov (%rdi),%al\nmov %al,(%rsi)");
}
...

 gcc -g -otest test.c -O2 -mtune=core2
 objdump -d test

004004f0 f:
  4004f0:   0f b6 07movzbl (%rdi),%eax
  4004f3:   88 06   mov%al,(%rsi)
  4004f5:   c3  retq   
  4004f6:   66 2e 0f 1f 84 00 00nopw   %cs:0x0(%rax,%rax,1)
  4004fd:   00 00 00 

00400500 F:
  400500:   8a 07   mov(%rdi),%al
  400502:   88 06   mov%al,(%rsi)
  400504:   c3  retq   

GCC uses movzbl (%rdi),%eax, but it is better to use mov (%rdi),%al, because the
latter instruction is 1 byte shorter. Execution time is the same (at least on
Core 2 Duo and Core 2 Solo).

Probably this is a result of Intel's recommendation to use movz to avoid a
partial register stall. But the smaller instruction saves fetch bandwidth... and

Quote from: Intel® 64 and IA-32 Architectures Optimization Reference Manual
248966. 3.5.2.3 Partial Register Stalls
The delay of a partial register stall is small in processors based on Intel
Core and
NetBurst microarchitectures, and in Pentium M processor (with CPUID signature
family 6, model 13), Intel Core Solo, and Intel Core Duo processors. Pentium M
processors (CPUID signature with family 6, model 9) and the P6 family incur a
large
penalty.
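
To see the tradeoff, the pattern the recommendation is about looks roughly like
this (a sketch; the second instruction is arbitrary, only a partial write
followed by a full-register read matters):

        mov    (%rdi),%al        # writes only the low 8 bits of %rax
        add    %eax,%edx         # later read of the full %eax -> possible partial-register stall

        movzbl (%rdi),%eax       # writes the whole register, avoiding the stall, at the cost of 1 byte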


-- 
   Summary: Nonoptimal byte load. mov (%rdi),%al better than movzbl
(%rdi),%eax
   Product: gcc
   Version: 4.3.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39549



[Bug c/39520] New: Empty function translated to repz retq.

2009-03-22 Thread vvv at ru dot ru
 gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]

 cat test.c
// file test.c Call to empty function
void f(){ }
int main(){ return(0);}

 gcc -o test test.c -O2
 objdump -d test

004004f0 f:
  4004f0:   f3 c3   repz retq 

Why rep ret? Why not ret?


-- 
   Summary: Empty function translated to repz retq.
   Product: gcc
   Version: 4.3.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39520